Monday, May 25, 2015

Classifying Types of Players on the PGA Tour with Clustering Methods
Classifying athletes is a relatively intuitive task for team sports. In basketball, there are the traditional point and shooting guards, small and power forwards and the center. In soccer, there are goalkeepers, defenders, midfielders and forwards.

But how about individual sports such as golf? In the last several years, the PGA Tour has seen emerging players become more athletic and nimble, hitting the golf ball like never before. Stars such as Bubba Watson, Rickie Fowler and Rory McIlroy have revolutionized the game by hitting 300+ yard drives, which leave them short iron approaches and increase their chances of scoring low.

That said, the veterans are not going anywhere. Jim Furyk, Ernie Els and Jimmy Walker rely on accurate play and shrewd course management to keep up with the young guns. This creates a competitive tour made up of players with distinctly contrasting styles.

Can we quantitatively capture these different styles? Using 49 metrics that cover driving, approaches from fairways and roughs, scrambling, putting and more, we seek to classify 2014 PGA Tour players using two clustering methods - K-Means Clustering and Hierarchical Clustering.

K-Means Clustering

K-means clustering is a type of unsupervised learning algorithm, one of a family of methods that find structure in unlabeled data. K-means clustering seeks to split the data into K groups such that the total variance within the K groups is minimized. In this case, we are seeking to split PGA Tour players into groups based on metrics that indicate their playing styles, without having any labeled data about what kind of style each player has.
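To make the procedure concrete, here is a minimal sketch of K-means in plain Python. It is not the code behind this analysis: the six players and their two metrics are made up, each metric is rescaled to mean 0 and standard deviation 1 (an assumption, so that yards do not swamp putts), and the starting centers are picked deterministically rather than at random, as common implementations do.

```python
import math

def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100):
    """Alternate between assigning points to the nearest center and
    moving each center to the mean of its points, until nothing changes."""
    # Simplified, deterministic start: first point, then repeatedly the
    # point farthest from the centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:          # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:                   # keep the old center if a group is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical players: (driving distance in yards, putts per round)
names = ["A", "B", "C", "D", "E", "F"]
stats = [[310, 29.5], [308, 29.8], [312, 29.6],
         [280, 28.1], [282, 28.0], [278, 28.3]]

labels, centers = kmeans(standardize(stats), k=2)
for name, label in zip(names, labels):
    print(name, "-> group", label)
```

With real data the same call would take one row of 49 standardized metrics per player and K = 4.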

A key consideration is determining the number of clusters (K). Usually, this is done by plotting the within-group sum-of-squares (WSS) against the number of clusters. As the number of clusters increases, WSS should decrease. It will decrease a lot if the extra cluster carves out a tightly knit group, and only a little if it does not. As a result, we choose K such that expanding the number of clusters to K+1 would change WSS only slightly. Graphically, this is the point where the slope of the plot flattens out. Looking at the plot below, this occurs at K = 4, so we grouped the PGA Tour players into four groups.
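The WSS calculation itself is short. The sketch below is an illustration only: the nine 2-D points are invented, and the groupings for K = 1 to 4 are written out by hand to stand in for what K-means would return at each K.

```python
def wss(points, labels):
    """Within-group sum of squares: the squared distance from every
    point to the mean of its own group, summed over all points."""
    total = 0.0
    for group in set(labels):
        members = [p for p, l in zip(points, labels) if l == group]
        center = [sum(col) / len(members) for col in zip(*members)]
        total += sum(sum((x - c) ** 2 for x, c in zip(p, center))
                     for p in members)
    return total

# Invented data with three tight groups of three points each.
points = [(0, 0), (0, 1), (1, 0),
          (10, 0), (10, 1), (11, 0),
          (5, 10), (5, 11), (6, 10)]

# Hand-written groupings standing in for K-means output (an assumption).
labels_by_k = {
    1: [0, 0, 0, 0, 0, 0, 0, 0, 0],
    2: [0, 0, 0, 0, 0, 0, 1, 1, 1],
    3: [0, 0, 0, 1, 1, 1, 2, 2, 2],
    4: [0, 0, 0, 1, 1, 1, 2, 2, 3],
}

wss_by_k = {k: wss(points, labels) for k, labels in labels_by_k.items()}
for k in sorted(wss_by_k):
    print(k, round(wss_by_k[k], 2))

# The drop in WSS gained by moving from K to K+1 clusters.
drops = {k: wss_by_k[k] - wss_by_k[k + 1] for k in (1, 2, 3)}
```

In this toy data the drops are large up to K = 3 and tiny afterwards, so the curve flattens at K = 3. The same reasoning applied to the player data pointed to K = 4.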

The heat map provides an overview of the characteristics of the four groups we clustered using K-Means. Red represents below-average values, and blue represents above-average values. Reading across these values, we can describe the four groups as follows:

  1. The Elite Group - These are the players that have the complete game. Most are young players who not only have amazing power, as shown by their superior Par-5 performance (dark red, where a below-average score is a good thing) and driving distance (dark blue), but also an elite short game (most of the putting and scrambling metrics are blue for this group). Not surprisingly, this group includes Bubba Watson, Rory McIlroy, Adam Scott and Dustin Johnson.
  2. The Average Group - These are the players that are mediocre and steady in all statistical categories. There are no extreme colors throughout this entire row, except for blues in Driving Accuracy and Consecutive Fairways. This is a testament to the consistency of players in the group. Examples of players include Brandt Snedeker, Bo Van Pelt, Graeme McDowell, and Henrik Stenson.
  3. The “I Make it Up On the Greens” Group - These are players who do not perform too well on the tour, mainly due to their poor driving and approach abilities (they are red among the driving and approach metrics). Fortunately, they make it up with their ability on and around the greens, as their scrambling and putting performances are all in the blue. Prominent players here include Ian Poulter, Lee Westwood, and Ernie Els.
  4. The “I Suck on the Greens” Group - These are players who do fine on the fairways but poorly on the greens. They are slightly red in fairways hit, but when it comes to putting and scrambling metrics (putts/round, scrambling, putts 5/10 ft) these players do poorly, which explains why this group performs the worst. Louis Oosthuizen, Martin Laird and Davis Love III belong to this group in 2014.
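One plausible way to get the numbers behind such a heat map is to average each metric's z-score within each group, so that negative cells would be drawn red and positive cells blue. The post does not say how its heat map was computed, so treat this as an assumption; the players, metrics and group labels below are made up.

```python
def cluster_profile(rows, labels):
    """For every group, the mean of each metric expressed in standard
    deviations above (+) or below (-) the all-player average."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [(sum((v - m) ** 2 for v in c) / len(c)) ** 0.5 or 1.0
            for c, m in zip(cols, means)]
    profile = {}
    for group in sorted(set(labels)):
        members = [r for r, l in zip(rows, labels) if l == group]
        profile[group] = [
            (sum(r[j] for r in members) / len(members) - means[j]) / stds[j]
            for j in range(len(cols))
        ]
    return profile

# Hypothetical metrics and players, with group labels assumed to come
# from an earlier clustering step.
metrics = ["driving distance", "putts per round"]
rows = [[310, 29.5], [308, 29.8], [312, 29.6],
        [280, 28.1], [282, 28.0], [278, 28.3]]
labels = [0, 0, 0, 1, 1, 1]

profile = cluster_profile(rows, labels)
for group, values in profile.items():
    cells = ["blue" if v > 0 else "red" for v in values]
    print(group, dict(zip(metrics, cells)))
```

Whether a red or blue cell is good news depends on the metric: above-average driving distance helps, while above-average putts per round hurts.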

Hierarchical Clustering

Instead of splitting the dataset from the top down, hierarchical clustering takes a bottom-up approach. Each data point starts as its own cluster and merges with its nearest neighbor; the resulting clusters keep merging with their closest clusters until all the data sits in one cluster. This creates a tree that maps how different players group together.

There are different ways of determining the “closest” cluster. The “Complete” method takes the distance between two clusters to be the largest distance between any pair of elements, one from each cluster. The “Ward” method measures the increase in the within-cluster sum-of-squares that merging two clusters would cause. The “McQuitty” method sets the distance from a newly merged cluster to any other cluster to be the average of the distances from the two pre-merge clusters to that cluster. We found that the Ward method gave us the most interesting, interpretable results.
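The bottom-up procedure and the three rules can be sketched together in plain Python. This is an illustration under assumptions, not the code used for the analysis: the five points are invented, distances are plain Euclidean, and each rule is written as an update that gives the distance from a freshly merged cluster to every other cluster. The Ward update shown is the standard one for unsquared Euclidean distances, whose value equals the square root of twice the increase in within-cluster sum-of-squares.

```python
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Each rule returns the distance from the merged cluster (i + j) to another
# cluster k, given the old distances and the three cluster sizes.
def complete(d_ki, d_kj, d_ij, n_i, n_j, n_k):
    return max(d_ki, d_kj)

def mcquitty(d_ki, d_kj, d_ij, n_i, n_j, n_k):
    return (d_ki + d_kj) / 2

def ward(d_ki, d_kj, d_ij, n_i, n_j, n_k):
    total = n_i + n_j + n_k
    sq = ((n_i + n_k) * d_ki ** 2 + (n_j + n_k) * d_kj ** 2
          - n_k * d_ij ** 2) / total
    return math.sqrt(max(sq, 0.0))        # guard against rounding below zero

def agglomerate(points, update):
    """Merge the two closest clusters until one is left.
    Returns a list of (sorted member indices, merge distance)."""
    clusters = {i: [i] for i in range(len(points))}   # id -> member indices
    dist = {(i, j): euclid(points[i], points[j])
            for i in clusters for j in clusters if i < j}
    history, next_id = [], len(points)
    while len(clusters) > 1:
        (i, j), d_ij = min(dist.items(), key=lambda kv: kv[1])
        n_i, n_j = len(clusters[i]), len(clusters[j])
        merged = clusters.pop(i) + clusters.pop(j)
        for k in clusters:                # refresh distances to the new cluster
            d_ki = dist.pop((min(i, k), max(i, k)))
            d_kj = dist.pop((min(j, k), max(j, k)))
            dist[(k, next_id)] = update(d_ki, d_kj, d_ij,
                                        n_i, n_j, len(clusters[k]))
        del dist[(i, j)]
        clusters[next_id] = merged
        history.append((sorted(merged), d_ij))
        next_id += 1
    return history

# Invented 2-D points standing in for players.
pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (10.0, 10.0)]
histories = {rule.__name__: agglomerate(pts, rule)
             for rule in (complete, mcquitty, ward)}
for name, history in histories.items():
    print(name, [members for members, _ in history])
```

On data this clean all three rules merge in the same order and differ only in the heights they report; on 49 noisy metrics the trees can differ substantially, which is why the choice of method matters.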

To be consistent with the 4 groups we found through K-means clustering, we examined the 3 highest layers of the hierarchical clustering trees to get 4 subgroups for each hierarchical clustering method.
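Looking only at the top three splits of a tree is the same as undoing its last three merges. A minimal sketch, assuming the tree is stored as a list of merges ordered from first to last, where merge number `s` (counting from 0) creates cluster id `n_points + s`; the six-player merge list below is hypothetical.

```python
def cut_tree(n_points, merges, k):
    """Replay only the first n_points - k merges, leaving k groups.
    Returns one group label per point."""
    members = {i: [i] for i in range(n_points)}
    for step, (a, b) in enumerate(merges[:n_points - k]):
        members[n_points + step] = members.pop(a) + members.pop(b)
    labels = [None] * n_points
    # Number the groups by their lowest member index, for stable output.
    for group, ids in enumerate(sorted(members.values(), key=min)):
        for i in ids:
            labels[i] = group
    return labels

# Hypothetical tree over six players (ids 0-5):
# 6 = {0, 1}, 7 = {2, 3}, 8 = {4, 5}, 9 = {0, 1, 2, 3}, 10 = everyone
merges = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]

four_groups = cut_tree(6, merges, k=4)
two_groups = cut_tree(6, merges, k=2)
print(four_groups)
print(two_groups)
```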

The Ward Method

The Ward method leads to a more balanced partitioning of players. That said, the green group in the heat map below is still the “residual” group, consisting of players without many standout features. The other three groups can be described as follows:

  1. The black group describes players that drive the ball short but accurately and have a solid game around the greens, as shown by strong scrambling metrics. Lee Westwood, Luke Donald and Ian Poulter belong to this partition.
  2. The blue group consists of poor performers that are average off the tee and lackluster on and around the green. They have poor scrambling percentages and, in particular, putt poorly from within 10 feet. This is similar to the “I-suck-on-the-greens” group.
  3. The red group contains elite players that have very strong scoring performances. In particular, their approach game is stellar, with iron shots that finish close to the hole. Top players like Rory McIlroy, Jason Day and Jordan Spieth belong here.

So what do all these clustering results say?

Though the sizes of these groups differ between clustering methods, there seem to be three groups of players that are consistently identified:

  1. The elite stars - The young guns that have established themselves at the top of the game by combining powerful tee shots with a superior short game.
  2. The consistent lads - The established veterans who don’t hit the ball far, but still perform well on the tour with a fine-tuned short game.
  3. The short game newbies - The players that hit their tee shots well, but have performed poorly in 2014 due to their inability to approach and handle the greens well.

It will be interesting to see whether these clusters remain the same at the end of the 2015 season. It is also worth noting that Tiger Woods was not part of our analysis due to his injuries in 2014. Which group would he belong to in 2015, given his subpar performances so far? This remains a question to be answered at the end of the year.
