Tuesday, May 26, 2015

Two Graphs About the Future of Startup Funding

Since the turn of the millennium, startups have taken the world by storm. In the past, building a company required bound and committed investors. Nowadays, any college student with an idea can potentially become the CEO of a billion-dollar company. 

With this evolving landscape, we were most interested in discovering trends in venture funding. Knowing this information could help startup founders develop strategies to capture venture funding in the future. 

Using CrunchBase datasets obtained from GitHub, we address three questions. 

1. How Have the Industries Receiving Funding Changed Over Time?

Below, we map the top 14 funded industries by number of rounds of funding given. We do this for four different years, choosing 2000 as a starting point, 2001 to see the effects of the dot-com bubble burst, 2007 to see the effects of the financial crisis, and 2014 as our ending point.
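
As a rough illustration of the ranking step, here is a minimal sketch that counts funding rounds per industry for one year and keeps the top N. The record layout (`year`, `industry`), the function name `top_industries` and the sample rows are assumptions made for this example, not the CrunchBase schema or the code in our repository. Shares are computed against all rounds in that year, which is also an assumption.

```python
from collections import Counter


def top_industries(rounds, year, n=14):
    """Return [(industry, share_of_rounds)] for the n industries with the most rounds in a year."""
    counts = Counter(r["industry"] for r in rounds if r["year"] == year)
    total = sum(counts.values())
    # Share is measured against every round in that year, not only the top n.
    return [(industry, count / total) for industry, count in counts.most_common(n)]


# Hypothetical rows, one per funding round.
sample_rounds = [
    {"year": 2014, "industry": "Software"},
    {"year": 2014, "industry": "Software"},
    {"year": 2014, "industry": "Software"},
    {"year": 2014, "industry": "Biotechnology"},
    {"year": 2014, "industry": "Biotechnology"},
    {"year": 2014, "industry": "Mobile"},
    {"year": 2007, "industry": "Advertising"},
]

shares_2014 = top_industries(sample_rounds, 2014, n=2)
for industry, share in shares_2014:
    print(industry, round(share, 3))
```

Running the same function for 2000, 2001, 2007 and 2014 would give the four snapshots being compared.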

We see several trends between 2000 and 2014:
  1. The proportion of funding going to Biotechnology increases steadily between 2000 and 2014. In 2000, Biotechnology doesn't crack the top 14 industries, but by 2014 it is the second-largest sector, behind Software. 
  2. Despite the dot-com bubble bursting in 2000, Software's share has stayed at around 20% throughout the past 15 years. 
  3. Mobile startups gain a larger share over time, but this share does not grow as much as we expected (7% until 2007, 9% in 2014). A possible explanation is that mobile-focused startups do not monetize as well as other startups do, and thus attract less venture funding. 
  4. Health Care broke into the top 14 sectors around 2007. Startups are beginning to break into the healthcare industry due to disruptive legal reforms, an aging population needing more care, and the development of new health technology.
  5. The proportion of funding taken by Advertising startups declined between 2007 and 2014. This likely reflects the purported popping of the Ad-Tech Bubble and the fact that the industry is already dominated by two advertising giants: Google and Facebook.
2. Which States are Most Active in Funding Startups?
We visualize the average funding amount and number of companies for each state and funding round (Series A to E+) in the maps below.

Here are some noteworthy points:
  1. As expected, California and New York do well, receiving high funding amounts as well as rounds by many companies. They deserve their titles as start-up hubs.
  2. That said, start-ups can spawn anywhere. South Dakota, Iowa, and Kentucky each had just one round of Series B, C, and D funding, respectively, but the amounts were impressive.
  3. Illinois has an outstanding record of producing large-scale companies, with 18 rounds of Series D funding at staggering amounts. 
  4. Other start-up hotbeds include Florida, Texas and Massachusetts.
3. How does this help you?

Looking at these results, it appears that recent trends in startup funding include the growing popularity of industries that benefit people in tangible ways, such as Biotechnology and Clean Energy. However, on a short-term basis at least, it seems that there is a lot of stability in the funded industries - staples like Software, E-commerce, and Enterprise Software remain around the same year to year. Noticeable trends such as the popularity of Biotechnology are largely policy-based: healthcare reforms are opening up new doors for startup founders to step in and create innovation. 

The map shows that California is still at the forefront of startup funding, although New York is gradually building a reputation for venture capital activity. Other states, though producing a few high-value investments, don't come close to California or New York.

Code is included in: https://github.com/barbaraz/DataBucket/tree/master/Crunchbase_Project

Monday, May 25, 2015

Classifying Types of Players on the PGA Tour with Clustering Methods

Classifying athletes is a relatively intuitive task for team sports. In basketball, there are the traditional point and shooting guards, small and power forwards and the center. In soccer, there are goalkeepers, defenders, midfielders and forwards.

But how about individual sports such as golf? In the last several years, the PGA Tour has seen emerging players become more athletic and nimble, hitting the golf ball like never before. Stars such as Bubba Watson, Rickie Fowler and Rory McIlroy have revolutionized the game by hitting 300+ yard drives, allowing them to hit short iron approaches and increase their chances of scoring low.

That said, the veterans are not going anywhere. Jim Furyk, Ernie Els and Jimmy Walker utilize accurate play and shrewd course management to keep up with the young guns. This creates a competitive tour consisting of players with distinctly contrasting styles.

Can we quantitatively capture these different styles? Using 49 metrics from pgatour.com that cover driving, approaches from fairways and roughs, scrambling, putting and more, we seek to classify 2014 PGA Tour players using two clustering methods - K-Means Clustering and Hierarchical Clustering.

K-Means Clustering

K-means clustering is a type of unsupervised learning algorithm, one of a set of methods that extract structure from unlabeled data. K-means clustering seeks to partition the data into K groups such that the variance within each group is minimized. In this case, we seek to partition PGA Tour players into groups based on metrics that indicate their playing styles, without having labeled data about each player's style.

A key consideration is determining the number of clusters (K). Usually, this is done by plotting the within-group sum-of-squares (WSS) against the number of clusters. As the number of clusters increases, WSS should decrease. It will decrease more if the extra cluster isolates a tightly knit group, and less if it fails to do so. As a result, we choose K such that expanding the number of clusters to K+1 would change WSS only slightly. Graphically, this is where the slope of the plot flattens. Looking at the plot below, this occurs at K = 4, so we grouped the PGA Tour players into four groups.
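
To make the WSS-versus-K idea concrete, here is a small self-contained sketch. It is not the code behind our plot: the points are made-up 2-D data forming four tight blobs (the real analysis uses 49 metrics per player), and the farthest-first seeding is an assumption chosen only so the example is deterministic.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first_centers(points, k):
    """Deterministic seeding (an assumption for reproducibility): start at the
    first point, then repeatedly add the point farthest from the chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centers))
        )
    return centers


def kmeans_wss(points, k, iterations=25):
    """Run Lloyd's algorithm and return the within-group sum of squares."""
    centers = farthest_first_centers(points, k)
    for _ in range(iterations):
        # Assign every point to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            groups[nearest].append(p)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Hypothetical data: four tight blobs of five points each.
offsets = [(0, 0), (0.5, 0), (-0.5, 0), (0, 0.5), (0, -0.5)]
blob_centers = [(0, 0), (10, 0), (0, 10), (10, 10)]
points = [(cx + dx, cy + dy) for cx, cy in blob_centers for dx, dy in offsets]

wss_by_k = {k: kmeans_wss(points, k) for k in range(1, 7)}
for k, wss in wss_by_k.items():
    print(k, round(wss, 2))
```

On this toy data WSS falls sharply up to K = 4 and barely moves afterwards, which is the kind of flattening we look for in the plot.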

The heat map provides an overview of the characteristics of the four groups we clustered using K-Means. Red represents below average values, and blue represents above average values. By looking at these values, we can essentially describe the 4 groups as follows:

  1. The Elite Group - These are the players that have the complete game. Most of them are young players who have not only amazing power, as shown by their superior Par-5 performance (dark red) and driving distance (dark blue), but also an elite short game (most of the putting and scrambling metrics are blue for this group). Not surprisingly, this group includes Bubba Watson, Rory McIlroy, Adam Scott and Dustin Johnson.
  2. The Average Group - These are the players that are mediocre and steady in all statistical categories. There are no extreme colors throughout this entire row, except for blues in Driving Accuracy and Consecutive Fairways. This is a testament to the consistency of players in the group. Examples of players include Brandt Snedeker, Bo Van Pelt, Graeme McDowell, and Henrik Stenson.
  3. The “I Make it Up On the Greens” Group - These are players who do not perform too well on the tour, mainly due to their poor driving and approach abilities (they are red among the driving and approach metrics). Fortunately, they make it up with their ability on and around the greens, as their scrambling and putting performances are all in the blue. Prominent players here include Ian Poulter, Lee Westwood, and Ernie Els.
  4. The “I Suck on the Greens” Group - These are players who do fine on the fairways but poorly on the greens. They are slightly red in fairways hit, but when it comes to putting and scrambling metrics (putts/round, scrambling, putts 5/10 ft) these players do poorly, which explains why this group performs the worst. Louis Oosthuizen, Martin Laird and Davis Love III belong to this group in 2014.

Hierarchical Clustering

Instead of segregating the dataset from the top down, hierarchical clustering takes a bottom-up approach. Single data points merge with adjacent data points to form clusters, and these clusters then continue to merge with their closest clusters until all data is merged into one cluster. This creates a tree that maps how different players group together. 

There are different ways of determining the “closest” cluster. The “Complete” method uses the largest distance between any element of one cluster and any element of the other. The “Ward” method measures the increase in within-cluster sum-of-squares should two clusters merge. The “McQuitty” method defines the distance from a newly merged cluster to any other cluster as the average of the distances from the two pre-merge clusters to that cluster. We found that the Ward method gave us the most interesting, interpretable results.

To be consistent with the 4 groups we found through K-means clustering, we examined the 3 highest layers of the hierarchical clustering trees to get 4 subgroups for each hierarchical clustering method.
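
The bottom-up merging and the cut into 4 groups can be sketched as follows. This is a naive illustration of the Ward criterion on made-up 2-D points, not the code we ran: the function names and data are hypothetical, and stopping the merges once 4 clusters remain is used here as a stand-in for cutting the tree at 4 subgroups.

```python
def centroid(points, members):
    """Mean position of the points whose indices are listed in members."""
    dims = range(len(points[0]))
    return tuple(sum(points[i][d] for i in members) / len(members) for d in dims)


def ward_cost(points, a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(points, a), centroid(points, b)
    gap = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * gap


def ward_clusters(points, n_clusters):
    """Start with one cluster per point and merge the cheapest pair until
    only n_clusters remain. Returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        pairs = [
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        i, j = min(
            pairs, key=lambda ij: ward_cost(points, clusters[ij[0]], clusters[ij[1]])
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Hypothetical data: four tight blobs of five points each.
offsets = [(0, 0), (0.5, 0), (-0.5, 0), (0, 0.5), (0, -0.5)]
blob_centers = [(0, 0), (10, 0), (0, 10), (10, 10)]
points = [(cx + dx, cy + dy) for cx, cy in blob_centers for dx, dy in offsets]

groups = ward_clusters(points, 4)
print(sorted(sorted(g) for g in groups))
```

The pairwise search makes this far too slow for large datasets, but it shows where the linkage rule enters: swapping `ward_cost` for a different distance between clusters changes the method without touching the merge loop.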

The Ward Method

The Ward method leads to a more balanced partitioning of players. That said, the green group in the heat map below is still the “residual” group, consisting of players without many standout features. The other three groups can be described as follows:

  1. The black group describes players that drive the ball short and accurately, and have a solid game around the greens, as shown by strong scrambling metrics. Lee Westwood, Luke Donald and Ian Poulter belong to this partition.
  2. The blue group consists of poor performers that are average off the tee, and are lackluster around and on the green. They have poor scrambling percentages and, in particular, putt poorly within 10 feet. This is similar to the “I-suck-on-the-greens” group.
  3. The red group contains elite players that have very strong scoring performances. In particular, their approach game is stellar, with their iron shots finishing close to the hole. Top players like Rory McIlroy, Jason Day and Jordan Spieth belong here.

So what do all these clustering results say?

Though the sizes of these groups differ among different clustering methods, there seem to be three groups of players that are consistently identified:

  1. The elite stars - The young guns that have established themselves at the top of the game by combining powerful tee shots with a superior short game.
  2. The consistent lads - The established veterans who don’t hit the ball far, but still perform well on the tour with a fine-tuned short game.
  3. The short game newbies - The players that hit their tee shots well, but have performed poorly in 2014 due to their inability to approach and handle the greens well.

It will be interesting to see whether these clusters remain the same at the end of the 2015 season. Also interesting to note is that Tiger Woods was not part of our analysis due to his injuries in 2014. Which group would he belong to in 2015, given his subpar performances so far? This remains a question to be answered at the end of the year.

Tuesday, May 19, 2015

How likely are YOU to go to jail?

In the past month, we analyzed the performance of NBA players in clutch moments, soccer players before and after transfers, and tennis players when they serve or return first. In this post, we move away from sports and look into US incarceration rates. More specifically, we are interested in understanding how one’s name, gender and age may correlate with his or her probability of being in jail. Would an “Adam” right out of college be more likely to go to jail than an “Adam” that has been in the workforce for thirty years? Do names that hint towards a particular ethnicity, such as “Jose” or “Mohamed,” lead to a higher or lower incarceration rate? Using available jail and census data, we estimate the probability that a person with any name, gender and age will go to jail.

The Methodology

We decided to calculate two probabilities:
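  • Probability that one will go to jail given first name, age and gender: P(jail | first name, age, gender)
  • Probability that one will go to jail given last name, age and gender: P(jail | last name, age, gender)
To calculate these metrics, we utilize an intuitive probability rule, which states that the probability that two events A and B both occur equals the probability that A occurs given that B occurs, multiplied by the probability that B occurs: P(A and B) = P(A | B) × P(B).

Writing P(A and B) also as P(B | A) × P(A) and solving for the conditional probability yields a fraction known as Bayes’ Rule: P(A | B) = P(B | A) × P(A) / P(B).
Applying Bayes’ Rule to our two metrics puts quantities estimated from jail data in the numerator and the frequency of the name in the general population in the denominator.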

The numerator data was calculated using data on inmates held in Rhode Island, Mississippi and Orange County, totaling nearly 200,000 prisoners, as well as current data on the number of inmates in the United States (more than 2 million). The denominator data was calculated using census data and Social Security Administration data.

Notice that the denominator of the last name metric is not conditioned on age or gender. This is because we assume that last name frequencies are independent of gender and age. We also couldn’t find data that provided the conditioned probability, which further motivated this simplification.
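
A minimal sketch of how such a calculation could look, under one reading of the description above: apply Bayes’ Rule within each age and gender group, so that P(jail | name, age, gender) = P(name | jail, age, gender) × P(jail | age, gender) / P(name | age, gender). For last names, the denominator would simply be P(last name), following the independence assumption. The function name, argument names and all numbers below are hypothetical and are not taken from our datasets.

```python
def jail_probability(inmates_with_name, inmates_in_group, p_jail_given_group, p_name_given_group):
    """Bayes' Rule applied inside one age/gender group.

    inmates_with_name / inmates_in_group estimates P(name | jail, group);
    p_jail_given_group is P(jail | group);
    p_name_given_group is P(name | group), or P(name) for last names.
    """
    p_name_given_jail = inmates_with_name / inmates_in_group
    return p_name_given_jail * p_jail_given_group / p_name_given_group


# Hypothetical inputs: 30 of 10,000 sampled inmates in the group carry the name,
# 2% of the group is in jail, and 0.4% of the group's population has the name.
example = jail_probability(30, 10_000, 0.02, 0.004)
print(round(example, 4))

# The same inmate count with a name ten times rarer in the population.
rare_name = jail_probability(30, 10_000, 0.02, 0.0004)
print(round(rare_name, 4))
```

Because the name frequency sits in the denominator, a very rare name with even a few inmates produces a large estimate.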

Our Findings on First Name

Our findings indicated certain first name probabilities that were surprisingly high. Upon inspection, we realized that most of these outliers are very uncommon names that occur in our inmate dataset only a handful of times. For example, “Khari”, “Anibal” and “Rudolfo” had the highest incarceration probability for males in age groups 20-29, 40-49 and 60-69 respectively in our data. These names are rare in the general population, so the denominator in our method is small and the estimated probability of jail given those names is high. That said, there are some common names that top our list. “Frederick” and “Antonio” top the 10-19 and 80-89 male age groups, and “Jeanine” tops our 70-79 female age group. 

We found it more interesting to take the 100 most popular first names in each age category and rank them by their imprisonment rate. The 20 names with the highest incarceration rates among these 100, for each age group, can be found in the table below:

Our Findings on Last Name
The top 20 last names with highest rates for each age group can be found in the table below:



Again, we face the problem that our data has a few occurrences of uncommon last names. Thus, the estimated probability of going to jail given those last names is especially high, since the probability of having those last names makes up the denominator in our method. For last names, however, we did not want to constrain ourselves to the 100 most common last names because we felt that it would detract from the diversity of the names. Perhaps for future studies, it would be best to constrain ourselves to the top 1,000 names or so, in order to capture diversity but also limit the denominator effects of extremely uncommon last names. 

Probability of Going to Jail
First Name:
Last Name:
Gender: M F

*Note: Some browsers do not work well with this form. Refreshing the page, then pressing the Calculate button again should work.

In case the form above does not work, here are the links to the spreadsheets with the jail probabilities in decimal form (0.1 = 10%). The columns are ages 10-19, 20-29, 30-39, etc. with the last category being age 100+.

Male First Name
Male Last Name
Female First Name
Female Last Name

Some of the scraping scripts are on Github

Thursday, May 7, 2015

How do soccer players perform after transferring to different teams and leagues?

High-profile transfers are common in European football - every year, record-breaking transfer fees are offered between teams for the best players in the world. Cristiano Ronaldo, Gareth Bale, Luis Suarez are just three of many superstars in the past few years that have made moves that stunned the football world.

But are these sky-high transfer fees justified for a cross-league move, with no concrete performance metrics in that league to back it up? If the player cannot adjust to the style of play, the rigor, or even the language of the new team, the transfer could render an otherwise superstar player ineffective. After all, players like Juan Sebastian Veron and Karim Benzema have made bold cross-league transfers, only to never recapture the form they once had.

By collecting data from soccerway.com, we examine how players perform before and after transferring from teams in the Premier League, Bundesliga, Serie A and La Liga. Through this analysis, we can see patterns in players’ performance as they move from one specific league to another, and use that information to make smarter decisions about cross-league transfers in the future.

Goals Per 90 Minutes

The first metric we use to measure player performance before and after a transfer is goals per 90 minutes. We average this metric across all players moving from one league to another, before and after the transfer, to get a general idea of how that move affects their play.

(Note - Rows represent the country players are transferring from and columns represent the country players are transferring to)
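
For readers who want to reproduce this kind of table, here is a minimal sketch of the metric and the per-route averaging. The field names, the function names and the two sample transfers are invented for illustration and are not the soccerway.com data. Taking an unweighted mean of per-player rates (rather than pooling goals and minutes) is an assumption.

```python
from collections import defaultdict


def goals_per_90(goals, minutes):
    """Goals scaled to a 90-minute match; None if the player never played."""
    return 90 * goals / minutes if minutes > 0 else None


def average_by_route(transfers):
    """Map (from_league, to_league) -> (mean before, mean after) goals per 90."""
    grouped = defaultdict(list)
    for t in transfers:
        before = goals_per_90(t["goals_before"], t["minutes_before"])
        after = goals_per_90(t["goals_after"], t["minutes_after"])
        if before is not None and after is not None:
            grouped[(t["from"], t["to"])].append((before, after))
    return {
        route: (
            sum(b for b, _ in pairs) / len(pairs),
            sum(a for _, a in pairs) / len(pairs),
        )
        for route, pairs in grouped.items()
    }


# Hypothetical transfers, one row per player.
sample_transfers = [
    {"from": "EPL", "to": "La Liga", "goals_before": 10, "minutes_before": 1800,
     "goals_after": 15, "minutes_after": 1800},
    {"from": "EPL", "to": "La Liga", "goals_before": 5, "minutes_before": 900,
     "goals_after": 9, "minutes_after": 900},
]

table = average_by_route(sample_transfers)
print(table[("EPL", "La Liga")])
```

Each key of the resulting dictionary corresponds to one cell of the table: the first element of the key is the row (league of origin) and the second is the column (destination league).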

We see a few interesting observations here:

  1. Not surprisingly, the performance of players varies much less when they transfer domestically as opposed to internationally. This supports the notion that players experience markedly different styles in different leagues.
  2. All players appear to have more success in terms of goals scored when they transfer to Germany. This is interesting and also surprising, because Serie A is ranked the worst among the four leagues, according to the latest UEFA Country Coefficients, so one would expect players to score more when moving to Serie A from other leagues. That said, Bundesliga players seem to score more goals when they transfer to the English Premier League. This may be due to the quality of players transferring from the Bundesliga, suggesting that only the most elite players from the Bundesliga join the EPL.
  3. EPL players seem to enjoy plenty of success when they transfer to La Liga, increasing their goal rate by over 60%. Suarez and Bale have scored less since joining La Liga, but these appear to be exceptions. In general, players in the EPL enjoy success in all other leagues, justifying the league’s status as one of the more physical and defensively oriented organizations around.
  4. While most players tend to improve their goalscoring prowess after transfers, the most significant decrease in performance comes when players move from the Bundesliga to La Liga. Spain’s premier division is rated the highest in the UEFA rankings, so it is no surprise that players from the Bundesliga, which has only one truly elite team in Bayern Munich, struggle to cope with the quality of La Liga.

Minutes Played Per Game

(Note - Rows represent the country players are transferring from and columns represent the country players are transferring to)

This metric calculates the number of minutes of a game a player is on the pitch. We see a few interesting results:

  1. In general, minutes played increases for most players after a transfer. This is an encouraging sign, as it means that teams are making use of their purchases in meaningful ways.
  2. Players seem to find the largest increase in playing time after transferring to the Bundesliga, or when they leave the English Premier League, which may shed some light on the competitiveness of each league. As mentioned before, the EPL has more competitive and higher-quality teams than the Bundesliga, with the exception of Bayern Munich. To reinforce this observation, Serie A players moving to the Bundesliga increase their playing time by over 35% (from 50 to 68 minutes per game). EPL players see their minutes increase by nearly 20% if they move to Serie A (50 to 60 minutes per game).
  3. That said, our dataset contains a significant amount of variance, large enough that all of these changes in minutes played are deemed “insignificant” by hypothesis testing. Drawing conclusive observations from any of these insights would require more transfer data, going back at least a decade. (Right now, our data goes back to 2011.)
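
To show how a noisy difference in means can fail a significance check, here is a small permutation test. This is a substitute technique chosen because it needs only the standard library, not necessarily the test we ran, and the minutes below are made up.

```python
import random


def permutation_test(before, after, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns (observed difference, approximate p-value)."""
    rng = random.Random(seed)
    observed = sum(after) / len(after) - sum(before) / len(before)
    pooled = list(before) + list(after)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the labels and recompute the difference in means.
        rng.shuffle(pooled)
        fake_before = pooled[:len(before)]
        fake_after = pooled[len(before):]
        diff = sum(fake_after) / len(fake_after) - sum(fake_before) / len(fake_before)
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return observed, (extreme + 1) / (n_permutations + 1)


# Hypothetical minutes per game for six players before and after a move.
before_minutes = [50, 20, 85, 40, 70, 35]
after_minutes = [68, 30, 90, 45, 88, 60]

difference, p_value = permutation_test(before_minutes, after_minutes)
print(round(difference, 1), round(p_value, 3))
```

With so few players and such a wide spread, many random relabelings can produce a gap as large as the observed one, which is the intuition behind a change being deemed insignificant.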

The above transfer data seems to support the idea that each league has a different playing style, judging by the comparison between domestic transfers and international transfers. The data also shows certain trends, like players scoring more when they move to Germany, regardless of where they come from. 

However, surprising trends like players not scoring a lot more when moving to Italy even though Serie A is ranked last out of the four leagues should be considered with context - Serie A's richest team is only 8th on the list of most valuable teams in the world, following many other teams in the other three leagues. Thus, the data may show that players moving to Italy do not score more because Serie A teams cannot afford the superstar international transfers that the other three leagues can.

Monday, May 4, 2015

Does Serving First Matter in Tennis?

In every tennis match, the strategizing begins before a single tennis ball is hit. A coin toss is held before the match, and the winner is given the choice between serving first or returning first.

* There are more options, but most players simply choose between serving and returning. See http://www.usta.com/Improve-Your-Game/Rules/Serving-and-Receiving/Winning_the_toss/ for more detail.
Most players choose to serve first because it gives them control of the first game of the match. There is, however, one major exception - Rafael Nadal, one of the most superstitious players on the ATP World Tour, always chooses to return first, regardless of whether he is playing a qualifier or a top 10 player.

In the 2009 Madrid Masters final, Federer took advantage of Nadal’s habit of returning serve by choosing to return first after winning the coin toss, forcing Nadal to serve first. Federer went on to win the match 6-4, 6-4.

Superstition aside, does serving first in a match really affect the chances of winning the match? This also raises the question: does serving first in a set substantially affect the chances of winning the set? Using point-by-point statistics of every match in Grand Slam tournaments from 2011 to the present, we compare the probability of a player winning when they serve first versus when they return first. We also conduct hypothesis tests to determine whether serving or returning first significantly affects the chances of winning the set or the match.

Data Analysis Process

Our point-by-point statistics included over 500 players who had played in any of the four Grand Slam tournaments from 2011 to the present. For the top 20-ranked ATP tour players, we plotted their probability of winning a match when returning first versus their probability of winning a match when serving first.

We see a few interesting observations:

  1. As expected, most players have a higher probability of winning the match if they serve first. As players win around 70% of their service points, serving first tends to give them a slight lead throughout the set and the psychological upper hand.
  2. Heavy-servers such as Kevin Anderson, Feliciano Lopez, and Milos Raonic have a higher probability of winning a match if they serve first. Gael Monfils and Gilles Simon, who are known to be good returners, have a higher probability of winning given that they return first. These players tend to put pressure on the person who serves first by initiating long rallies from the outset.
  3. Rafael Nadal, who prefers to return first, actually has a higher match-win probability if he serves first. Perhaps he should change his coin-toss strategy and take the first serve more often. Similarly surprising is John Isner, who wins more often when returning first even though he has one of the best serves on tour. Andy Murray, known to be a strong returner, actually wins more often when he serves first.

We also plotted, for the top 20 ATP players, the probability of winning a set given that they serve first and given that they return first.

We notice that:

  1. The majority of players actually have a better probability of winning a set given that they return first. This is because winning one set usually means returning first in the next set, since most sets are won by holding serve, not by a service break in the last game. Consequently, in each match, the player who more often returns first in a set is likely to be the stronger player, and is more likely to win the set. Thus, the results we see above are mostly circumstantial. Looking at the probability of winning given that a player serves first is more useful for matches than for sets, because players do not have much control over who serves first in a set, which causes the result that we see in this graph.
  2. Notable exceptions: Kei Nishikori does better in a set when serving first than when returning first, even though he is known to be a good returner. Rafael Nadal, again, does better when serving first than when returning first. This may be because regardless of the effect discussed above, Nadal and Nishikori still do just as well when serving first in a set against top players as they do when returning against weaker players.

We then used hypothesis tests to analyze the difference in win probability between serving first and returning first. In both the match and the set case, the difference was less extreme across all players than among the top 20 players, but in every scenario the results were statistically significant.
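
One standard way to test such a difference is a two-proportion z-test. The sketch below is only an illustration of that test, not necessarily the one we ran, and the win counts are invented.

```python
import math


def two_proportion_z_test(wins_a, n_a, wins_b, n_b):
    """Two-sided z-test for the difference between two win rates.

    Returns (z statistic, p-value) using the pooled standard error."""
    p_a, p_b = wins_a / n_a, wins_b / n_b
    pooled = (wins_a + wins_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    # Two-sided tail area of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: matches won when serving first vs. when returning first.
z_stat, p_val = two_proportion_z_test(550, 1000, 500, 1000)
print(round(z_stat, 2), round(p_val, 3))
```

With thousands of matches or sets in the sample, even a gap of a few percentage points can clear the usual 5% threshold, which is how a modest difference across all players can still be statistically significant.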

Code included in Github: