Tuesday, December 1, 2015

How the Press has Treated Top Tennis Players Over The Years

We all know that not all tennis players are treated equally by the press. Roger Federer is beloved by the media for his professionalism, class and grace. Tennis writers have always appreciated Novak Djokovic's humor and candor, even though he carried a somewhat arrogant persona in his early years. Serena Williams is adored by tennis journalists as of late as she continues to shatter records in the Open Era, but has been demonized for her controversial behavior in the past.

Here at DataBucket, we seek to quantify and visualize things as much as possible. While these media perceptions of tennis players may well be true, we thought it would be interesting to quantify the tennis press's sentiment towards top tennis players over time, with the goal of matching our results to some of the ups and downs of each legendary player's career.

To quantify this sentiment, we gathered tennis interview transcripts that are generously made available to the public at asapsports.com. Using 1000 interviews from the website, we trained a natural language processing algorithm (specifically a Maximum Entropy Classifier) to classify each interview question as "positive," "neutral," or "negative" (we manually read through and labeled a subset to use as training data). Using this classifier, we assigned a score to each interview a top player has given - for each positively-toned question, we added 1 to the score, and for each negatively-toned question, we subtracted 1 from the score.
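
The scoring step described above can be sketched in a few lines. This is only an illustration of the +1/-1 tally, assuming the classifier has already produced one label per question; the function name and label strings are hypothetical, not taken from our actual pipeline.

```python
from collections import Counter

# Assumed label strings; the post only names the three classes.
LABEL_VALUES = {"positive": 1, "neutral": 0, "negative": -1}

def interview_score(question_labels):
    """Add 1 per positive question, subtract 1 per negative one."""
    counts = Counter(question_labels)
    return sum(LABEL_VALUES[label] * n for label, n in counts.items())

# Made-up labels for a five-question interview.
labels = ["positive", "neutral", "negative", "positive", "positive"]
print(interview_score(labels))  # 3 positive - 1 negative = 2
```

Because the score is a raw sum rather than an average, an interview with more questions can reach a larger score in either direction.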

The results can be found for the top male and female players in the past decade. One can immediately notice that scores near the present are generally much higher than in the past - which suggests that:
  • More questions are being asked in press conferences, and
  • Tennis journalists are less critical of these players as they reach the twilight of their illustrious careers.
We were also able to identify certain events that produced more extreme scores. Take September 12th, 2009 for example - this was the day Serena Williams threatened a lineswoman for calling a foot fault on her - it has one of her lowest sentiment scores to date. Another example is September 10th, 2011 - the day Federer lost to Djokovic in five sets after seeing the Serb smack forehand winners on his match point. The Swiss maestro publicly voiced his disapproval of Djokovic's "careless" play, which would explain his subpar interview rating on that day.

Some positive moments were captured too - Caroline Wozniacki had some of her highest ratings during her 2014 US Open Final run (September 2014), and Serena Williams had some of her best ratings during her pursuit of the Calendar Grand Slam this season (although her semi-final interview after her loss to Roberta Vinci was much less positive).

We were also able to notice some interesting trends - for instance, journalists at the Indian Wells Masters seem to love Maria Sharapova, as her sentiment rating seems to peak at the beginning of March in many seasons (e.g. 2006, 2008, 2011, 2013).

We definitely did not cover all the trends so feel free to play around with our interactive graphic and comment on any interesting findings!

Sunday, November 1, 2015

How Good are NBA General Managers?

In a press conference this week, LeBron James talked about the 76ers' team-building process as they enter a third season with a substandard roster. He said, "It's always a process...You got to build things from the ground up. This year, it's about making a transition." This week, we ask, can we quantify this team-building process?

We've looked at metrics quantifying the clutchness of individual NBA players before, but what about from a team perspective? Is there a way to analyze the quality of a general manager, in making decisions about team selection and player acquisition?


This week, we aim to quantify team chemistry - how well the players work together as opposed to individually. Then, we map out team chemistry in relation to player retention to see whether GMs can identify when they have good core teams or when they need to mix it up.

To quantify team chemistry, we compared the expected win share of each team (totaled individual win shares of each player on the roster) with the realized win share. We had to calculate expected win share per player as a point of comparison, because actual win shares are already impacted by team chemistry - players will score more points if they are on a team that works well together. Expected win-share would give us a baseline for how players perform on a neutral team.

To calculate expected win share, we used a metric called a similarity score, which quantifies how similar two players' career trajectories are. We used 2013-14 data to calculate these similarity scores, and predicted each player's win-share by the win-shares of the 20 most similar players at this point in their career trajectory, weighted by their similarity score. By using these other players, we hoped to average out the team effect and isolate the individual effect.
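
As a rough sketch of the weighting step described above, the expected win share can be written as a similarity-weighted average over the most similar players. The function name and the input layout (pairs of similarity score and win share) are assumptions for illustration, and the numbers are made up.

```python
def expected_win_share(comparables, k=20):
    """Similarity-weighted mean win share of the k most similar players.

    comparables: list of (similarity_score, win_share) pairs for other
    players at the same point in their career trajectory.
    """
    top = sorted(comparables, key=lambda c: c[0], reverse=True)[:k]
    total_weight = sum(sim for sim, _ in top)
    if total_weight == 0:
        raise ValueError("similarity weights sum to zero")
    return sum(sim * ws for sim, ws in top) / total_weight

# Made-up comparables: (similarity, win share).
comps = [(0.9, 6.0), (0.8, 4.0), (0.3, 10.0)]
print(round(expected_win_share(comps, k=2), 2))  # uses only the two most similar
```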

By totaling up these expected win-shares and realized win shares for everyone on a team's roster, we could quantify the "chemistry", or the performance above individual expectation, a team had. Then, we plotted this chemistry metric against retention. Retention was also measured based on win share - we calculated the win-share total of a team in one year, then the percent of the win-share that would remain next year after some players were traded away.
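
The two team-level quantities above can be sketched as follows. We assume here that "performance above expectation" is a simple difference between realized and expected totals; the field names and the toy roster are hypothetical.

```python
def chemistry(roster):
    """Realized team win share minus the sum of individual expectations."""
    realized = sum(p["actual_ws"] for p in roster)
    expected = sum(p["expected_ws"] for p in roster)
    return realized - expected

def retention(roster, returning_names):
    """Share of this season's win shares held by players kept for next season."""
    total = sum(p["actual_ws"] for p in roster)
    kept = sum(p["actual_ws"] for p in roster if p["name"] in returning_names)
    return kept / total

# Toy three-player roster with made-up numbers.
roster = [
    {"name": "A", "expected_ws": 8.0, "actual_ws": 10.0},
    {"name": "B", "expected_ws": 5.0, "actual_ws": 6.0},
    {"name": "C", "expected_ws": 4.0, "actual_ws": 4.0},
]
print(chemistry(roster))              # 20 realized - 17 expected
print(retention(roster, {"A", "B"}))  # 16 of 20 win shares return
```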

Plotting retention against chemistry allows us to classify NBA franchises into four categories, as shown in the table above. Teams with low retention but above-average performance are indications of a newly formed core team, as newly acquired players have figured out how to work together in a short period of time. General managers with a high retention and above-average performance team have found a core group of players that work well together. In these two instances, GMs should not break up their rosters.

On the other side of the spectrum, low retention and below-average performance teams are usually in the rebuilding stage, and GMs should continue looking for a better roster. High retention and below-average performance franchises are signs of teams falling apart in their chemistry, and GMs should seriously consider making some key changes to their roster.
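
The four categories can be expressed as a small classification rule. The cutoffs below (50% retention, zero chemistry) are placeholder assumptions, since the boundary between "high" and "low" is a judgment call, and the category labels are shorthand for the descriptions above.

```python
def classify_team(retention, chemistry, retention_cutoff=0.5, chemistry_cutoff=0.0):
    """Map a (retention, chemistry) point to one of the four quadrants."""
    high_retention = retention >= retention_cutoff
    above_expectation = chemistry > chemistry_cutoff
    if above_expectation:
        # Top half of the plot: GMs should not break up the roster.
        return "established core" if high_retention else "newly formed core"
    # Bottom half: the roster is underperforming its individual expectations.
    return "falling apart" if high_retention else "rebuilding"

print(classify_team(0.8, 3.0))   # top right
print(classify_team(0.3, -2.0))  # bottom left
```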

With this in mind, we plotted retention vs. chemistry for all NBA teams in the past 10 seasons. At a quick glance, you will see that the data points are widely scattered, suggesting that some general managers tailor their retention in response to their team's performance, and others do not. In particular, GMs in the bottom right quadrant should be mixing up their teams due to poor team chemistry, but are not.

The Lakers Are Doing It Right

Retention vs. Performance Above Expectation for the Los Angeles Lakers
A closer look into the Los Angeles Lakers reveals a franchise that really knows how to handle good and bad times. After winning 3 consecutive championships in 2000-2002, the Lakers had a series of mediocre seasons and reshuffled a lot of their players. After a surprisingly positive 2006 season, in which they pushed the Phoenix Suns to seven games in the playoffs despite plenty of roster changes, they realized they had created a "newly formed core team" (top left in graph). For the next 4 years, the Lakers kept their main players, resulting in another golden era (2007-2011) in which they consistently performed better than expected and won two NBA championships (top right in graph). But after a poor 2011-2012 season, in which they fell tamely to the Thunder in the playoffs, they realized they were maintaining a poor team (bottom right). As a result, they decided to rebuild once again, trading and drafting players in the hope of finding a new core group that can become a championship team. As of 2015, they have yet to find one (bottom left), but the Lakers seem to be sticking to a strategy that has won them 16 championships.

The Knicks Are Not Doing it Right

Retention vs. Performance Above Expectation for the New York Knicks
A team that has not followed this philosophy is the New York Knicks. For many years, the Knicks have disappointed their fanbase, making the playoffs on very few occasions despite paying high salaries to top players like Carmelo Anthony and Amar'e Stoudemire. The reason for this is their failure to break up their core group of players despite inconsistent and poor performances. From 2007-2010, NYK teams were situated in the bottom right, showcasing the franchise's reluctance to rebuild. In 2014 and 2015, the team has again been in this quadrant, and most recently had its worst season in history (17-65). This franchise really needs to reshuffle its team in order to compete for a championship once again.

Is LeBron Right about the 76ers?

Indeed, it seems that the 76ers are quantifiably in a "transition" process, as LeBron called it. Looking at the past three years, they've had performance below expectation, but also low retention rates, indicating that their GM is aware of the fact that he has to shuffle his roster up. This marks a strategy turnaround from years 2007-10, when they appeared to retain heavily in spite of their dismal performance.

What This Means for the 2015-2016 Season

Retention vs. Performance Above Expectation for 2014-2015 Season Teams
We can also use the data to identify teams that are likely to perform well this season. Looking at just the 2015 season, we see that the Warriors, Hawks and Grizzlies all sit firmly in the top right quadrant. For the upcoming season, these teams have a retention rate of at least 65%, suggesting that they are continuing to build on their core teams. However, the Spurs, despite a 90% retention rate last year, are only keeping 58% of this team this season after trading Cory Joseph, Tiago Splitter and Aron Baynes, their lowest retention in a decade. Thus it remains to be seen if the Spurs can truly perform this season despite their reputation as one of the more team-oriented franchises in the league.

Some may argue that acquisition decisions are largely due to luck, and it is very difficult to attribute player performances to the competence of general managers. While this is a valid point, we hope that this data will cast some insight not on whether teams got lucky with the decisions they made, but rather whether they took advantage of favorable situations, such as converting high potential teams (top left) into a high-quality cohesive group (top right).

Wednesday, October 14, 2015

Why Gucci is Losing at Social Media

Luxury fashion houses, such as Prada, Louis Vuitton, and Gucci, are facing a turning point. Despite being some of the most recognizable names in the world, some of these brands are facing declining profits. Prada, for example, faced a 28% decline in net profit in the last 9 months of 2014, with a growing reputation of being "outdated" and lacking relevance. Other up-and-coming brands, like mid-level luxury brand Kate Spade, are being praised for their well-curated online content. With the rise of social media, these brands now face an enormous task - being widely recognizable and gaining market share, but also remaining elite and exclusive.


To look at how these brands compare in their social media influence, we turned to Instagram, which has an API to query for limited data. Looking at these brands' number of followers, hashtags, and averages on likes and comments for their 20 most recent posts, we analyze how "with it" each of the top luxury brands truly are.

We can see that in terms of pure numbers, Chanel, Prada, Dior, and Gucci reign supreme in the number of hashtags of those companies' names. A quick search through Instagram branded hashtags indicates that people usually tag pictures of that company's products that they've bought, of its stores, or of products (mainly counterfeits) that they are trying to sell. Having a large number of hashtags indicates wide recognition - this intuitively seems correct. Taking an informal survey of my friends, it seems that everyone - even uninterested males - has heard of these four brands before.

Looking at the number of followers, however, it seems that this hierarchy doesn't hold. In terms of followers, Louis Vuitton ranks number 1. Chanel, Prada, Dior, and Gucci still have a large number of followers, but brands like Michael Kors have nearly caught up with them, and Michael Kors actually overtakes Prada.

Looking at the proportion of tags and followers, we can more clearly see which brands have more loyalty. Louis Vuitton and Michael Kors have a greater proportion, out of the total of all the brands, of followers than they do of tags. This may mean that they have a loyal fan base - getting more interested followers rather than random tags. It may also mean that they have better content, or the other brands have worse content - lots of tags from people liking the product, but fewer followers because their Instagram feeds are just not that good.
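
The comparison of proportions can be sketched like this: each brand's count is divided by the total across all brands, once for hashtags and once for followers. The brand names and counts below are placeholders, not the figures from our data.

```python
def shares(counts):
    """Convert raw counts per brand into each brand's share of the total."""
    total = sum(counts.values())
    return {brand: n / total for brand, n in counts.items()}

# Placeholder numbers purely for illustration.
hashtags = {"BrandA": 6_000_000, "BrandB": 3_000_000, "BrandC": 1_000_000}
followers = {"BrandA": 2_000_000, "BrandB": 5_000_000, "BrandC": 3_000_000}

tag_share = shares(hashtags)
follower_share = shares(followers)
for brand in hashtags:
    gap = follower_share[brand] - tag_share[brand]
    print(f"{brand}: follower share minus tag share = {gap:+.2f}")
```

A positive gap corresponds to the Louis Vuitton and Michael Kors pattern described above: a larger share of followers than of tags.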

We also look at the average number of likes and comments on each brand's 20 most recent posts.

This graph shows a plot of luxury brands with their average number of likes per post on the x-axis and the average number of comments per post on the y-axis. We draw a regression line to see trends in the relationship between the number of comments and the number of likes, and constrain this regression line to have a y-intercept of 0, since it makes sense that a post with no likes should also have no comments. We also draw confidence bands for this regression line, which give an interval around each predicted y-value within which sample points should fall if they follow this postulated distribution.
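
For a line forced through the origin, the least-squares slope has a closed form: the sum of x·y divided by the sum of x². The sketch below shows that calculation and the residuals it implies; the like and comment counts are made up, and the function name is only illustrative.

```python
def slope_through_origin(xs, ys):
    """Least-squares slope b for the model y = b * x (intercept fixed at 0)."""
    sxx = sum(x * x for x in xs)
    if sxx == 0:
        raise ValueError("all x values are zero")
    return sum(x * y for x, y in zip(xs, ys)) / sxx

# Made-up averages: likes per post (x) and comments per post (y).
likes = [1000, 2000, 4000]
comments = [10, 22, 38]

b = slope_through_origin(likes, comments)
residuals = [y - b * x for x, y in zip(likes, comments)]
print(b)
print(residuals)  # positive = more comments than the line predicts
```

A brand with a large positive residual sits above the line, and one with a large negative residual sits below it.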

Again, we see that well-followed brands from above like Chanel, Michael Kors, and Louis Vuitton have a high absolute number of likes on their posts. One main outlier is Kate Spade, which has a much higher number of comments per post than predicted. This is indicative of the Kate Spade brand's social media savvy, which they have focused on building as an integral part of their marketing strategy. Kate Spade also has a few posts with an extraordinarily high number of comments because of their giveaways and promotions that depend on social media interaction, such as one in which they raffle off prizes to commenters.

Having a high number of comments in proportion to likes may show that Kate Spade has relatively loyal followers who feel like they're interacting with the brand and want to communicate back. Many of the comments consist of users tagging their friends, recommending them the advertised products.

An outlier on the negative side is Gucci, which has a lower number of comments than predicted.

This may speak to the reputation of Gucci as catering to the middle-aged wealthy, rather than the young and social media-savvy. However, the fact that Gucci had a large proportion of followers, as shown above, but a low absolute number of likes per post, may also speak to poor content on its Instagram page. As shown above, Kate Spade posts are accessible and straightforward (cupcakes, happiness) while Gucci posts are avant garde and harder to understand. Perhaps Gucci strives to maintain its image of absolute luxury and high-fashion and wants to maintain its exclusivity.


However, comparing Gucci posts to Chanel posts, Chanel being another extremely high-fashion competitor, it seems that Chanel's content is still more accessible and aesthetically appealing than Gucci's. Chanel also features popular celebrities like Cara Delevingne and Kristen Stewart, while Gucci posts do not leverage the same star power. Perhaps Gucci should revise its social media content to be more traditionally appealing, in order to avoid putting off its followers, and promote more celebrity endorsements and giveaways through its social media posts.

Sunday, September 20, 2015

Be Suspicious of NYC Restaurant Health Ratings

Whether you pay serious attention to them or not, each restaurant in New York City has a grade posted outside indicating its health rating. Among the 24639 restaurants in the recently published NYC Restaurant Inspection Result Dataset, nearly 80% have been awarded an "A" safety rating, 15% have been awarded a "B" and 5% have gotten "C" or worse.

However, should New Yorkers really trust these health ratings? Ben Wellington of I Quant NY made a compelling case showing that health inspection scores, which help classify restaurants into grades (score of 0-13 is an "A", 14-27 is a "B", 28+ is a "C"), suffer from the "bumping up" syndrome, meaning that restaurants on the cusp of a higher grade tend to be bumped up to the higher grade.
Furthermore, health inspections are made on a random basis annually, meaning that health grades only represent safety conditions in the past year.
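
The score-to-grade mapping quoted above is simple enough to write down directly. This is a minimal sketch of those cut points only; the function name is ours.

```python
def grade_from_score(score):
    """Map an inspection score to a letter grade: 0-13 A, 14-27 B, 28+ C."""
    if score < 0:
        raise ValueError("inspection scores are non-negative")
    if score <= 13:
        return "A"
    if score <= 27:
        return "B"
    return "C"

for s in (13, 14, 27, 28):
    print(s, grade_from_score(s))
```

The grade changes on a single point at 13/14 and at 27/28, which is exactly where any "bumping up" would matter.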

Sadly, an examination of the history of health inspection grades of New York restaurants suggests an inconvenient truth - many restaurants considered "safe" today have not always been "safe" in the past. This may mean either that restaurants improve their sanitary conditions significantly after a health inspection, or that inspectors tend to "bump up" restaurant grades from one year to the next.

As seen in the graph above, a majority of restaurants that were rated "B" and "C" in their infancy end up becoming grade "A" restaurants - at an astonishing 72% and 65% respectively. From the opposite point of view, 25% and 15% of restaurants which are now grade "A" restaurants actually started off at a grade "B" and "C" sanitary level. Given that most restaurants in NYC do not have a very long life span (80% of NYC restaurants close after 5 years) and that this grading system was only formally established 5 years ago, having so many restaurant grades increase in such a short time leads us to question whether the letters at the front of every NYC eatery are truly reliable.
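
Percentages like these come from comparing each restaurant's first grade with its current one. A minimal sketch of that tally, assuming one (first grade, current grade) pair per restaurant, might look like the following; the six histories are invented and do not reproduce our figures.

```python
from collections import Counter

def share_now_a_by_first_grade(histories):
    """For each starting grade, the fraction of restaurants that now hold an A."""
    started = Counter(first for first, _ in histories)
    now_a = Counter(first for first, current in histories if current == "A")
    return {grade: now_a[grade] / started[grade] for grade in started}

# Invented (first_grade, current_grade) pairs.
histories = [("B", "A"), ("B", "A"), ("B", "B"), ("C", "A"), ("C", "C"), ("A", "A")]
print(share_now_a_by_first_grade(histories))
```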

Monday, September 14, 2015

Are Women's Tennis Rankings More Volatile than Men's Rankings?

Serena Williams' shocking loss in the 2015 US Open semi-finals left number 26 and number 43 in the world to face off in the final. Meanwhile, the final on the men's side was contested by number 1 and number 2, Novak Djokovic and Roger Federer, respectively. That, in combination with the fact that many female players now far off the radar, such as Ivanovic and Jankovic, are former world number 1s, led us back to the question - how volatile are women's rankings in comparison to men's?

Using weekly ATP and WTA rankings data from Jeff Sackmann's GitHub, we analyze the variance of the rankings of players currently in the top 30. We also exclude rankings data outside of the top 100 to minimize the variance impact of when these players first became professional, which is not indicative of their pro performance.
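
The per-player statistic can be sketched as follows: drop the weeks ranked outside the top 100, then take the mean and variance of what remains. Whether to use population or sample variance is our assumption here (population), and the weekly ranks are made up.

```python
from statistics import mean, pvariance

def rank_stats(weekly_ranks, cutoff=100):
    """Mean and variance of a player's weekly rank, ignoring weeks outside the cutoff."""
    in_range = [r for r in weekly_ranks if r <= cutoff]
    if not in_range:
        return None  # player never ranked inside the cutoff
    return {
        "weeks": len(in_range),
        "mean": mean(in_range),
        "variance": pvariance(in_range),
    }

# Made-up history: a rise from outside the top 100 into the top 10.
history = [250, 120, 90, 50, 30, 10, 8, 12]
print(rank_stats(history))
```

The `weeks` count corresponds to the circle size in the graphs, i.e. how long a player has been inside the top 100.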

Looking at the WTA rank variance of the current top 30 players, we see that as expected, strong players like Serena Williams and Maria Sharapova who rank consistently at the top (excluding injuries) have low mean rank and low rank variance. For mid-tier players, such as Sam Stosur and Roberta Vinci, the variance on the whole becomes much higher.

Smaller circles, which indicate newer players with fewer weeks in the top 100 under their belts - such as Sloane Stephens and Petra Kvitova - have markedly higher mean and variance than the "power cluster" of consistent, top players. However, there are many mid-tier players with many weeks in the top 100, but still large variance and average rank. For newcomers, their ranking behavior is still yet to be determined - they could join either the consistent top players or the varying mid-tier players.

The graph of the top 30 ATP players shows that ranking means are similar across men and women, ranging from 5 to 55. However, the variance is lower for men on the whole. Similar to the WTA results, small circles indicating newcomers generally trend to the right and the top of the graph, meaning higher variance and rank. This is due to the fact that these players undergo a lot of ranking movement when they first go pro, which is not indicative of their long-run ranking behavior.

Again, like in the women's results, men's ranking behavior breaks into two camps: the consistent, top players like Roger Federer and Rafael Nadal, and the mid-tier players who vary more, such as David Ferrer and Philipp Kohlschreiber. One surprise is that Novak Djokovic has such a low average rank but such a high ranking variance - Djokovic has sharp rises in the rankings, and variance penalizes that over small incremental increases.

Finally, looking at the graph of WTA rank variance vs ATP rank variance over the years with regards to the current top 30 players, we see that WTA variance is significantly higher than ATP variance. This is mostly attributed to periods of extreme variance exhibited by certain players, such as Maria Kirilenko and Jamie Hampton. On the whole, however, looking at the individual variances of the top 30 players, women do have higher rank variance than do men.

Saturday, September 12, 2015

The Odds of an All-Italian US Open Final Were Less than 1%

The semi finals of the women's US Open produced two monumental upsets. Just like in the men's final four last year, where two significantly lower ranked players upset the top two seeds, Flavia Pennetta dismissed Simona Halep in straight sets, and Roberta Vinci came from a set down to deny Serena Williams the first calendar Grand Slam since Steffi Graf in 1988.

FiveThirtyEight declared Serena Williams' loss the greatest loss of all time, according to the Elo Ratings of Williams and Vinci at the time of their semi final match up. That said, we wanted to measure this upset in a probabilistic manner. How likely was it that both Italian players upset the top seeds?

To answer this question, we refer to our tennis prediction model, which also uses an Elo-Rating style metric to calculate a player's ability. However, our system only incorporates matches played in the past year and head-to-head matches between players in the past 5 years. Our method also places more emphasis on detailed tennis metrics such as sets and games won in each match, the court surface being played on, and the stage and quality of tournaments being played. This allows our model to make accurate predictions for any tournament at any given time.
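
For readers unfamiliar with Elo, the textbook formula turns a rating gap into a win probability. The sketch below is the standard Elo expected-score curve with the conventional 400-point scale; it is not our model, which layers the additional adjustments described above on top of an Elo-style rating.

```python
def elo_win_probability(rating_a, rating_b, scale=400.0):
    """Textbook Elo expected score: probability that player A beats player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))

# Hypothetical ratings: equal players, then a 400-point favorite.
print(elo_win_probability(2000, 2000))
print(elo_win_probability(2400, 2000))
```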

The table above shows each semi-finalist's chance of reaching each stage of the tournament (Finalist or Winner) before the Friday matches. Notice that the Italians only had a 21% and 3.7% chance of winning their matches. Further analysis of past tennis data (from 1968 to 2015) suggested that semi final match outcomes are essentially independent of each other. Probabilistically, the chance that the second match would be an upset is the same as the chance that the second match would be an upset given that the first match was an upset. Thus, the probability that an All-Italian US Open final would have occurred is 21% x 3.7% = 0.8%.
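
Under the independence assumption, the joint probability is just the product of the two upset probabilities. Spelled out with the two figures quoted above:

```python
from math import prod

# Pre-match win probabilities of the two Italian semi-finalists (from the text above).
upset_probabilities = [0.21, 0.037]

# Independence lets us multiply the individual probabilities.
p_all_italian = prod(upset_probabilities)
print(f"{p_all_italian:.2%}")  # 0.78%, i.e. roughly 0.8%
```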

In terms of who would win the final tomorrow, betting odds have made Flavia Pennetta a 4/9 favorite, an implied winning probability of about 69%. Our model suggests otherwise, giving Pennetta a win probability of merely 54.7%. As our model places more weighting on later stage matches and strength of opponent, Vinci's ability improved much more than Pennetta's, as Williams has a significantly higher rating than the rest of the field. Thus, despite being ranked over 15 places higher than Vinci, Pennetta is only a slight favorite in this final matchup. You can essentially treat this final as a toss up.
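
To see where an implied probability comes from: fractional odds of 4/9 mean a stake of 9 returns a profit of 4, so the break-even win probability is 9 / (4 + 9). The sketch below does this conversion while ignoring the bookmaker's margin, which is a simplifying assumption; the function name is ours.

```python
from fractions import Fraction

def implied_probability(profit, stake):
    """Break-even win probability for fractional odds quoted as profit/stake."""
    return Fraction(stake, profit + stake)

p = implied_probability(4, 9)  # odds of 4/9
print(p, float(p))             # 9/13, roughly 0.69
```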

Stay tuned for our preview of the men's final on Sunday.

Tuesday, September 8, 2015

Study of the VIX

There is plenty of analysis out there about the stock market. Much of this analysis is on an intraday basis, analyzing how individual stocks move based on oil prices or geopolitical turmoil. Sometimes these explanations have an obvious correlation to the markets; other times, these explanations are nothing more than educated guesses.

Instead, we're interested in long-term trends. This week, we study the relationship between the VIX index and the S&P 500.

VIX Index 
The VIX index is primarily used as a representation of the market's expectations of the 30-day volatility of the stock market, expressed in percentage points. Specifically, the VIX is 100 times the square root of the expected 30-day variance of the S&P 500 rate of return.
$$\text{VIX} = 100 \sqrt{\text{var}}$$
where $\text{var}$ is the annualized expected 30-day variance. The expected 30-day variance is estimated by the forward price of S&P 500 options with 30 days to expiration, $e^{rt}S$ where $S$ is the spot price. The forward prices of S&P 500 options represent the market's risk-neutral expectation of the variance of the underlying.

No-arbitrage pricing says that the forward price of variance must equal the forward price of its replicating portfolio. Since holding forward positions in a portfolio does not contribute value to the portfolio at the present, the forward price of variance must equal the forward price of the options. If 30-day options are not available, the VIX is calculated using a weighted average of forward prices of options with expirations close to 30 days.
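
As a quick worked example of the formula above: an annualized expected variance of 0.04 corresponds to a VIX of about 20. The second function below applies the usual square-root-of-time convention to turn that annualized figure into a rough one-standard-deviation move over 30 days; that de-annualization step is our addition for illustration and is not part of the VIX definition itself.

```python
import math

def vix_from_variance(annualized_variance):
    """VIX = 100 * sqrt(annualized expected 30-day variance)."""
    return 100.0 * math.sqrt(annualized_variance)

def implied_30_day_move(vix, periods_per_year=12):
    """Rough one-sigma move over ~30 days, in percent (square-root-of-time rule)."""
    return vix / math.sqrt(periods_per_year)

vix = vix_from_variance(0.04)
print(round(vix, 2))                       # about 20
print(round(implied_30_day_move(vix), 2))  # about 5.77
```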

We can see that the VIX follows the general shape of the S&P 500's forward 30-day volatility, but with a lag of a few days. This indicates that the VIX is good at determining the level of volatility in the next 30 days, but not at predicting large changes in volatility. Moreover, for high levels of S&P 500 forward volatility, such as in the beginning of October 2008 following Lehman's bankruptcy and preceding several DJIA increases and declines, the VIX seems to underestimate the level of volatility in the next 30 days. Generally, however, the VIX seems to remain above the actual S&P 500 volatilities.
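
The "forward 30-day volatility" that the VIX is compared against can be computed from daily closing prices. The sketch below is one common way to do it, under assumptions we have chosen for illustration: 21 trading days as a stand-in for 30 calendar days, 252 trading days per year for annualization, and log returns.

```python
import math
from statistics import pstdev

def forward_realized_vol(prices, start, window=21, trading_days=252):
    """Annualized volatility (percent) of daily log returns over the
    `window` trading days that follow index `start`."""
    segment = prices[start:start + window + 1]
    if len(segment) < window + 1:
        raise ValueError("not enough future prices for this window")
    log_returns = [math.log(b / a) for a, b in zip(segment, segment[1:])]
    return 100.0 * pstdev(log_returns) * math.sqrt(trading_days)

# Made-up price path that alternates up and down by about 1% a day.
prices = [100.0 * (1.01 if i % 2 else 1.0) for i in range(30)]
print(round(forward_realized_vol(prices, 0), 1))
```

Putting this series on the same chart as the VIX gives the kind of comparison described above, since both are annualized percentages.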

The difference between the VIX and the historical S&P 500 volatilities shows points in time where the VIX is significantly lower. These include the end of September 2008 and the beginning of October 2008, which as we mentioned before, included the worst of the financial crisis. These low VIX points also include the end of April 2010, which preceded the May 6 "Flash Crash", a trillion-dollar stock market crash that lasted just minutes. Another dip in the VIX compared to the S&P 500 was at the end of July 2011, which preceded an August 2011 stock market crash due to a US credit downgrade. The last VIX dip in the graph is due to the recent China crisis. These are all points of high S&P volatility in the first graph that the VIX severely underestimates. 

Sunday, September 6, 2015

Stanimal's Title Chances are Worse than You Think

The first week of this year's US Open has been tumultuous - top 10 players Kei Nishikori, David Ferrer, Rafael Nadal and Milos Raonic have all crashed out, and the tournament has had a record 16 retirements. In particular, Jack Sock and David Goffin were leading their matches, only to succumb to the extreme heat and humidity.

Despite all the unpredictability, the two Swiss contenders, Roger Federer and Stan Wawrinka, have reached the second week of the tournament in contrasting fashions. Federer seems to be enjoying himself, toying with his opponents with his flashy shot making and half-volley returns, while Wawrinka has somehow escaped from close tiebreak situations, including a seemingly lackluster effort in his match against Ruben Bemelmans.

With that, we were interested in what our tennis prediction model says about the chances of Federer and Wawrinka ending their tournament at each round, and how these compare to betting odds. Not surprisingly, our odds are fairly similar to the ones provided on betting websites. However, we believe that Wawrinka's chances of ending his run at the QFs are higher (59%) than betting websites suggest (52%). As our model places emphasis on the closeness of each match, the fact that Wawrinka played more tiebreaks, even though he has not lost a set in this tournament, lowers his prospect of reaching the later stages of the US Open. As a result, our odds of him reaching the semi finals, reaching the final and winning are significantly lower.

On the other hand, our odds for Roger Federer are in line with those of betting companies, as a result of his masterclass displays in each of his three matches. In fact, our estimated probability of him losing before the final is significantly lower than the betting probabilities.

To look at prospects of other remaining players reaching different stages of the tournament, check out our results below. Stay tuned for more updates in the middle of the week.

Tuesday, September 1, 2015

Nishikori's Early Exit does not Improve Djokovic's Title Chance

Upon the conclusion of the US Open's first round matches, many would believe that Kei Nishikori's early exit will open up the draw for Novak Djokovic and improve his title prospects. However, our tennis prediction model suggests that Djokovic's chances of winning remain level at around 55%. Similarly, Federer's and Murray's chances stay the same at around 25-26% and 8-9% respectively.

Ultimately, the reason why Djokovic's prospects haven't changed is that he is still likely to face Nadal in the quarterfinals, and Federer or Murray in the final. Furthermore, the top three players in the world (Djokovic, Federer and Murray)  have a combined 90% chance of winning the tournament, while Nishikori's prospect prior to Monday was a mere 3.8%. Should Federer or Murray have suffered an upset in the first round, Djokovic's title chances would have definitely skyrocketed.

Nishikori's early exit also raises an interesting question - which player will emerge from that quarter? Our model suggests that Marin Cilic, the reigning US Open Champion, and not David Ferrer, the highest seed left in that quarter, has the highest odds. This may be due to the fact that Ferrer has not won a match since Roland Garros, and Cilic recently reached the quarterfinals of Wimbledon.

Despite ranking outside the world's top 40, Benoit Paire has a decent chance (6.4%) of reaching the semi finals. As our model rewards players who pull off upsets, Benoit Paire's rating increased greatly after his first round win over Nishikori, making him the fifth favorite player to come out of this quarter of the draw. Also look out for dark horse Jo-Wilfried Tsonga - while he has dropped to as low as 18th in the world, his semi-final odds are only a tad lower than Ferrer's as a result of his strong showing in Montreal.

Sunday, August 30, 2015

US Open Preview - Djokovic Still Heavy Favorite Despite Recent Losses


After last year's US Open, many began to question whether Marin Cilic's triumph represented the beginning of a transition at the top of men's tennis, from the dominant Big Four to a set of young guns that included Cilic, Kei Nishikori, Milos Raonic, Grigor Dimitrov and more. However, with the exception of Kei Nishikori, who has risen to a career high ranking of number 4, the other young guns have yet to challenge the likes of Novak Djokovic, Roger Federer and Andy Murray. Coming into this year's tournament in Flushing Meadows, it seems like these three players are yet again the clear favorites. But what are the odds among these elite men? Is Djokovic still the clear favorite after losing to both players in back to back finals? Will Federer's return-and-charge approach still bode well in a best of five set match on slower hard courts?

To answer these questions, I enhanced the prediction model I made for this year's Wimbledon. In accounting for players' performance in the past year, the model places more emphasis on hard court matches, the quality of opposition that players have beaten or lost to, how close each player's matches were, and the level of tournaments players have participated in. In particular, I put special emphasis on matches played in Washington, Montreal and Cincinnati. History has suggested that players that have performed well in these tournaments have gone on to do well in the US Open. For instance, all players who have won Montreal/Toronto and Cincinnati in the same year have gone on to win the US Open title (except Andre Agassi in 1995).

Simulating 50,000 tournaments led to the following odds of each player reaching each stage of the tournament:

Despite losing in finals as of late, Djokovic still has a 56% chance of winning the US Open. The 28-year-old Serb is having his best season since his historic 2011 breakout campaign. Given that his recent losses have come in best-of-three-set matches, the best-of-five format of Grand Slam matches should give him more time to adjust to opposition tactics and ultimately come out triumphant.

Also interesting to note is that Federer's title prospects far exceed Andy Murray's. I have frequently dismissed the common belief that Murray has a better chance than Federer, and I will reiterate that here. The Swiss Maestro has won his last five encounters against the Scot, including the last ten sets, so the recent Cincinnati champion will definitely be the favorite in their potential semi final encounter.

Other interesting results include Andy Murray having a relatively low chance (48%) of reaching the semi finals. This is likely due to a potential blockbuster match against Roland Garros Champion Stanislas Wawrinka in the quarter finals. Also notice that Rafa Nadal has a surprisingly low chance of reaching the semi finals (6.9%), as he is placed in the same quarter as Djokovic. That said, he is still the sixth favorite to win the title (1.2%), as Nadal has a tendency to peak in the latter stages. Should he get past Djokovic, his title prospects will skyrocket.
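For readers curious how simulated tournaments turn into stage-by-stage odds like the ones discussed above, here is a minimal Python sketch. It is not the actual DataBucket model: the logistic win-probability curve, the `ratings` values and the placeholder player names are all assumptions made for illustration, and the draw is assumed to be a power of two in size.

```python
import random

def win_prob(rating_a, rating_b):
    """Chance that A beats B under a logistic (Elo-style) curve. This curve is an assumption."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def simulate_once(draw, ratings, rng):
    """Play one single-elimination bracket and return the rounds won by each player."""
    rounds_won = {player: 0 for player in draw}
    alive = list(draw)
    while len(alive) > 1:
        next_round = []
        # Adjacent players in the draw order meet each other.
        for a, b in zip(alive[0::2], alive[1::2]):
            winner = a if rng.random() < win_prob(ratings[a], ratings[b]) else b
            rounds_won[winner] += 1
            next_round.append(winner)
        alive = next_round
    return rounds_won

def stage_odds(draw, ratings, n_sims=50_000, seed=0):
    """For each player, the fraction of simulations in which they win at least k rounds."""
    rng = random.Random(seed)
    n_rounds = len(draw).bit_length() - 1  # assumes len(draw) is a power of two
    counts = {player: [0] * (n_rounds + 1) for player in draw}
    for _ in range(n_sims):
        for player, won in simulate_once(draw, ratings, rng).items():
            for k in range(1, won + 1):
                counts[player][k] += 1
    return {player: [c / n_sims for c in counts[player][1:]] for player in draw}

# Hypothetical four-player draw with made-up ratings.
draw = ["A", "B", "C", "D"]
ratings = {"A": 2000.0, "B": 1500.0, "C": 1800.0, "D": 1800.0}
for player, odds in stage_odds(draw, ratings).items():
    print(player, [round(p, 3) for p in odds])
```

Each list holds the odds of winning at least one round, at least two rounds, and so on, so the last entry is the title probability. A real model would replace `win_prob` with whatever matchup probabilities it produces.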

1st Round Betting Recommendations

We also sought to determine the matches that would yield the largest expected gains, according to odds reported from oddportals.com. By converting our probabilities to odds, we determined expected gains/losses by multiplying the difference between the OddPortal odds and the DataBucket odds by the probability of winning the bet. The results can be seen below:

Interestingly, betting on Marco Cecchinato to win his match against the soon-to-be-retiring Mardy Fish would most likely yield the largest gains, as betting companies have made Fish the favorite in this match. While Cecchinato is ranked outside the top 100, he is definitely the more physically fit player at the moment, and will more likely come out the victor in five-set Grand Slam conditions.

A general trend in these results suggests that betting odds for matches featuring the top players tend to be more aligned with DataBucket odds, while odds for matches between lower-ranked players tend to be less aligned. This may be due to supply and demand factors relating to the betting market. Other attractive bets include Lucas Pouille defeating Evgeny Donskoy and Marsel Ilhan defeating Radek Stepanek.
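The expected-gain calculation described in this section can be sketched in a few lines of Python. The sketch assumes decimal odds (the post does not say which odds format was used), and the player names, probabilities and bookmaker odds below are made up purely for illustration.

```python
def fair_decimal_odds(p_win):
    """Decimal odds implied by a model win probability (assumes decimal odds)."""
    return 1.0 / p_win

def expected_gain(p_win, book_odds):
    """Expected profit per unit staked.

    (bookmaker odds - model odds) * P(win), which simplifies to
    p_win * book_odds - 1 when the odds are decimal.
    """
    return p_win * (book_odds - fair_decimal_odds(p_win))

# Hypothetical bets: (label, model win probability, bookmaker decimal odds).
bets = [
    ("Player A", 0.60, 2.10),
    ("Player B", 0.80, 1.20),
    ("Player C", 0.30, 3.00),
]

# Rank bets from most to least attractive.
ranked = sorted(bets, key=lambda bet: expected_gain(bet[1], bet[2]), reverse=True)
for label, p_win, book_odds in ranked:
    print(f"{label}: {expected_gain(p_win, book_odds):+.2f} per unit staked")
```

A positive value means the bookmaker is offering longer odds than the model thinks are fair, which is what makes a bet attractive under this measure.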

DataBucket will be updating US Open predictions throughout the tournament. Stay tuned for updates.

Sunday, August 2, 2015

How Much Five Guys Costs in Different Parts of New York (and the US)

If you have travelled to different parts of the US, you have probably noticed that restaurants charge different prices in different places. A cappuccino at a Starbucks in Manhattan will most likely be much more expensive than the same cup of coffee in Nebraska. Similarly, a double cheeseburger at a San Francisco McDonald's will be more expensive than the same burger in Colorado.

With that, we wondered - what if we could map out the prices of a large restaurant chain for the entire country?

Obviously, such data is very difficult to find on the web - if the prices of different stores were easily accessible, then some stores would become far more popular than others. Fortunately, Five Guys has an online ordering system that allows customers to pick up their food without having to wait in line, so the prices of their products for all of their stores are available online. With that, we mapped out the prices of various food items on the Five Guys menu for restaurants within 50 miles of the 50 largest cities in America. Check out our interactive graphic below:

(Note: To look at specific areas of the United States, click the search icon on the top left and type in a state or city. Or use the "+" and "-" buttons to customize your view)

Notice that the cities known to be more expensive (think Seattle, San Francisco, Chicago and New York) are where the priciest Five Guys are located. It is also interesting to see that within most cities, prices are largely the same (take a look at Las Vegas, Denver, and the large Texas cities).

The most interesting region by far, however, has to be the New York area. A closer look reveals a clear-cut price segmentation strategy. Manhattan is by far the most expensive out of all the boroughs. Stores right across the Hudson are significantly cheaper, with bacon burgers over a dollar cheaper than in Manhattan. A similar phenomenon can be seen across the East River - Queens and Brooklyn restaurants offer bacon burgers that cost over 50 cents less.

We recognize that geographic price segmentation is hard to overcome - if you live in Manhattan and have no other reason to go to Brooklyn, it's not worth going just to save a few cents. Nevertheless, it's interesting to quantify the magnitude of the Manhattan premium.
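For anyone who wants to reproduce the "within 50 miles of a city" filter used for the map, here is one way it could be done in Python with the haversine great-circle distance. We make no claim that this is how the original graphic was built; the store records and coordinates below are hypothetical placeholders.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def stores_near(stores, city_lat, city_lon, radius_miles=50.0):
    """Keep only the stores within radius_miles of the city center."""
    return [
        store for store in stores
        if haversine_miles(store["lat"], store["lon"], city_lat, city_lon) <= radius_miles
    ]

# Hypothetical store records (names and coordinates are placeholders).
stores = [
    {"name": "Midtown store", "lat": 40.7580, "lon": -73.9855},
    {"name": "Philadelphia store", "lat": 39.9526, "lon": -75.1652},
]

# Approximate coordinates for New York City.
nearby = stores_near(stores, 40.7128, -74.0060)
print([store["name"] for store in nearby])
```

Running the filter once per city and attaching each store's menu prices would give the kind of table a map like this needs.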

Saturday, August 1, 2015

Which School Produces the Most Successful Startup Founders?


With startups becoming a larger and larger segment of the American economy, we looked to answer a question we figured would be relevant to a lot of people: what makes a successful startup?

There are many factors that can define "success" in entrepreneurship; we use money, namely venture funding, as our metric for success. We know that there are many philosophical arguments against this, but we use it because it can be quantified. We've already looked at which industries and states generate the most venture funding.

This week, we look at the founders themselves. What characteristics do founders of richly funded startups have in common? What can people looking to found a startup do to optimize their chance of success?

The data we work with this week comes from the Crunchbase API. We look at the 5,000 best-funded startups of the past 15 years and the education of each of their founders.

Looking at the sheer number of founders per school among these 5,000 best-funded startups, we see:
  1. West coast schools are highly represented. Stanford beats its runner up, MIT, by 43%. Berkeley is also 4th most represented. UCLA, USC, and CalTech closely follow.
  2. Ivy League schools are featured prominently in the top 15 schools, including Harvard, Cornell, Yale, UPenn, Columbia, and Princeton. This indicates that the prestige of the school may have an impact upon attracting venture funding, or that the rigorous level of education may produce capable startup founders.
  3. Business schools (Harvard, Wharton, Stanford) and engineering-focused schools (MIT, CalTech) both feature prominently. There is no huge divide between the "egg-heads" and "talkers" - both are valuable to the startup industry.
  4. There is a large number of international universities on the list, including a noticeable number of graduates from Tel Aviv University, which is 7th on the list. This is attributed to its entrepreneurial culture.
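The tally behind a list like this is a simple aggregation. Below is a minimal Python sketch, assuming the data has already been pulled into records of startup, funding and founder schools. The records are made-up placeholders and no Crunchbase API call is shown. Crediting each founder's school with the startup's full funding is our own assumption, since the post does not say how funding was attributed.

```python
from collections import Counter, defaultdict

# Hypothetical records: (startup, total funding in USD, schools of its founders).
startups = [
    ("Startup A", 50_000_000, ["School X", "School Y"]),
    ("Startup B", 20_000_000, ["School X"]),
    ("Startup C", 80_000_000, ["School Z", "School X"]),
]

def founders_per_school(records):
    """Count how many founders each school contributed."""
    counts = Counter()
    for _, _, schools in records:
        counts.update(schools)
    return counts

def average_funding_per_school(records):
    """Average startup funding per founder, grouped by school (assumed attribution rule)."""
    totals = defaultdict(float)
    founders = Counter()
    for _, funding, schools in records:
        for school in schools:
            totals[school] += funding
            founders[school] += 1
    return {school: totals[school] / founders[school] for school in totals}

print(founders_per_school(startups).most_common())
print(average_funding_per_school(startups))
```

Sorting the first result gives a ranking of schools by founder count, while the second gives the per-school averages that can be compared across schools.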

We also looked at the average number of college degrees attained for each startup versus its amount of funding received. There did not appear to be a strong correlation between earning more degrees and the amount of funding received - the amount of funding looked relatively consistent across different numbers of degrees received on average.

In terms of the average amount of funding that graduates from each school get, Harvard, MIT, and Stanford get a standard amount of funding. Indian Institute of Technology has a disproportionately high average funding as well as a large number of founders. Hangzhou Normal University and Zhejiang University of Technology are off the charts for average funding received. This is attributed completely to Jack Ma and Eddie Wu, founders of Alibaba.

So what's the secret?

To best prepare for founding a successful startup, it appears that founders get a boost from graduating from prestigious universities, west coast universities, engineering-specialized universities, and/or business schools. The number of degrees does not really matter, and there are always outliers from relatively unknown colleges who receive a spectacular amount of funding.

Saturday, July 11, 2015

How DataBucket's Wimbledon Model Can Be Better

This year's Wimbledon tournament culminates tomorrow in a final blockbuster showdown between Novak Djokovic and Roger Federer. Over the past two weeks, we have developed a probability model that determines the odds of each player reaching each stage of the tournament (these odds were updated after every round). Now with two competitors remaining, our model claims that Novak Djokovic has a 63.8% chance of beating Roger Federer.

But should you trust our model's results?

Upon inspecting my model's assumptions, I found areas where our probability model can be improved.

Match Scores are Not Always an Accurate Indicator of Current Form

Here's one of the key concerns: while our model accounts for how close a match was (e.g. winners in straight sets are rewarded more than winners in 5 sets), match scores are not always a true indicator of form or current ability. Roger Federer was rewarded significantly for beating Andy Murray in straight sets, especially since he only had a 55% chance of winning, but in my opinion he should be rewarded more. Federer faced only one break point in the entire match (and that was in the first game). He hit 20 aces, had over 5 winners for every unforced error he made, and won over 80% of the time when serving at 30-30 or deuce. Legends of the game proclaimed it one of the best serving performances ever witnessed. Even Federer acknowledged this match as "definitely one of the best matches I've played in my career."

Likewise, Andy Murray should not be penalized as heavily for losing this match in straight sets. He hit over 2 winners for every unforced error he made (the average for the tournament was 1.5). He served a respectable 12 aces compared to only 1 double fault. And he managed to stay with Federer to the end of each set, only for his opponent to step up a gear. This was not a demoralizing defeat for Murray, but rather a performance many would call a valiant effort. As Sports Illustrated put it in their live blog, "So good. Too good. Too, too good from Roger Federer." My model can be improved by incorporating some of these detailed match statistics, but how much they should influence these probabilities is very much up for debate.
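To make the idea of rewarding upsets and lopsided scorelines concrete, here is a small Python sketch of an Elo-style rating update with a margin-of-victory multiplier. This is a stand-in technique, not the model's actual update rule (which is not spelled out in this post); the multiplier values and the `k` factor are hypothetical.

```python
def expected_score(rating_a, rating_b):
    """Pre-match chance that A beats B under a standard Elo curve (assumed here)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def margin_multiplier(sets_won, sets_lost):
    """Hypothetical bonus for one-sided wins: 3-0 gives 1.5, 3-1 gives 1.25, 3-2 gives 1.0."""
    return 1.0 + 0.25 * (sets_won - sets_lost - 1)

def update(winner_rating, loser_rating, sets_won, sets_lost, k=32.0):
    """Return the new (winner, loser) ratings after a match."""
    # The less likely the win was, the bigger the reward.
    surprise = 1.0 - expected_score(winner_rating, loser_rating)
    delta = k * margin_multiplier(sets_won, sets_lost) * surprise
    return winner_rating + delta, loser_rating - delta

# Same evenly matched players, different scorelines.
print(update(1500.0, 1500.0, 3, 0))  # straight-sets win
print(update(1500.0, 1500.0, 3, 2))  # five-set win
```

Detailed match statistics such as break points faced or winners per unforced error could enter as extra terms in `margin_multiplier`, though how much weight to give them is exactly the open question raised above.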

A Career-Defining Win Can Go Many Ways

We all would probably agree that this match is one of the highlights of Roger Federer's already illustrious career. We can classify this match as one of his career-defining matches. But such a strong performance from Federer can go either way. He may gain plenty of momentum from this performance and play superbly against Djokovic in the final. Or he may have expended too much energy, suffer from mental or physical fatigue, and fall to the steady Serb. This is what our model also lacks - the ability to capture a player's reaction to a career-defining win. Will a player succumb to the pressure of following up the match of his life, like Lukas Rosol after beating Nadal at Wimbledon in 2012 or Kei Nishikori after beating Djokovic at the 2014 US Open? Or will a player rise to the occasion, gain confidence and play at a much higher level after a career-changing win, like Robin Soderling at the 2009 French Open, or Stanislas Wawrinka at the 2014 Australian Open?

For these reasons, while listing Djokovic as a 63.8% favorite seems reasonable to most, there are simply many factors in tennis that are difficult to quantify. DataBucket will continue to try to incorporate as many of these factors as possible, especially as the US Open is just around the corner.

Tuesday, July 7, 2015

Wimbledon QF Preview: Top 4 Seeds Likely to Advance

We are now down to the quarterfinals of Wimbledon, and the top four seeds are still going strong. In fact, according to my probability model, the chances of Djokovic, Federer, Murray and Wawrinka making the semifinals are very high. In my previous 4th round post, each of these players had at least a 67% chance of making it to the final four. Now each player has at least a 78% chance.

With these results come a few interesting observations:

Djokovic's Chances of Winning have Decreased: Before the start of Wimbledon, Djokovic had a 56% chance of winning - this has decreased to 52%. As my model accounts for each player's margin of victory, Djokovic's close five-set encounter against Kevin Anderson actually hurt his chances of winning the tournament.

Wawrinka's Title Prospects Continue to Rise: Before the tournament, my model predicted Stan the Man's winning odds to be at 6.2%. Now they have climbed to 9.1%, as he has breezed through the early rounds without dropping a set. We had predicted Wawrinka to have only a 66% chance of reaching the QF. Now that he has reached this stage, he will only become a more dangerous threat.

Federer is More Likely to Win Wimbledon Than Murray: I had argued this point in my previous post, but Federer's performances at Wimbledon have continued to put him in front of Murray (Bet365 apparently disagrees). Federer easily handled Bautista Agut on Monday, while Murray fought through a tough four-set encounter against big-serving Ivo Karlovic. Murray also has not beaten Federer or Djokovic since 2013.

Gasquet's Title Odds are Overrated (It's not 2%): Sure, he has one of the prettiest backhands in the world. Sure, he beat Dimitrov and Kyrgios (who was overrated anyway). But he has to beat Wawrinka, Djokovic and Federer along the way and these are clearly very tough hurdles to overcome.

To see how my model works, take a look at my original post on predicting the Wimbledon tournament. Any comments are welcome!

Sunday, July 5, 2015

Djokovic Still the Favorite After the 1st Week of Play

Wimbledon has dwindled down to its final 16 contenders. While some of the top seeds have fallen (think Raonic, Nishikori, and Nadal), Djokovic, Federer, Murray and Wawrinka are still in contention.

Last week, I presented my model that predicted the odds of each player reaching different rounds of the tournament. After three rounds of play, the title odds of the contenders haven’t changed significantly (Djokovic at 57%, Federer at 18%, Murray at 13%, and Wawrinka at 8%). That said, the chance of these players reaching the semifinals has increased dramatically. Mark your calendars for some mouthwatering clashes on Thursday and Friday, as the top 4 seeds all have at least a 67% chance of making it to the final four.

Comparing our results with betting odds, we claim that Bet365 appears to have overestimated the chance that the underdogs will win the tournament. They may be doing this on purpose to hedge against the risk of paying out huge multiples, or they may be wary of the fact that three of the past six slams have been won by a player outside the Big Four. But giving Nick Kyrgios a 1 in 29 chance of winning is far too optimistic given that he will have to beat Djokovic, Wawrinka and Federer/Murray to win the tournament.

Another overly optimistic call the betting companies have made is giving Murray a 1 in 3 chance of winning the tournament. Yes, many people are probably betting on Murray. Yes, many people think home-court advantage matters. But keep in mind that Murray is 2-6 in Grand Slam finals and has lost to Federer or Djokovic 11 times in a row. There are also lapses in concentration, such as in the third set against Andreas Seppi, that are worthy of concern and may come back to haunt him when he plays Federer or Djokovic.

Thoughts on my model? Thoughts on Wimbledon in general? Feel free to comment below. Check out the posting on the first week of Wimbledon here.