Wednesday, April 27, 2016

Modeling Taxi Pickups in New York City

I recently learned about Bayesian models from online data science lectures, so I thought: what better way to reinforce the concepts than to apply them to a real-world dataset? It also happens that I have always been fascinated by transportation dynamics and by how complex systems such as subway networks, freeways, buses, and taxis work. Since New York City makes data on all yellow cab rides publicly accessible, I thought it would be cool to model taxi pickups in Manhattan over the course of a week. In particular, it would be even more interesting to break down taxi pickups by region of Manhattan to see how they differ.

Choosing the Model

To model taxi pickups by region and time of week, we need some level of flexibility (i.e. allowing the distribution of pickups to differ across regions), but also some level of structure (i.e. not allowing the distribution of pickups to differ too much for a particular time of day, such as the weekday rush hours).

The two modeling extremes are the pooled and the unpooled methods. The pooled method assumes that taxi pickups across all regions and times follow the same distribution. In other words, the model consists of a single set of parameters that governs the entire city at all times. Clearly, this is not optimal, as it does not allow enough flexibility. On the other hand, the unpooled method assumes that taxi pickups for each region and time have their own individual distribution. This is not optimal either, because it does nothing to encourage certain times of the day to have somewhat similar pickup behavior.
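
As a minimal sketch of these two extremes, the snippet below estimates Poisson rates both ways on a handful of made-up counts. The region names, hours, and numbers are hypothetical and are not taken from the taxi data.

```python
from statistics import mean

# Hypothetical toy data: pickup counts observed for each (region, hour) cell.
counts = {
    ("midtown", 8): [120, 135, 128],
    ("midtown", 3): [15, 22, 18],
    ("harlem", 8): [30, 26, 34],
    ("harlem", 3): [4, 6, 5],
}

# Pooled: a single Poisson rate for every region and hour.
# The maximum-likelihood estimate is the overall mean count.
all_counts = [x for xs in counts.values() for x in xs]
pooled_rate = mean(all_counts)

# Unpooled: an independent Poisson rate per (region, hour) cell.
# Each estimate uses only that cell's own observations.
unpooled_rates = {cell: mean(xs) for cell, xs in counts.items()}

print(f"pooled rate: {pooled_rate:.2f}")
for cell, rate in unpooled_rates.items():
    print(cell, f"{rate:.2f}")
```

The pooled estimate hides the large gap between busy and quiet cells, while each unpooled estimate rests on only three observations.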

A happy medium between these extremes is the partial pooling method, which can be modeled in a hierarchical fashion. In this approach, taxi pickups across different regions and times have different distributions, but the parameters of those distributions are drawn from a common distribution. This allows behavioral differences, but also enforces a more rigid structure with regard to how the distributions can differ.
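
To illustrate the effect of partial pooling, here is a small sketch that assumes a Poisson rate with a fixed Gamma(shape, rate) prior shared by all cells. In a full hierarchical model the shared prior is itself learned from the data; here its values are hypothetical and fixed to keep the example short. The helper name `partial_pool_rate` is made up for this illustration.

```python
def partial_pool_rate(xs, shape, rate):
    """Posterior mean of a Poisson rate under a Gamma(shape, rate) prior.

    By Gamma-Poisson conjugacy the posterior is
    Gamma(shape + sum(xs), rate + len(xs)), whose mean is returned.
    """
    return (shape + sum(xs)) / (rate + len(xs))

# Hypothetical shared prior with mean shape / rate = 40 pickups per hour.
shape, rate = 4.0, 0.1

busy = [120, 135, 128]  # a cell with three observations
sparse = [120]          # a cell with a single observation

print(partial_pool_rate(busy, shape, rate))
print(partial_pool_rate(sparse, shape, rate))
```

Both estimates are pulled from the raw cell means toward the shared prior mean, and the cell with fewer observations is pulled further.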

In choosing the hierarchical structure, we notice that the taxi pickup distributions are very similar from one weekday to another and from one weekend day to another, but that weekdays and weekends differ from each other. Weekdays appear to have two peak periods, while weekends seem to have only one.

Weekday Taxi Pickup Distributions (Monday and Thursday) are Similar

Weekend Taxi Pickup Distributions (Saturday and Sunday) are Similar, but Different from Weekdays

With this observation, we formulate the Bayesian hierarchical model as follows:

Here, taxi pickups (represented as X) across times and regions (indexed by h and c respectively) follow different Poisson distributions, where the parameters for each hour of the week come from a common Gamma distribution (as shown by the beta parameter). The beta parameters are in turn grouped into common distributions to reflect similar weekday and weekend behavior. We use the Poisson distribution for taxi pickup counts because it is one of the more frequently used discrete distributions for count data, and we use Gamma distributions for the parameters because they simplify our calculations, as explained in the following section.

Calibrating the Hierarchical Model

Using the 2014 yellow cab data logs, we can calibrate the parameters in the hierarchical model. The closed form of the parameter distributions given the data is difficult to compute; however, the Gibbs Sampler allows us to simulate these distributions as long as we can compute the posterior conditionals. The posterior conditionals describe the distribution of each parameter given the values of all the other parameters. By sequentially updating each parameter from its posterior conditional, we simulate parameter values whose distribution converges to the true posterior distribution.

Because all of our parameters have Gamma distributions, we can derive that the posterior conditionals are also Gamma distributions, which makes it easier for us to simulate parameter values.
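
To show why the conditionals stay in the Gamma family, here is the derivation for a simplified two-level stand-in, not the full model of this post. The assumptions are: counts x_{ij} for group i are Poisson with rate \lambda_i, each \lambda_i has a Gamma prior with shape a and rate \beta, and \beta has a Gamma prior with shape p and rate q. The symbols a, p, and q are reused here for illustration only, and k and n_i denote the number of groups and the number of observations in group i.

```latex
\begin{aligned}
x_{ij} \mid \lambda_i &\sim \mathrm{Poisson}(\lambda_i), \quad j = 1, \dots, n_i, \\
\lambda_i \mid \beta &\sim \mathrm{Gamma}(a, \beta), \quad i = 1, \dots, k, \\
\beta &\sim \mathrm{Gamma}(p, q). \\[6pt]
p(\lambda_i \mid \beta, x)
  &\propto \lambda_i^{a-1} e^{-\beta \lambda_i}
     \prod_{j=1}^{n_i} \lambda_i^{x_{ij}} e^{-\lambda_i}
   = \lambda_i^{a + \sum_j x_{ij} - 1} e^{-(\beta + n_i)\lambda_i} \\
  &\Rightarrow \lambda_i \mid \beta, x \sim
     \mathrm{Gamma}\Big(a + \sum_{j} x_{ij},\; \beta + n_i\Big), \\[6pt]
p(\beta \mid \lambda)
  &\propto \beta^{p-1} e^{-q\beta}
     \prod_{i=1}^{k} \beta^{a} e^{-\beta \lambda_i}
   = \beta^{p + ka - 1} e^{-(q + \sum_i \lambda_i)\beta} \\
  &\Rightarrow \beta \mid \lambda \sim
     \mathrm{Gamma}\Big(p + ka,\; q + \sum_{i} \lambda_i\Big).
\end{aligned}
```

Each conditional is a Gamma distribution with updated shape and rate, so every Gibbs step reduces to a single Gamma draw.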

To test our derivations, we simulated data from parameters that we chose ourselves, and checked whether the Gibbs Sampler produces distributions that are close to the true values of those parameters. In the graphs below, the vertical lines represent the true values of the parameters we chose, and the bars represent the histograms of the simulated parameter values. The simulated values cluster closely around the true values, so we are confident that our derivations are correct.
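
The same kind of check can be sketched in a few lines of Python. This is not the code used for this project: it assumes a simplified two-level model (counts are Poisson with rate lam_i, lam_i is Gamma with shape a and rate beta, beta is Gamma with shape p and rate q), and the function names, true rates, and hyperparameter values are all hypothetical.

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) value with Knuth's method (fine for small rates)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def gibbs(data, a, p, q, n_iter, rng):
    """Gibbs sampler for the simplified Gamma-Poisson hierarchy described above."""
    k = len(data)
    beta = p / q  # start beta at its prior mean
    lam_draws, beta_draws = [], []
    for _ in range(n_iter):
        # random.gammavariate takes (shape, scale), so scale = 1 / rate.
        lams = [rng.gammavariate(a + sum(xs), 1.0 / (beta + len(xs)))
                for xs in data]
        beta = rng.gammavariate(p + k * a, 1.0 / (q + sum(lams)))
        lam_draws.append(lams)
        beta_draws.append(beta)
    return lam_draws, beta_draws

rng = random.Random(0)
true_lams = [3.0, 8.0, 15.0]  # parameters we designate ourselves
# 52 simulated observations per group, mimicking one year of weekly data.
data = [[sample_poisson(lam, rng) for _ in range(52)] for lam in true_lams]

lam_draws, beta_draws = gibbs(data, a=2.0, p=1.0, q=1.0, n_iter=2000, rng=rng)
kept = lam_draws[500:]  # discard the first 500 draws as burn-in
post_means = [sum(d[i] for d in kept) / len(kept)
              for i in range(len(true_lams))]
for truth, est in zip(true_lams, post_means):
    print(f"true {truth:5.1f}  posterior mean {est:5.2f}")
```

If the conditionals are derived correctly, the posterior means should land close to the designated rates, which is the behavior the histograms are meant to confirm.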

With the Gibbs Sampler constructed, we used actual yellow cab data to calibrate our model.


To summarize the results from my model, I have created three interactive graphics. (Note: I wasn't able to make the interactive features work on Blogger. Clicking on a screenshot will lead to an HTML file that contains the interactive graphic.)

The first graphic visualizes the mean number of hourly taxi pickups for a particular Manhattan census tract over the course of the week. This allows us to compare weekday and weekend behavior and identify peak and off-peak hours.

The second graphic visualizes the mean number of hourly taxi pickups across the city at a given hour. This allows us to compare cab activity across various regions of Manhattan.

The third graphic visualizes the distribution of hourly taxi pickups at different segments of the day for a given census tract. This gives more information than the mean value alone, so that readers get a better sense of how cab activity can vary.


Not all projects are perfect - here is a list of things I could definitely improve in this analysis if I had more time.
  • Gathering a year's worth of cab data only yields 52 weeks of data, which means only 52 data points for each hour of the week. With more computing power and more time, I would like to see whether some of the results would change if I also used the 2013 and 2015 data.
  • Looking at the model, the a's, alphas, p's and q's need to be calibrated before the Gibbs Sampler can be run. I set these values such that the expected mean of taxi pickups would equal the mean of the data points by region and hour. This may not be the best way of initializing parameters, and if I had more time I would dig into this a bit deeper.
  • I would spend time investigating different hierarchical models, such as grouping parameters by census tract rather than by hour, and observe how the results differ.
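
As a rough illustration of the mean-matching idea in the second bullet, the sketch below picks a Gamma prior whose mean equals the sample mean of one cell. It assumes a single Gamma(shape, rate) prior on a Poisson rate rather than the full set of a's, alphas, p's and q's, and the function name and default shape are hypothetical.

```python
from statistics import mean

def match_prior_mean(xs, shape=2.0):
    """Return (shape, rate) of a Gamma prior whose mean equals mean(xs).

    A Gamma(shape, rate) prior has mean shape / rate, and the expected
    Poisson count under that prior is the same value.
    """
    return shape, shape / mean(xs)

# Hypothetical counts for one region and hour of the week.
shape, rate = match_prior_mean([10, 20, 30])
print(shape, rate, shape / rate)
```

The shape is left as a free choice here; a smaller shape gives a more diffuse prior around the same mean.
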
To see more mathematical details of my project, click here for my detailed writeup and my GitHub for the code.