Saturday, January 30, 2016

Exploring Chicago Crime and Housing Price Data

by Harold Li

Note from the bloggers: Barbara and I started DataBucket last year as a platform for us to explore our interest in data science. Our goal was to answer any questions that interested us - at first, it was sports analytics, then it was about food safety and restaurants, and then it was about startups and social media, and we also touched on the financial markets.

However, upon reviewing and reflecting on our DataBucket content in 2015, we realized that we were too focused on publishing popular content, rather than delving into projects that would enhance our data science toolkit.

Thus, we will be publishing less in 2016, and will be focusing on longer-term projects. Our blog posts will be more explanatory rather than trying to prove a point, as we want to document our progress throughout the year.

This post on Chicago crime rates and housing price data is the first of a series of projects we plan to tackle in 2016. Please feel free to give feedback or provide any suggestions on further analysis.

I've always been interested in learning more about the relationship between crime rates and housing prices. When I was growing up, I would always hear my mother talking to other aunties, saying something like, "Yeah we moved here because it was a safer neighborhood. It's more expensive though but it's worth it." When I moved into the city, my peers would always compare rent to how safe the area is. They would say, "Yeah, it's cheaper there, but the area is kinda sketchy."

Thus, I wanted to take a deeper dive into exploring the relationship between living costs and crime. To explore this relationship, I needed to find robust datasets for both metrics. The Zillow API had some promising housing data, but the Trulia API provided weekly averages and metrics of listed housing prices in each neighborhood. In terms of crime rates, New York City did not have detailed data on crimes. However, Chicago's open data portal has a massive dataset detailing all crimes that have occurred in the 21st century. With these datasets, I decided to focus on Chicago and to answer three questions about the city:

1) How do we best quantify the relationship between crime rates and housing prices in Chicago?

2) Do increases or decreases in crime activity affect housing prices in Chicago?

3) How do crime rates in Chicago vary at different times of the week?

Cleaning the Data

We had to clean the datasets quite extensively, as the Trulia API had a list of neighborhoods of Chicago it had prices for, while the crime dataset listed only the location of each crime in longitude and latitude. Ultimately, we used Craig M. Booth's GitHub code to map these locations to a list of neighborhoods as defined by the Chicago open data portal. Unfortunately, this neighborhood list is different from the Trulia neighborhood list, so we mapped Trulia neighborhoods to the Chicago portal's list. See my IPython Notebook on my GitHub for implementation details.
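To give a feel for what mapping a longitude/latitude pair to a neighborhood involves, here is a minimal sketch of a point-in-polygon lookup using ray casting. This is not Craig M. Booth's code: the function names and the two rectangular "neighborhoods" are made up for illustration, and real boundaries would come from the city's boundary files.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test. polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed
        # by a horizontal ray heading east from the point.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside  # each crossing flips inside/outside
    return inside


def assign_neighborhood(lon, lat, boundaries):
    """Return the first neighborhood whose polygon contains the point."""
    for name, polygon in boundaries.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None


# Hypothetical rectangular boundaries, for illustration only.
boundaries = {
    "West Box": [(-87.70, 41.85), (-87.65, 41.85), (-87.65, 41.90), (-87.70, 41.90)],
    "East Box": [(-87.65, 41.85), (-87.60, 41.85), (-87.60, 41.90), (-87.65, 41.90)],
}

print(assign_neighborhood(-87.62, 41.88, boundaries))
```

With millions of crime records, a spatial index or a geospatial library would likely be a better fit than this plain loop.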

Part 1: Relationship between Crime Rates and Housing Prices in Chicago

With clean data in hand, we were able to count the number of crimes that occurred from 2012 to 2015 in each neighborhood and divide by its population to get a crime rate (i.e. the number of crimes per resident over those 4 years). We also averaged all the price listings for each neighborhood over the same period to get an average housing price for each neighborhood. The following graph plots that data, with the size of the data points representing the average number of house listings per neighborhood:

The best fit to this relationship was in fact a lasso regression, a form of regularized regression that penalizes the size of the coefficients, pushing some of them to zero and so limiting the number of explanatory variables used, in order to reduce overfitting. The best predictors for crime rates ended up being the inverse of price (1/Price) and the inverse of the squared price (1/Price^2). As a result, we conclude that there is a fairly strong inverse relationship between housing prices and crime rates.
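For readers curious how a lasso fit on 1/Price and 1/Price^2 works mechanically, here is a small pure-Python sketch using coordinate descent with soft-thresholding. It is an illustration under my own assumptions, not the notebook's code: the prices and crime rates are hypothetical, the penalty `alpha` is arbitrary, and in practice a library implementation would be the more sensible choice.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; small values become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def standardize(column):
    """Rescale a column to mean 0 and (population) standard deviation 1."""
    mean = sum(column) / len(column)
    std = (sum((v - mean) ** 2 for v in column) / len(column)) ** 0.5
    return [(v - mean) / std for v in column]


def lasso(columns, y, alpha, n_iter=200):
    """Coordinate descent for (1/2n)*sum((y - Xw)^2) + alpha*sum(|w|).

    Assumes every column is standardized and y is centered.
    """
    n = len(y)
    weights = [0.0] * len(columns)
    for _ in range(n_iter):
        for j, column in enumerate(columns):
            rho = 0.0
            for i in range(n):
                # Prediction from every feature except j.
                others = sum(
                    w * c[i]
                    for k, (w, c) in enumerate(zip(weights, columns))
                    if k != j
                )
                rho += column[i] * (y[i] - others)
            weights[j] = soft_threshold(rho / n, alpha)
    return weights


# Hypothetical data: average price (in $1000s) and crime rate per neighborhood.
prices = [120, 180, 250, 320, 400, 550]
crime_rates = [0.95, 0.70, 0.48, 0.41, 0.30, 0.26]

features = [
    standardize([1 / p for p in prices]),       # 1/Price
    standardize([1 / p ** 2 for p in prices]),  # 1/Price^2
]
mean_rate = sum(crime_rates) / len(crime_rates)
centered = [r - mean_rate for r in crime_rates]

weights = lasso(features, centered, alpha=0.01)
print(weights)
```

A larger `alpha` forces more coefficients to exactly zero, which is how lasso ends up selecting a subset of the candidate predictors.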

The clear outlier to this relationship is the Loop neighborhood, Chicago's core business center. It has the 3rd highest housing prices, but the 6th highest crime rate. Why is it such an anomaly? The best answer may lie in our use of population to calculate the crime rate. The Loop's official population is around 30,000, but far more people commute to this area every day. As the crime rate is the ratio between the number of crimes and the residential population, the Loop's crime rate is thus skewed upwards.
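To make the skew concrete, here is a tiny sketch of the crime rate calculation with two different denominators. The crime count and the daytime head count are hypothetical numbers chosen for illustration; only the roughly 30,000 resident figure comes from the text above.

```python
def crime_rate(crime_count, population):
    """Crimes per person over the period covered by crime_count."""
    return crime_count / population


# Hypothetical crime count for a business district over the period.
crimes = 30_000
residents = 30_000        # roughly the Loop's official population
daytime_people = 300_000  # hypothetical count including commuters

rate_by_residents = crime_rate(crimes, residents)
rate_by_daytime = crime_rate(crimes, daytime_people)

print(rate_by_residents, rate_by_daytime)
```

The same number of crimes looks ten times worse when divided by residents alone, which is the effect described for the Loop.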

Part 2: Do Changes in Crime Activity Affect Housing Prices?

With historical data on crimes and housing prices dating back to 2012, we can trace how crime activity has changed in each neighborhood over time and whether we've seen correlated changes in housing prices as well. I would imagine that macroeconomic factors such as inflation, supply and demand, and employment figures would play a larger role, but I would also argue that crime rates may be a good indicator of the housing market.

Some neighborhoods have higher housing prices or higher crime rates, so their changes would appear greater. To compare apples to apples, a neighborhood's crime and price changes over a period of time are measured relative to the neighborhood's median change over the last 4 years (2012-2015).
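One possible reading of this normalization, sketched below, is to compute each neighborhood's 4-week percentage changes and divide them by that neighborhood's median absolute change. This is an assumption on my part about the exact formula; the function names and the weekly series are hypothetical.

```python
from statistics import median


def percent_changes(series, lag=4):
    """Percentage change between each value and the value `lag` steps earlier."""
    return [
        (series[i] - series[i - lag]) / series[i - lag] * 100
        for i in range(lag, len(series))
    ]


def relative_changes(series, lag=4):
    """Scale changes by the series' own median absolute change.

    Assumes the median absolute change is not zero.
    """
    changes = percent_changes(series, lag)
    scale = median(abs(c) for c in changes)
    return [c / scale for c in changes]


# Hypothetical weekly crime counts for one neighborhood.
weekly_crimes = [100, 100, 100, 100, 110, 120, 90, 100]
print(relative_changes(weekly_crimes))
```

After this scaling, a value of 2 means "twice this neighborhood's typical 4-week swing", regardless of how large the raw numbers are.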

(Note: We also decided to remove Fuller Park data from this part of the analysis, as the community's average housing price is less than $100k and it has fewer than 10 houses listed every week. We feel that the sample size is too small in this case, and it has resulted in data on a far different scale from all of the other neighborhoods.)

This is the result we get by plotting 4-week changes in housing prices and crime rates from 2012-2015 - a widely scattered plot showcasing no relationship at all:

The regression line fit to this data in fact showed that a 1% change in crime activity would be associated with a -0.0027% change in housing prices. In other words, if crime activity doubles (a 100% increase), we should expect house prices to change by a mere -0.27%. That doesn't say much at all, so we conclude that changes in crime rates are not a good predictor of house price fluctuations.
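As a quick check of the arithmetic, here is a minimal sketch of a least-squares slope and the extrapolation to a doubling of crime. The `ols_slope` helper is a generic textbook formula, not the notebook's code; only the -0.0027 slope comes from the analysis above.

```python
def ols_slope(x, y):
    """Slope of the least-squares line through the points (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance = sum((a - mean_x) ** 2 for a in x)
    return covariance / variance


# Slope reported above: % change in price per 1% change in crime.
slope = -0.0027

# Doubling crime is a +100% change, so the linear fit extrapolates to:
price_change_if_crime_doubles = slope * 100
print(price_change_if_crime_doubles)
```

Extrapolating a slope fitted on small 4-week changes all the way to a doubling is a stretch, so the -0.27% figure is best read as "effectively nothing".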

Part 3: Crime Activity at Different Times of the Week

Other than comparing crime activity with housing prices, I thought it would be useful to leverage our clean data to visualize Chicago crime rates at different times of the week on an hourly basis. This can be very useful for Chicago police officers, who can identify places that are more dangerous during their particular work shifts. It can also help answer several common questions, such as whether it is actually more dangerous to be outside at night (and if so, when).

Visualizing Each Neighborhood in Detail

The first visualization presents the annual crime rates of a particular neighborhood on a per-hour basis. In other words, it represents the number of crimes that would occur if 100,000 people were in that neighborhood at that particular hour of the week.
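A rough sketch of how such an hour-of-week rate could be computed is below. It assumes each crime has a timestamp, bins crimes into the 168 hours of the week, and scales by population and number of years; the function names, timestamps and population are hypothetical, and the notebook may do this differently.

```python
from collections import Counter
from datetime import datetime


def hour_of_week(timestamp):
    """0 is Monday 00:00-00:59, 167 is Sunday 23:00-23:59."""
    return timestamp.weekday() * 24 + timestamp.hour


def hourly_rate_per_100k(timestamps, population, n_years):
    """Annual crimes per 100,000 residents for each hour of the week."""
    counts = Counter(hour_of_week(t) for t in timestamps)
    return {
        hour: counts.get(hour, 0) / n_years / population * 100_000
        for hour in range(168)
    }


# Hypothetical crime timestamps for one neighborhood over one year.
crimes = [
    datetime(2015, 1, 7, 8, 30),   # a Wednesday, 8 am hour
    datetime(2015, 1, 14, 8, 5),   # the following Wednesday, 8 am hour
    datetime(2015, 1, 8, 2, 0),    # a Thursday, 2 am hour
]
rates = hourly_rate_per_100k(crimes, population=50_000, n_years=1)
print(rates[hour_of_week(datetime(2015, 1, 7, 8, 30))])
```

Each neighborhood then yields 168 numbers, one per hour of the week, which is what the charts display.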

(Note: click this link to choose the neighborhood you want to look at - I wasn't able to embed the interactive graphic into my blog post so I showed screenshots instead)

We can definitely see some interesting trends here - in almost every neighborhood, the safest time of the day is in fact the early morning, usually the time when children are going to school, while the most dangerous is the late afternoon and early evening, when people are coming home from work. The Belmont Cragin township, shown below, is representative of a typical Chicago neighborhood.

Some notable exceptions include, once again, the Chicago Loop, which is a commercial, largely non-residential area that is populated mainly in the daytime. Note that crimes there are most prevalent during the middle of the day rather than during times when people travel back home. It also makes sense that crime rates on weekends (shaded in orange) are lower than crime rates on weekdays (shaded in gray), as very few people work on weekends. With smaller crowds, thefts, assaults and various violations would occur less often.

Another interesting neighborhood to look at is Lake View, a part of the city known for its nightlife. As seen in the graphic below, crime rates are fairly consistent during the weekdays, but spike up during Friday nights and Saturday nights. This is no surprise, as intoxicated people do have the tendency to warrant more attention from the police.

Visualizing Chicago as a Whole

Another way of looking at this data is to look at all of Chicago for a given hour of the week, and see which parts of the city are most dangerous at various times. The following graphic showcases the crime rates in Chicago on Wednesday at 8 am. This is usually a time when most neighborhoods have the lowest amount of crime activity, so it's not surprising that O'Hare International Airport (at the top left) has the highest crime rate in Chicago - it is indeed one of the busier times of the day for the transportation hub.

Chicago Crime Rates at Wednesday 8 am

Fast forward to noon time in Chicago, and it's the commercial center - the Loop - that has the highest crime activity. Again, not surprising since this is where the crowds are at this time of the day.

Chicago Crime Rates at Wednesday Noon

In the evening, it appears that Garfield Park is the most dangerous area in Chicago. Again, this seems intuitive, as this area is known to have violent crimes and dusk is usually when illegal activity occurs.

Chicago Crime Rates at Wednesday 7 pm

Finally, into the late hours of the night, we notice that Garfield Park remains one of the least safe places to be, but Englewood also skyrockets to the top. Again, this is expected, as Wikipedia calls it "one of the most dangerous neighborhoods in the city by almost every metric."

Chicago Crime Rates at Thursday 2 am

(Note: click on this link to choose the time of the week you want to look at - again, I wasn't able to embed the interactive graphic into my blog post so I showed screenshots instead).


If I had more time, I would have:

1) Figured out a way to embed my interactive graphic into this post;

2) Done the same visualizations for serious crimes only - it wouldn't require much work other than deciding which of the crime types listed in the Chicago crime dataset are actually very serious offenses;

3) Incorporated macroeconomic factors (e.g. inflation, demographics of community) into predicting crime rates - I wouldn't be surprised if these additions would explain crime activity better.

So there you have it - my project on Chicago's crime and housing data. To see the details of my analysis, check out the IPython Notebook on my GitHub. I would love to get your feedback and suggestions on more analyses that I can dig into.

Acknowledgments: I drew inspiration from Bokeh's gallery of sample visualizations and adapted code from their example libraries.

Thursday, January 14, 2016

Decision Making in Terms of a Search Problem

Life is a long path of decisions. Each decision leads you closer and closer to your goal - whatever that goal may be for you. Especially as a senior about to graduate from college, it looks pretty scary, and I don’t know exactly how to get to my end goal (the rainbows and sunshine and gold) at the end of the road.

When considering a person’s life decisions, let’s look at it in terms of a search problem. We can call life a tree with one start node, and many different possible end nodes. Each level represents a different time period in one’s life, with the node representing the state a person occupies at that time period (i.e. the root node is the state of “birth”).

In this “life” tree, one cannot go backwards, although sometimes one can reach the same node from different paths (graduate college → med school vs. graduate college → gap year → med school).

So what’s the best method for traversing this tree - especially if the optimal goal node is unknown at the beginning?

Classical search methods - uninformed search methods like breadth-first search and depth-first search, or informed search methods like A* with different heuristics - are not possible for us. We cannot keep all paths in memory and then go back to a node in the past that might be more optimal than the one we are at currently. For example, you can’t say in senior year of Princeton, “I don’t think I like being a physics major. I want to go back to that other path I could have taken sophomore year - biology.” You would have to continue along with the choices in your current path - take extra time to switch majors and graduate, or just suck it up and graduate as physics.

No, for life, we have to perform local search - working with the states we occupy currently and the paths that lead from there. That means that in the tree diagram above, we can only transition to neighbor nodes of the node that we currently occupy at any time. 

In our state-space landscape, a visualization of states on the x-axis and the objective function on the y-axis, we can see that there is a global maximum, the highest overall point out of every possible state, and many local maxima. We can think of a local maximum as a small hill where if you look around, everything in your line of sight is lower than you. Your local small hill might be shorter than Mount Everest, the world’s global maximum, but you don’t see it in your neighborhood.

The objective function can be money, or happiness, or whatever metric you choose. Doesn’t matter - it’s your life. So let’s start searching for the state that brings you the most of whatever you value.

The life tree for a lot of college students in my position looks like different career nodes leading from a single graduation node:

Now the conventional way lots of people choose to make this decision is hill-climbing search, also known as greedy local search. If your objective function is “money” or “prestige”, then it seems like it’s a logical move to go for the job, or node, that is most high-paying or prestigious at the moment then, right? This is what greedy local search does. It looks for the neighboring node with the highest objective value at each stage, and takes that path. That’s what makes students looking for the end goal of prestige grab onto the most prestigious job they can find, and then after two years, another more prestigious one, and then after two years, another title. It intuitively sounds like it makes sense to do this.

However, greedy local search has problems dealing with plateaus or local maxima. If your life plateaus - i.e. you don’t get a promotion that you expected, you stay stagnant for a long time at your job - where do you go? What do you do when there’s no higher-objective node to take next?

Greedy local search has problems with local maxima because it won’t sacrifice high prestige in order to find the absolute highest point you could reach - so it will rarely find the global maximum. This is what is going through people’s minds when they say, “I’m already 40, I could never go back to law school and become a lawyer like I always wanted. Besides, I have a high-paying job right now and I’m pretty comfortable.” Greedy local search will never take you past “feeling pretty comfortable” to “maximum possible potential”. 

Greedy local search is conservative. It, for the most part, makes sure you’re not homeless. It makes sure you have food, shelter, a job that pays. But give you the chance to reach your dreams? For the most part, no.
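Here is a minimal sketch of hill-climbing on a one-dimensional landscape, to show how it stops at a local maximum. The landscape values are made up, and each state's neighbors are simply the positions to its left and right.

```python
def hill_climb(values, start):
    """Greedy local search: move to the best neighbor until none is higher."""
    current = start
    while True:
        neighbors = [i for i in (current - 1, current + 1) if 0 <= i < len(values)]
        best = max(neighbors, key=lambda i: values[i])
        if values[best] <= values[current]:
            return current  # no neighbor improves: a local maximum (or plateau)
        current = best


# Hypothetical objective values: a small hill at index 2, a bigger one at index 7.
landscape = [1, 3, 5, 4, 2, 6, 9, 12, 8]

print(hill_climb(landscape, start=0))
print(hill_climb(landscape, start=4))
```

Starting on the left, the search climbs the small hill and stops there, because every step toward the bigger hill first requires going down.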

Another strategy is simulated annealing. Hill-climbing never makes downward moves, so it can get stuck in plateaus or local maxima. Simulated annealing, similarly to hill-climbing, will try to go up the objective function gradient, but will occasionally accept a random move, even a downward one, with a probability that shrinks as the search goes on. This is the person who has a safe job and does well at it for a few years, and then quits to go backpacking in South America to “find herself”. Sometimes it works, sometimes, not so much.
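A minimal sketch of simulated annealing on a one-dimensional landscape follows. The landscape, starting temperature and cooling rate are arbitrary choices for illustration, and the acceptance rule is the common exp(delta / temperature) form.

```python
import math
import random


def simulated_annealing(values, start, temperature=5.0, cooling=0.99,
                        steps=2000, seed=0):
    """Random local moves; downhill moves are accepted less often as we cool."""
    rng = random.Random(seed)
    current = best = start
    for _ in range(steps):
        neighbors = [i for i in (current - 1, current + 1) if 0 <= i < len(values)]
        candidate = rng.choice(neighbors)
        delta = values[candidate] - values[current]
        # Always accept uphill or level moves; accept downhill moves
        # with probability exp(delta / temperature), which is below 1.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current = candidate
        if values[current] > values[best]:
            best = current  # remember the highest state visited so far
        temperature = max(temperature * cooling, 1e-9)
    return current, best


# Hypothetical objective values: a small hill at index 2, a bigger one at index 7.
landscape = [1, 3, 5, 4, 2, 6, 9, 12, 8]

final_state, best_state = simulated_annealing(landscape, start=0)
print(final_state, best_state)
```

Early on, when the temperature is high, the search is willing to walk down off the small hill; as it cools, it behaves more and more like plain hill-climbing.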

So what is the best local search strategy for life? Let’s look at an example:

Greedy local search, for someone in my situation (about to graduate college, confused about life, the usual) means taking the most prestigious or high-paying job available, greedily. And then after that, it involves always taking the “best” step - going to consulting, going to business school, becoming a partner, retiring well. Sometimes people throw simulated annealing in there: starting a company, joining a small, promising company, traveling the world - but often that is a randomly taken path that doesn’t optimally build off of past experience.

Let’s do a case study of Elon Musk. He is someone who I, and many students in my position, see as reaching the pinnacle of money, prestige, and fulfillment.

Elon Musk. Random numbers.

His objective function was, I’m guessing, some combination of impact and prestige. He chose to go to Stanford for grad school and then immediately dropped out to form an Internet phonebook company, Zip2. This probably was not the most impactful nor prestigious job he could get. But I’m guessing he did it because the type of thinking and type of work that it involved really interested him. He truly wanted to learn about starting a business, about enterprise software engineering, etc. Indeed, the learning that he did there served him well in his future ventures.

Maybe the best local search strategy for life depends on another heuristic, or function that helps you decide on the next step to take in a search algorithm. What Elon Musk has been doing is a greedy search of knowledge and interest. Each step of the way, Elon Musk envisioned his end goal - prestige and fulfillment, particularly in areas that benefit the future of mankind - and took steps that gave him direct knowledge in that field. He wanted to learn about starting an Internet business? He started an Internet business, Zip2. He wanted to learn about disrupting an entire industry? He disrupted the banking industry. He wanted prestige and fulfillment, but he didn’t spring for a reputable company, or a non-profit, or even a reputable non-profit.

Mapping his possible state space against his knowledge gained will produce a very different graph than mapping state space against prestige.

Of course, not everyone has a “field” in mind when their objective function is just general prestige or money. So as Peter Thiel says in Zero to One, many college students hedge their bets by choosing a general field and then transitioning to whatever catches fire in the future (consulting to virtual reality? anyone?). This can work, but I’m postulating that people have to take long bets on what they truly want to do, as many fields become more and more specific, in order to truly reach their highest potential. Starting off with the right foundational knowledge at least puts you in the neighborhood of your global maximum.

I strongly believe people should stop greedy local searching on prestige or money, and start searching on knowledge of a specific genre — the foundational tool that will lead you through life’s downward moves on prestige, but towards the global maximum of your life potential.

*pictures, except for #2 and 3, made in Google Docs

*this article also at