In the past month, we analyzed the performance of NBA players in clutch moments, soccer players before and after transfers, and tennis players when they serve first or second. In this post, we move away from sports and look into US incarceration rates. More specifically, we were interested in understanding how one’s name, gender and age may correlate with his or her probability of being in jail. Would an “Adam” right out of college be more likely to go to jail than an “Adam” who has been in the workforce for thirty years? Do names that hint at a particular ethnicity, such as “Jose” or “Mohamed,” lead to a higher or lower incarceration rate? Using available jail and census data, we estimate the probability that a person with a given name, gender and age will go to jail.
We decided to calculate two probabilities:
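- Probability that one will go to jail given first name, age and gender: $P(\text{jail} \mid \text{first name}, \text{age}, \text{gender})$
- Probability that one will go to jail given last name, age and gender: $P(\text{jail} \mid \text{last name}, \text{age}, \text{gender})$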
To calculate these metrics, we use the product rule of probability, which states that the chance that two events A and B both occur equals the probability that A occurs given that B occurs, multiplied by the probability that B occurs:

$$P(A \cap B) = P(A \mid B)\,P(B)$$
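Solving for the conditional probability, and expanding the joint probability the other way as $P(B \mid A)\,P(A)$, yields the fraction known as Bayes’ Rule:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B \mid A)\,P(A)}{P(B)}$$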
Applying these equations to our two metrics yields the following equations:
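One way to write the two metrics that is consistent with the description in this post is sketched here. It assumes that every term is conditioned on age and gender except the last name denominator; the exact grouping of terms is an assumption on our part.

$$P(\text{jail} \mid \text{first name}, \text{age}, \text{gender}) = \frac{P(\text{first name} \mid \text{jail}, \text{age}, \text{gender})\,P(\text{jail} \mid \text{age}, \text{gender})}{P(\text{first name} \mid \text{age}, \text{gender})}$$

$$P(\text{jail} \mid \text{last name}, \text{age}, \text{gender}) = \frac{P(\text{last name} \mid \text{jail}, \text{age}, \text{gender})\,P(\text{jail} \mid \text{age}, \text{gender})}{P(\text{last name})}$$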
The numerator was calculated using records of inmates held in Rhode Island, Mississippi and Orange County, totaling nearly 200,000 prisoners, as well as current data on the number of inmates in the United States (more than 2 million). The denominator was calculated using census data and Social Security Administration data.
Notice that the denominator of the last name metric is not conditioned on age or gender. This is because we assume that last name frequencies are independent of gender and age. We also couldn’t find data that provided the conditional probability, which further motivated this simplification.
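As a minimal sketch of how such an estimate could be assembled from counts, assume the metric factors as P(name | jail, age, gender) × P(jail | age, gender) / P(name | age, gender). The function name, the parameter names and every number below are hypothetical illustrations, not values from our datasets.

```python
def jail_probability(sample_name_count, sample_group_count,
                     us_inmates_in_group, us_population_in_group,
                     name_share_in_population):
    """Estimate P(jail | name, age group, gender) from counts.

    All inputs are assumed to refer to the same age group and gender.
    """
    if sample_group_count <= 0 or us_population_in_group <= 0:
        raise ValueError("group counts must be positive")
    if name_share_in_population <= 0:
        raise ValueError("name share must be positive")

    # Share of sampled inmates in this group who carry the name.
    p_name_given_jail = sample_name_count / sample_group_count
    # Share of the US population in this group that is incarcerated.
    p_jail = us_inmates_in_group / us_population_in_group
    # Bayes' rule: numerator from inmate data, denominator from population data.
    return p_name_given_jail * p_jail / name_share_in_population


# Hypothetical inputs, for illustration only.
estimate = jail_probability(
    sample_name_count=150,
    sample_group_count=30_000,
    us_inmates_in_group=300_000,
    us_population_in_group=20_000_000,
    name_share_in_population=0.004,
)
print(estimate)
```

For the last name metric, the same function would apply with `name_share_in_population` taken over the whole population rather than one age and gender group, which is the independence assumption described above.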
Our Findings on First Name
Some first names came out with surprisingly high probabilities. Upon inspection, we realized that most of these outliers are very uncommon names that occur in our inmate dataset only a handful of times. For example, “Khari”, “Anibal” and “Rudolfo” had the highest incarceration probability for males in age groups 20-29, 40-49 and 60-69 respectively in our data. Because these names are rare in the general population, the denominator in our method is very small, so even a few inmates with the name push the estimated probability of jail up sharply. That said, some common names also top our list: “Frederick” and “Antonio” top the 10-19 and 80-89 male age groups, and “Jeanine” tops our 70-79 female age group.
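One possible guard against these outliers, which we did not apply here, is to drop names that appear fewer than some minimum number of times in the inmate sample. The sketch below assumes a threshold of 30; the function name, the threshold and the example numbers are all hypothetical.

```python
def drop_rare_names(estimates, sample_counts, min_count=30):
    """Keep only names seen at least `min_count` times in the inmate sample."""
    return {
        name: prob
        for name, prob in estimates.items()
        if sample_counts.get(name, 0) >= min_count
    }


# Hypothetical estimates and inmate-sample counts, for illustration only.
estimates = {"rare_name": 0.40, "common_name": 0.02}
sample_counts = {"rare_name": 3, "common_name": 850}

filtered = drop_rare_names(estimates, sample_counts)
print(filtered)
```

The threshold trades coverage for stability: a higher value removes more noisy estimates but also removes more names from the results.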
Our Findings on Last Name
The top 20 last names with highest rates for each age group can be found in the table below:
Again, we face the problem that our data contains only a few occurrences of uncommon last names. The probability of going to jail given those last names is therefore especially high, since the probability of having the last name is the denominator in our method. For last names, however, we did not want to constrain ourselves to the 100 most common last names, because we felt that doing so would detract from the diversity of the names. For future studies, it may be best to restrict the analysis to the top 1,000 names or so, in order to capture diversity while limiting the denominator effect of extremely uncommon last names.
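The restriction to the most common names suggested above could be sketched as follows. This is an illustration of the idea rather than something we ran; `restrict_to_common`, its parameters and the example shares are hypothetical.

```python
def restrict_to_common(estimates, population_share, n=1000):
    """Keep estimates only for the n names most common in the population."""
    # Rank names by population share, most common first.
    ranked = sorted(population_share, key=population_share.get, reverse=True)
    keep = set(ranked[:n])
    return {name: prob for name, prob in estimates.items() if name in keep}


# Hypothetical population shares and jail estimates, for illustration only.
population_share = {"name_a": 0.0090, "name_b": 0.0040, "name_c": 0.00001}
estimates = {"name_a": 0.010, "name_b": 0.015, "name_c": 0.600}

common_only = restrict_to_common(estimates, population_share, n=2)
print(common_only)
```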
In case the form above does not work, here are the links to the spreadsheets with the jail probabilities in decimal form (0.1 = 10%). The columns are ages 10-19, 20-29, 30-39, etc. with the last category being age 100+.
Male First Name
Male Last Name
Female First Name
Female Last Name
Some of the scraping scripts are on GitHub.