Previously, on inmates...

When we last left this topic, I had given you a sneak peek of a visual intended for this post, pictured at right.

I will get into plenty more visuals in a moment, but I would be remiss if I did not do my statistical due diligence and provide you with a p-value for that comparison, as one commenter so sagely suggested.

P-value is a term that you've probably at least heard bandied about. If you're not overly familiar with what it means, the simple version is this: we use statistics to estimate things we cannot know for sure, so we've developed ways to assess the likelihood that our conclusions are wrong. First we decide how often we're comfortable being wrong (5% is a common choice). Then we calculate the p-value: the probability of seeing a difference at least as large as the one we observed if there were really no difference at all. If the p-value is smaller than our comfort level, we conclude the difference is probably real. Ish.

For the comparison I did between inmate and census data, I should use a one-sample proportion test. In doing this, I am treating the inmate population as if it were a random sample of the county population, and testing the assumption (or "null hypothesis") that the proportion of blacks in jail is the same as the proportion of blacks in Polk County. Although 7% of county residents vs. 29% of inmates seems like a large difference, it could be that there aren't enough inmates to rule out chance as the explanation.

In this case, we're looking at a total of about 850 inmates. When we do the math:

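from scipy.special import ndtr  # standard normal CDF

# Z = (county proportion - inmate proportion) / standard error under the null,
# so an inmate proportion above 7% gives a negative Z
ZNum = (0.07 - raceprop2[1])
ZDen = (0.07*(1-0.07)/float(inmatepop))**.5
Z = ZNum/ZDen
# One-sided test at the 5% level: reject the null hypothesis if Z is below -1.645
# (a two-sided test would use |Z| > 1.96 instead)
print(Z)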

-25.0942310125

# get the one-sided (lower-tail) p-value
print(ndtr(Z))

2.87471807598e-139

we get a p-value very, very close to zero, in which case I'm pretty confident that blacks are being jailed at a disproportionately high rate. This is also consistent with findings from the Des Moines Register a few years back.
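The snippet above relies on variables defined in the earlier code. As a minimal, self-contained sketch of the same one-sample proportion test, the version below plugs in the rounded figures quoted in this post (about 850 inmates, 29% vs. 7%), so its Z will differ slightly from the value printed above. The function names are illustrative and not from the original code, the sign is flipped (inmate proportion minus county proportion) so that Z comes out positive, and it uses the standard library's `math.erfc` for the normal tail instead of `ndtr`.

```python
import math


def proportion_z(p_hat, p0, n):
    """Z statistic for a one-sample proportion test.

    p_hat: observed sample proportion
    p0:    proportion assumed under the null hypothesis
    n:     sample size
    """
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    return (p_hat - p0) / standard_error


def upper_tail_p(z):
    """One-sided p-value: P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Rounded figures from the post; treat them as approximations.
z = proportion_z(p_hat=0.29, p0=0.07, n=850)
p = upper_tail_p(z)

print(z)  # roughly 25 standard errors above the null
print(p)  # vanishingly small
```

Using `erfc` directly keeps the tiny tail probability from being rounded to zero, which is what would happen if you computed `1 - cdf(z)`.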

Data exploration

So after a simple look at one variable, it's looking like Polk County is in the same metaphorical boat that we might have guessed before even seeing the data. What else can we find out?

Next to race, I was most curious about the gender distribution of inmates. I guessed that the female population was smaller than the male, and that appeared to be right. (note: clicking on the series name in the legend will filter the graph)

It's clear that there are many more male inmates than female inmates, although the proportion is a little more even for white inmates. Asians and Pacific Islanders are quite close to parity, but there are so few of them that this doesn't mean much.

Now, let's take a look at age. Most inmates are in their twenties and thirties, with very few over sixty.

This pattern is similar for both men and women, although there are actually a few more women in their thirties than in their twenties.

There are also more whites in their thirties than in their twenties. The proportion of black inmates in their twenties is also a little high compared to the general racial distribution: about 37% vs. 29%.
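A comparison like "37% vs. 29%" boils down to computing one group's share within an age bracket and setting it against that group's overall share. Here is a rough sketch of that calculation using only the standard library. The records are made-up toy values, not the real inmate data, and the helper names are hypothetical.

```python
# Toy (age, race) records for illustration only; NOT the real inmate data.
records = [
    (23, "Black"), (27, "White"), (34, "White"), (38, "White"),
    (41, "Black"), (25, "Black"), (52, "White"), (29, "White"),
]


def decade(age):
    """Map an age to the start of its decade, e.g. 27 -> 20."""
    return (age // 10) * 10


def share(rows, race, in_decade=None):
    """Fraction of rows with the given race, optionally within one decade."""
    pool = [r for age, r in rows if in_decade is None or decade(age) == in_decade]
    if not pool:
        return 0.0
    return sum(1 for r in pool if r == race) / len(pool)


overall_share = share(records, "Black")
twenties_share = share(records, "Black", in_decade=20)

print(overall_share, twenties_share)
```

If the share within a bracket is noticeably higher than the overall share, that group is over-represented in that bracket, which is the pattern described above.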

What kinds of things do people get arrested for?

The most common reasons inmates are in jail are probation and parole violations. I'm not quite sure what I was expecting to find with this one, but it wasn't that.

One thing you might notice in the above table is that the second entry seems to repeat itself. If there were multiple charges described on a single inmate's page, there would be separate entries (my code only captures the first). For some reason, there are some inmates who have a single charge description that reads exactly that way. I assume this is noise in the data, and is a good case for performing some type of cleaning and grouping task on the descriptions.
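As one possible starting point for that cleaning step, the sketch below normalizes case and whitespace and collapses a description that consists of the same phrase written twice in a row. That is an assumption about what the repeated entries look like. The sample strings are hypothetical, and `normalize_charge` is an illustrative name rather than part of the original code.

```python
import re
from collections import Counter


def normalize_charge(description):
    """Upper-case, squeeze whitespace, and collapse an exact doubled phrase."""
    text = re.sub(r"\s+", " ", description.strip().upper())
    words = text.split(" ")
    half = len(words) // 2
    # Assumption: a noisy entry is the same phrase repeated back to back.
    if half > 0 and len(words) % 2 == 0 and words[:half] == words[half:]:
        text = " ".join(words[:half])
    return text


# Hypothetical charge descriptions, for illustration only.
charges = [
    "Probation Violation",
    "PROBATION VIOLATION  PROBATION VIOLATION",
    "parole violation",
]

counts = Counter(normalize_charge(c) for c in charges)
print(counts.most_common())
```

This only catches exact repeats. Grouping descriptions that are merely similar would need something fuzzier, such as the text clustering planned for a later post.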

Finally, the last variable we'll examine is the dollar amount of the bond. As we can see below, the vast majority of inmates have a $0 bond.

And that pattern holds true by race as well.

All of the graphics discussed above can be reviewed in this dashboard. To explore the data visually, I used plotly, a Python package that was unfamiliar to me. In the interest of time, I kept the settings pretty basic (read: please forgive the lack of axis titles). If I ever feel like messing around with JavaScript, I might enhance the interactivity and create linked plots, etc. But that's for a different post.

And of course, you can see all of the code used for this (and last week's) post on the code page.

Stay tuned...

That's all for this time! I hope you're enjoying our journey together as we dive into this topic further week by week. Our roadmap so far:

Step 1 - gather data. I wrote a program to grab the data on every inmate listed on the Polk County website, so that I could work with it later. Data collection is not the most exciting step, but it's obviously necessary and it can take some time.
Step 2 - explore the data. Now that I had all the inmate data, I wanted to do some simple explorations. This occasionally required some extra manipulation of the data, and will be helpful background for the heavier-duty stuff.
Step 3+ - statistical modeling. This is where we're headed next. There should be at least one more entry in this series, covering:
- Text clustering - grouping similar offense descriptions into categories
- Regression analysis - can we model inmates' bail amounts?
- Inmate clustering - are there "profiles" of inmates, e.g. older white men arrested for domestic abuse?