Inmates: Part 3 - Text Analytics

August 30, 2016

Previously, on inmates...

Last week, we gained a better understanding of the Polk County, IA inmate population by exploring it graphically. One of the items that was a little more difficult to explore was the "Description of Arrest" information:

With our prior explorations of race and gender, there were few enough categories that we could visualize them with something like a bar graph. With over a hundred unique arrest descriptions, however, visual techniques would not be able to display the information we're interested in. Furthermore, there is a lot of overlap between arrest descriptions: four of the top arrest types are for some type of parole or probation violation. If we want to search for meaningful relationships between the arrest types and the rest of our data, we'll need to find a way to group the arrest types into a manageable number of categories.

Text Categorization

Manual approach. Looking at the list of most frequent arrest types, there are a few common themes that we could try to manually bucket the arrests into. The following code searches for keywords (or partial keywords) and assigns arrest types to one of seven buckets.

In [2]:
# several common words in the list of offenses so let's manually bucket them
# and see how those buckets are distributed
offense_bucket = list()
for each in data.Description:
    if "VIOLATION" in each:
        new = "Violation"
    elif "POSSESS" in each:
        new = "Possession"
    elif "ASSAULT" in each:
        new = "Assault"
    elif "THEFT" in each:
        new = "Theft"
    elif "MURDER" in each:
        new = "Murder"
    elif "NARC" in each:
        new = "Narcotics"
    else:
        new = "Other"
    offense_bucket.append(new)

data['OffenseBucket'] = pd.Series(offense_bucket, index=data.index)
data.OffenseBucket.value_counts()
Out[2]:

Other         313
Violation     261
Possession    117
Theft          66
Assault        66
Murder         18
Narcotics       7
Name: OffenseBucket, dtype: int64

The "Violation" and "Possession" buckets are reasonably sized, but the "Other" bucket is over a third of the entire inmate population, while the smallest bucket, "Narcotics", only has 7 inmates. I'm thinking my buckets are not so good.

Rather than manually creating buckets, what if we let the arrest types naturally group themselves into categories based on the words of the description?

We can use a term frequency-inverse document frequency matrix to calculate which terms are "important" to an arrest description, and then use k-means clustering to create a predefined number of categories with similar word importance.
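To make the idea of term "importance" concrete, here is a minimal pure-Python sketch of one common tf-idf variant (term frequency times the log of inverse document frequency). It is not the code behind this post: the first three descriptions come from the Burglary list further down, the two violation strings are hypothetical, and libraries such as scikit-learn typically add smoothing and normalization on top of this basic formula.

```python
import math
from collections import Counter

# Toy corpus: the first three descriptions appear in the post's "Burglary"
# list; the last two are hypothetical strings added only for contrast.
docs = [
    "BURGLARY 1ST DEGREE",
    "BURGLARY 3RD DEGREE",
    "ATTEMPTED BURGLARY 2ND DEGREE",
    "PROBATION VIOLATION",   # hypothetical
    "PAROLE VIOLATION",      # hypothetical
]

def tfidf(docs):
    """Return one {term: weight} dict per document (unsmoothed tf * idf)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many descriptions does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return rows

weights = tfidf(docs)
for doc, row in zip(docs, weights):
    top = max(row, key=row.get)
    print(f"{doc!r}: highest-weighted term = {top!r}")
```

On this raw text, rare tokens such as "1st" outweigh "burglary" because they appear in fewer descriptions, which is one reason to clean the text before computing the matrix.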

One difficulty with k-means clustering is that you must know how many categories you'd like to create, often without knowing how many categories inherently exist in the data. We don't know exactly what is the "right" number of clusters in this situation, or whether there really is a "right" number, so we'll use a bit of a "kitchen sink" method, and try lots of things to see if anything sticks.

My process included:

Removing numeric characters from the text
Removing common words such as "degree", and "offense", which would tend to undesirably group disparate arrest types together
Transforming all terms to the same case, so that "POSSESSION" and "possession" are treated as equivalent
Performing stemming to isolate the roots of words, so that terms like "possession" and "possess" are treated as equivalent
Calculating the tf-idf matrix on the cleaned data
Trying a variety of solutions (2 to 100 clusters) and collecting statistics on each solution to identify whether there is a "best" one
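The cleaning steps in that list can be sketched roughly as follows. This is an illustration under assumptions, not the original code: the stop words "degree" and "offense" come from the list above, but the `crude_stem` helper is a hypothetical stand-in for a real stemmer, and dropping tokens shorter than three letters (to remove leftovers like "rd" from "3RD") is my own addition.

```python
import re

# "degree" and "offense" are named in the post; everything else here is an
# illustrative assumption.
STOP_WORDS = {"degree", "offense"}
MIN_LENGTH = 3  # assumption: drops leftovers such as "rd" once digits are gone

def crude_stem(word):
    """Very rough suffix stripper, a stand-in for a real stemming library."""
    for suffix in ("ion", "ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    # Strip a plural "s" but leave words like "possess" alone.
    if word.endswith("s") and not word.endswith("ss") and len(word) > 4:
        return word[:-1]
    return word

def clean(description):
    """Lowercase, remove digits, drop stop words, and stem each token."""
    text = description.lower()
    text = re.sub(r"\d", "", text)
    tokens = re.findall(r"[a-z]+", text)
    kept = [t for t in tokens if len(t) >= MIN_LENGTH and t not in STOP_WORDS]
    return [crude_stem(t) for t in kept]

print(clean("ATTEMPTED BURGLARY 2ND DEGREE"))
print(clean("BURGLARY 3RD DEGREE - MOTOR VEHICLE, 1st Offence"))
print(clean("POSSESSION") == clean("possess"))
```

Note that the British spelling "Offence" in the raw data would slip past a stop list containing only "offense", so a real stop list may need both spellings.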

The following assessment plot would ideally flatten out at the best number of clusters:

[Plot: horizontal axis: number of clusters; vertical axis: average Euclidean distance to nearest cluster centroid]

In this case, I did not see a clear ideal number of clusters, and in trying different solutions there was always a sizable "miscellaneous" category. Therefore, we need to make a qualitative decision about which solution provides enough detail to be informative, but few enough categories to be manageable, accepting that we're just going to have a large "other" category.
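For readers who want to see how such an assessment curve is produced, here is a minimal sketch of plain k-means (Lloyd's algorithm) run for several cluster counts, recording the average Euclidean distance to the nearest centroid each time. The 2-D points are hypothetical toy data with three obvious groups; the real input was the tf-idf rows, tried with 2 to 100 clusters, and this is not the code used for the post.

```python
import math
import random

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm with random initial centroids; returns centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids

def avg_distance(points, centroids):
    """Average Euclidean distance from each point to its nearest centroid."""
    return sum(min(math.dist(p, c) for c in centroids) for p in points) / len(points)

# Hypothetical toy data: three well-separated groups of three points.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2),
          (9.0, 0.0), (9.2, 0.1), (8.9, -0.1)]

scores = {k: avg_distance(points, kmeans(points, k)) for k in range(1, 7)}
for k, score in scores.items():
    print(k, round(score, 3))
```

On data this clean the curve should drop sharply and then level off. A single random start can still land in a poor local optimum, so real runs usually repeat each k several times and keep the best result.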

After comparing the arrest descriptions of a variety of solutions, I settled on 15 categories. They each contain anywhere from 3 to 70 unique arrest types (out of a total of 190), and are distributed among the inmates as follows:

So our "Misc" category is about 25% smaller than our manual "Other" bucket was, and the process otherwise grouped arrest types into categories that make sense. For example, here are the (original, uncleaned) arrest types in the "Burglary" category:

BURGLARY 3RD DEGREE-UNOCCUPIED MOTOR VEHICLE-2ND OR SUBSEQUE
BURGLARY - 3RD DEGREE (VEHICLE)
BURGLARY 1ST DEGREE
ATTEMPTED BURGLARY 2ND DEGREE
BURGLARY 2ND DEGREE
BURGLARY 3RD DEGREE
BURGLARY 3RD DEGREE - MOTOR VEHICLE, 1st Offence
OPERATE VEHICLE WITHOUT OWNERS CONSENT-- Motor Vehicle
ATTEMPTED BURGLARY 3RD DEGREE
BURGLARY 3RD DEGREE - MOTOR VEHICLE - 2ND OR SUBSEQUENT OFFE

Obviously, the word "burglary" is pretty definitive for this category, which is why I named it what I did - however, we can see that four of the burglary descriptions also contain the word "vehicle", which is likely why we're getting the "OPERATE VEHICLE WITHOUT OWNERS CONSENT" arrest in this category as well.

The "Misc" category is pretty stubborn - it contains such varied categories that it just doesn't want to significantly break up even when increasing the number of categories, and many more than 15 categories would become difficult to work with. If we had more information about the arrest details, we might be able to make clean clusters by incorporating that information, but we really just have the text to work with here. Like much of analytics, this process is a handy tool that is superior to manual effort, but far from a silver bullet.

Now that we have some useful categorization in place, we can compare arrest types by race:

This bar chart is stacked to 100% to more clearly show each arrest category's racial makeup, and highlighting over a bar details the percentages. The numbers on each bar indicate the total number of inmates in that category, across races.

Remember from Part 1 that about 29% of inmates are black. I don't see any significant departures from that proportion in these arrest categories, with the exception of the "Weapons/Interference" category, which is over 50% black. Even with only 33 total inmates in this category, this is a statistically significant difference from 29% (p = 0.0036) - and of course, quite different from Polk County's overall population, which is 7% black.
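As a rough illustration of how a comparison like this can be tested, here is a sketch of a one-sample proportion z-test using only the standard library. The category size (33) and the baseline (29%) come from the paragraph above, but the count of 17 black inmates is a hypothetical stand-in for "over 50%", so the result will not necessarily match the p-value quoted above, which may also have come from a different test.

```python
import math

def one_sample_proportion_z(successes, n, p0):
    """Normal-approximation test of H0: true proportion == p0.

    Returns the z statistic and the one-sided (upper-tail) p-value.
    """
    p_hat = successes / n
    # Standard error under the null hypothesis.
    se = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se
    # Upper-tail probability of the standard normal, via the
    # complementary error function.
    p_upper = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_upper

# n = 33 and p0 = 0.29 are from the post; 17 is a HYPOTHETICAL count.
z, p = one_sample_proportion_z(17, 33, 0.29)
print(f"z = {z:.2f}, one-sided p = {p:.4f}")
```

With a sample this small, an exact binomial test would be a reasonable alternative to the normal approximation.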

It is difficult to draw conclusions about what may be behind this number. Although this one category is significantly different, it is still relatively small compared to the total number of black inmates. So although blacks are being arrested at a disproportionately high rate overall, it does not appear to me that these rates fluctuate much by crime type.

STAY TUNED...

That's all for today. I'm not sure about you, but my attention span is beginning to waver on this topic, so I may not get to all of the analyses I proposed in the last post. Let me know if there's anything you'd like me to explore before I close the books on this one!

Step 1 - gather data. I wrote a program to grab the data on every inmate listed on the Polk County website, so that I could work with it later. Data collection is not the most exciting step, but it's obviously necessary and it can take some time.
Step 2 - explore the data. Now that I had all the inmate data, I wanted to do some simple explorations. This occasionally required some extra manipulation of the data, and will be helpful background for the heavier-duty stuff.
Step 3 - statistical modeling. Using text mining techniques, I grouped the 190 unique arrest types into 15 categories.
Future Possibilities: Regression analysis - can we model inmates' bail amounts? Inmate clustering - are there "profiles" of inmates, e.g. older white men arrested for domestic abuse?

Inmates: Part 2

August 23, 2016

Previously, on inmates...

When we last left this topic, I had given you a sneak peek of a visual intended for this post, pictured at right.

I will get into plenty more visuals in a moment, but I would be remiss if I did not do my statistical due diligence and provide you with a p-value for that comparison, as one commenter so sagely suggested.

P-value is a term that you've probably at least heard bandied about. If you're not overly familiar with what it means, the simple version is this: we use statistics to estimate things we cannot know 100% for sure. As such, we've developed ways to assess the likelihood that our estimates are wrong. So we decide how often we're comfortable being wrong, and then we calculate a standardized number that tells us if the data support us being wrong only that often. Ish.

For the comparison I did between inmate and census data, I should use a one-sample proportion test. In doing this, I am treating the inmate population as if it were a random sample of the county population, and testing the assumption (or "null hypothesis") that the proportion of blacks in jail is the same as the proportion of blacks in Polk County. Although 7% vs 29% seems like a large difference, it could be that there aren't enough inmates to make this assertion.

In this case, we're looking at a total of about 850 inmates. When we do the math:

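from scipy.special import ndtr  # standard normal CDF

# Difference between the census proportion (0.07) and the inmate proportion being tested
ZNum = (0.07 - raceprop2[1])
# Standard error of the proportion under the null hypothesis
ZDen = (0.07*(1-0.07)/float(inmatepop))**.5
Z = ZNum/ZDen
# One-sided test at the 5% level: if the absolute value of Z is greater than 1.65, we reject the null hypothesis
print(Z)

-25.0942310125

# get the p-value (lower-tail probability of Z under the standard normal)
print(ndtr(Z))

2.87471807598e-139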

we get a p-value very, very close to zero, in which case I'm pretty confident that blacks are being jailed at a disproportionately high rate. This is also consistent with findings from the Des Moines Register a few years back.

data exploration

So after a simple look at one variable, it's looking like Polk County is in the same metaphorical boat that we might have guessed before even seeing the data. What else can we find out?

Next to race, I was most curious about the gender distribution of inmates. I guessed that the female population was smaller than the male, and that appeared to be right. (note: clicking on the series name in the legend will filter the graph)

It's clear that there are many more male inmates than female inmates, although the proportion is a little more even for white inmates. Asians and Pacific Islanders are quite close to parity, but there are so few of them that this doesn't mean much.

Now, let's take a look at age. Most inmates are in their twenties and thirties, with very few over sixty.

This pattern is similar for both men and women, although there are actually a few more women in their thirties than in their twenties.

There are also more whites in their thirties than in their twenties. The proportion of black inmates in their twenties is also a little high compared to the general racial distribution: about 37% vs. 29%.

What kinds of things do people get arrested for?

The most common reasons inmates are in jail are probation and parole violations. I'm not quite sure what I was expecting to find with this one, but it wasn't that.

One thing you might notice in the above table is that the second entry seems to repeat itself. If there were multiple charges described on a single inmate's page, there would be separate entries (my code only captures the first). For some reason, there are some inmates who have a single charge description that reads exactly that way. I assume this is noise in the data, and is a good case for performing some type of cleaning and grouping task on the descriptions.

Finally, the last variable we'll examine is the dollar amount of the bond. As we can see below, the vast majority of inmates have a $0 bond.

And that pattern holds true by race as well.

All of the graphics discussed above can be reviewed in this dashboard. To explore the data visually, I used plotly, a python package that was unfamiliar to me. In the interest of time, I kept the settings pretty basic (read: please forgive the lack of axis titles). If I ever feel like messing around with JavaScript, I might enhance the interactivity and create linked plots, etc. But that's for a different post.

And of course, you can see all of the code used for this (and last week's) post on the code page.

Stay tuned...

That's all for this time! I hope you're enjoying our journey together as we dive into this topic further week by week. Our roadmap so far:

Step 1 - gather data. I wrote a program to grab the data on every inmate listed on the Polk County website, so that I could work with it later. Data collection is not the most exciting step, but it's obviously necessary and it can take some time.
Step 2 - explore the data. Now that I had all the inmate data, I wanted to do some simple explorations. This occasionally required some extra manipulation of the data, and will be helpful background for the heavier-duty stuff.
Step 3+ - statistical modeling. This is where we're headed next. There should be at least one more entry in this series, covering:
- Text clustering - grouping similar offense descriptions into categories
- Regression analysis - can we model inmates' bail amounts?
- Inmate clustering - are there "profiles" of inmates, e.g. older white men arrested for domestic abuse?

*photo: tiny police box at Trafalgar Square, spotted on our bike tour of London in June*

Inmates: Part 1

August 14, 2016

I do not pretend to be an expert on criminal or social justice. But as a topic that has been prominent in recent news and public opinion, I was curious about my local area's arrest statistics. Polk County, Iowa is predominantly white, and I wondered how arrest rates and other statistics varied by race, gender, etc.

If we assume that race plays no role whatsoever in a person's likelihood of being jailed, then we would expect the inmates' racial makeup to be pretty close to the county's overall demographics, which are as follows:

White alone	86%
Black or African American alone	7%
American Indian and Alaska Native alone	0%
Asian alone	4%
Native Hawaiian and Other Pacific Islander alone	0%
Two or More Races	2%
Hispanic or Latino	8%
White alone - not Hispanic or Latino	79%

Source: Census V2015 data

So how do our arrest records look? I searched to see whether this information was already published, and found the following:

The Des Moines Register - reports arrests from the last 60 days with some good summary reporting and visuals, but no reporting by race.
Polk County's website has detailed records for each current inmate, but very thin summary reports.

Polk County's page was the most detailed, readily available source I found. However, the details are buried in each individual inmate's page, which would be a pain to gather by hand. Luckily, I know a thing or two about web scraping. I wrote a simple python script to crawl through Polk County's current inmate listing, gather data on each inmate, and perform analyses on the resulting data. The current inmate population's racial makeup was (as of 8/11):

White	68%
Black	30%
Asian	2%
Pacific Islander	<1%

Some big differences there compared to the census data.
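One quick way to put a number on those differences is to divide each group's share of inmates by its share of the county population. The sketch below uses only the percentages from the two tables above; matching the census "alone" categories to the jail's labels is my own assumption, and Pacific Islanders are left out because both figures round to roughly zero.

```python
# Percentages copied from the two tables above. Mapping census "alone"
# categories onto the jail's race labels is an assumption.
census = {"White": 86, "Black": 7, "Asian": 4}
inmates = {"White": 68, "Black": 30, "Asian": 2}

# A ratio above 1 means the group makes up a larger share of the jail
# population than of the county population.
ratios = {group: inmates[group] / census[group] for group in census}

for group, ratio in ratios.items():
    print(f"{group}: {ratio:.2f}x its share of the county population")
```

This is only a descriptive ratio; it says nothing yet about whether the gap is statistically significant.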

As the title image suggests, this is entry one in a series. Next time, I'll be slicing the data by more variables, and adding some visualizations. Here's a preview, with a plotly representation of the two tables above (hover over the bars to see values):