Previously, on inmates...
Last week, we gained a better understanding of the Polk County, IA inmate population by exploring it graphically. One of the items that was a little more difficult to explore was the "Description of Arrest" information:
With our prior explorations of race and gender, there were few enough categories that we could visualize them with something like a bar graph. With over a hundred unique arrest descriptions, however, visual techniques would not be able to display the information we're interested in. Furthermore, there is a lot of overlap between arrest descriptions: four of the top arrest types are for some type of parole or probation violation. If we want to search for meaningful relationships between the arrest types and the rest of our data, we'll need to find a way to group the arrest types into a manageable number of categories.
Text Categorization
Manual approach. Looking at the list of most frequent arrest types, there are a few common themes that we could try to manually bucket the arrests into. The following code searches for keywords (or partial keywords) and assigns arrest types to one of seven buckets.
# there are several common words in the list of offenses, so let's manually
# bucket them and see how those buckets are distributed
# ('data' is the inmate DataFrame assembled in the earlier posts)
import pandas as pd

offense_bucket = list()
for each in data.Description:
    if "VIOLATION" in each:
        new = "Violation"
    elif "POSSESS" in each:
        new = "Possession"
    elif "ASSAULT" in each:
        new = "Assault"
    elif "THEFT" in each:
        new = "Theft"
    elif "MURDER" in each:
        new = "Murder"
    elif "NARC" in each:
        new = "Narcotics"
    else:
        new = "Other"
    offense_bucket.append(new)

data['OffenseBucket'] = pd.Series(offense_bucket, index=data.index)
data.OffenseBucket.value_counts()
The "Violation" and "Possession" buckets are reasonably sized, but the "Other" bucket is over a third of the entire inmate population, while the smallest bucket, "Narcotics", only has 7 calls. I'm thinking my buckets are not so good.
Rather than manually creating buckets, what if we let the arrest types naturally group themselves into categories based on the words in their descriptions?
We can use a term frequency-inverse document frequency matrix to calculate which terms are "important" to an arrest description, and then use k-means clustering to create a predefined number of categories with similar word importance.
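To make the tf-idf idea concrete, here's a tiny toy example (my own, not from the original analysis) using scikit-learn: terms that show up in almost every description get low weights, while rarer terms get high weights.
from sklearn.feature_extraction.text import TfidfVectorizer

# three made-up descriptions for illustration
docs = ["theft 5th degree", "theft 1st degree", "murder 1st degree"]
vec = TfidfVectorizer()
weights = vec.fit_transform(docs)

# for the first description, "degree" appears in every document and gets the
# lowest weight, while "5th" appears only here and gets the highest
print(dict(zip(vec.get_feature_names_out(), weights.toarray()[0].round(2))))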
One difficulty with k-means clustering is that you must specify how many clusters you'd like to create, often without knowing how many categories inherently exist in the data. We don't know what the "right" number of clusters is in this situation, or whether there really is a "right" number, so we'll use a bit of a "kitchen sink" method and try lots of things to see if anything sticks.
My process (sketched in code after this list) included:
- Removing numeric characters from the text
- Removing common words such as "degree", and "offense", which would tend to undesirably group disparate arrest types together
- Transforming all terms to the same case, so that "POSSESSION" and "possession" are treated as equivalent
- Performing stemming to isolate the roots of words, so that terms like "possession" and "possess" are treated as equivalent
- Calculating the tf-idf matrix on the cleaned data
- Trying a variety of solutions (2 to 100 clusters) and collecting statistics on each solution to identify whether there is a "best" one
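Here's a rough sketch of what that pipeline can look like in code. It isn't the exact code behind this post - the library choices (scikit-learn and NLTK), the stop word list, and the parameter values are all illustrative - but it walks through the same steps on the arrest descriptions.
import re
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

stemmer = SnowballStemmer("english")
# assumed stop list; the post mentions dropping words like "degree" and "offense"
stop_words = {"degree", "offense", "offence", "st", "nd", "rd", "th"}

def clean(text):
    text = re.sub(r"\d+", " ", text.lower())   # drop numbers, lowercase everything
    tokens = re.findall(r"[a-z]+", text)
    return " ".join(stemmer.stem(t) for t in tokens if t not in stop_words)

descriptions = data.Description.drop_duplicates()
cleaned = descriptions.map(clean)

# tf-idf matrix on the cleaned, stemmed arrest descriptions
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(cleaned)

# try solutions from 2 to 100 clusters, recording the within-cluster
# sum of squares for each so we can look for an elbow
inertias = {}
for k in range(2, 101):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_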
The following assessment plot would ideally flatten out at the best number of clusters:
In this case, I did not see a clear ideal number of clusters, and in trying different solutions there was always a sizable "miscellaneous" category. Therefore, we need to make a qualitative decision about which solution provides enough detail to be informative, but few enough categories to be manageable, accepting that we're just going to have a large "other" category.
After comparing the arrest descriptions of a variety of solutions, I settled on 15 categories. They each contain anywhere from 3 to 70 unique arrest types (out of a total of 190), and are distributed among the inmates as follows:
So our "Misc" category is about 25% smaller than our manual "Other" bucket was, and the process otherwise grouped arrest types into categories that make sense. For example, here are the (original, uncleaned) arrest types in the "Burglary" category:
- BURGLARY 3RD DEGREE-UNOCCUPIED MOTOR VEHICLE-2ND OR SUBSEQUE
- BURGLARY - 3RD DEGREE (VEHICLE)
- BURGLARY 1ST DEGREE
- ATTEMPTED BURGLARY 2ND DEGREE
- BURGLARY 2ND DEGREE
- BURGLARY 3RD DEGREE
- BURGLARY 3RD DEGREE - MOTOR VEHICLE, 1st Offence
- OPERATE VEHICLE WITHOUT OWNERS CONSENT-- Motor Vehicle
- ATTEMPTED BURGLARY 3RD DEGREE
- BURGLARY 3RD DEGREE - MOTOR VEHICLE - 2ND OR SUBSEQUENT OFFE
Obviously, the word "burglary" is pretty definitive for this category, which is why I named it what I did - however, we can see that four of the burglary descriptions also contain the word "vehicle", which is likely why we're getting the "OPERATE VEHICLE WITHOUT OWNERS CONSENT" arrest in this category as well.
The "Misc" category is pretty stubborn - it contains such varied categories that it just doesn't want to significantly break up even when increasing the number of categories, and many more than 15 categories would become difficult to work with. If we had more information about the arrest details, we might be able to make clean clusters by incorporating that information, but we really just have the text to work with here. Like much of analytics, this process is a handy tool that is superior to manual effort, but far from a silver bullet.
Now that we have some useful categorization in place, we can compare arrest types by race:
This bar chart is stacked to 100% to more clearly show each arrest category's racial makeup, and hovering over a bar reveals the exact percentages. The numbers on each bar indicate the total number of inmates in that category, across races.
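A rough, non-interactive version of that chart can be built with pandas and matplotlib along these lines; the column names Race and OffenseCategory follow the sketches above and are assumptions about the underlying data.
import pandas as pd
import matplotlib.pyplot as plt

counts = pd.crosstab(data.OffenseCategory, data.Race)
shares = counts.div(counts.sum(axis=1), axis=0) * 100   # stack each category to 100%

ax = shares.plot(kind="barh", stacked=True, figsize=(9, 6))
ax.set_xlabel("Percent of inmates within arrest category")
# annotate each bar with the total number of inmates in that category
for i, total in enumerate(counts.sum(axis=1)):
    ax.text(101, i, str(total), va="center")
plt.tight_layout()
plt.show()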
Remember from Part 1 that about 29% of inmates are black. I don't see any significant departures from that proportion in these arrest categories, with the exception of the "Weapons/Interference" category, which is over 50% black. Even with only 33 total inmates in this category, this is a statistically significant difference from 29% (p = 0.0036) - and of course, quite different from Polk County's overall population, which is 7% black.
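As a sanity check on that claim, a one-sided binomial test against the 29% baseline can be run with scipy. The count of 17 black inmates below is my assumption (just over half of 33); the post doesn't give the exact count or specify which test was used.
from scipy.stats import binomtest

# 17 of 33 assumed black (just over 50%), tested against the 29% jail-wide rate
result = binomtest(k=17, n=33, p=0.29, alternative="greater")
print(result.pvalue)   # should come out roughly in line with the p-value quoted above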
It is difficult to speculate about what may be behind this number. Although this one category is significantly different, it is still relatively small compared to the total number of black inmates. So although blacks are being arrested at a disproportionately high rate overall, it does not appear to me that these rates fluctuate much by crime type.
STAY TUNED...
That's all for today. I'm not sure about you, but my attention span is beginning to waver on this topic, so I may not get to all of the analyses I proposed in the last post. Let me know if there's anything you'd like me to explore before I close the books on this one!
- Step 1 - gather data. I wrote a program to grab the data on every inmate listed on the Polk County website, so that I could work with it later. Data collection is not the most exciting step, but it's obviously necessary and it can take some time.
- Step 2 - explore the data. Now that I had all the inmate data, I wanted to do some simple explorations. This occasionally required some extra manipulation of the data, and will be helpful background for the heavier-duty stuff.
- Step 3 - statistical modeling. Using text mining techniques, I grouped the 190 unique arrest types into 15 categories.
- Future Possibilities: Regression analysis - can we model inmates' bail amounts? Inmate clustering - are there "profiles" of inmates, e.g. older white men arrested for domestic abuse?