A quick refresher: I was curious about the racial makeup of my local county's inmates, and was disappointed in my attempts to find this information readily gathered online. Therefore, I decided to take matters into my own hands, and wrote a program to scrape inmate records from the Polk County Sheriff's website and compile them in a usable format. I found, among other patterns, that while Black residents make up only 7% of the Polk County population, they make up about 29% of inmates. Intrigued, I looked for other patterns, and was surprised by how many inmates were charged with probation and parole violations. I wanted to look further into the actual crimes committed by inmates, but with well over a hundred unique arrest descriptions, I needed to use text mining techniques to bucket them into a more manageable number of categories.
With all of that background work completed, I can now do some deeper modeling. I'm not entirely sure what I'll find, if anything at all. But I won't know until I try!
In this post, I will attempt to answer the following question:
Can we predict how much an inmate's bond will be?
Typically, for this type of question, I would immediately try to model every bond amount in my data set using the same technique. However, recall this graph from our data exploration in part two of this series:
That giant peak at the left represents all of the bonds (over 500 of them) that are $0-$10,000. It turns out that about 60% of those are actually $0, or no bond at all. Also, notice how the graph stretches on for quite a ways up to $1 million with very few peaks. These data are highly skewed, which will create problems for our model. Indeed, when I threw this against a traditional linear model, I found that the higher the actual bond amount, the worse my prediction was. A good model will have no discernible pattern to its errors - because if you can identify the pattern, you can model it out, right?
A common fix for skewed data is to take the natural logarithm, which compresses large values: the gap between 100 and 1,000 shrinks from 900 to about 2.3 on the log scale. The graphs below illustrate this transformation.
However, the logarithm of zero is undefined - the graph on the right excludes all the $0 bonds - so this solution would not work for a large portion of the data. To get around this, I will take a two-phase approach: first, I will build a model to predict whether or not an inmate has a bond, and then I will build another model to predict how much the bond will be (for those who have bonds).
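For readers who like to see things concretely, here is a minimal Python sketch of that two-phase setup. It is not the code behind the results in this post; the function name `split_and_log` and the example bond amounts are made up for illustration.

```python
import math

def split_and_log(bonds):
    """Build the two modeling targets from a list of bond amounts.

    Phase 1 target: does the inmate have a nonzero bond? (True/False)
    Phase 2 target: natural log of the bond, for nonzero bonds only.
    """
    has_bond = [b > 0 for b in bonds]
    log_bonds = [math.log(b) for b in bonds if b > 0]
    return has_bond, log_bonds

# Hypothetical bond amounts, for illustration only
example_bonds = [0, 0, 1000, 5000, 25000, 1000000]
has_bond, log_bonds = split_and_log(example_bonds)

# math.log(0) raises ValueError, which is why the $0 bonds
# have to be handled by a separate yes/no model.
```

Filtering out the zeros before taking the log keeps the second model's target well defined, at the cost of needing the first model to decide who gets a dollar prediction at all.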
Bond or no bond?
To predict whether an inmate has a nonzero bond, I used a decision tree, which attempts to find the series of attributes which will most effectively split the inmates into those who have zero and nonzero bonds. Leveraging all of the variables at my disposal (age, eye color, hair color, race, sex, weight, height, and text-mined arrest category) produced the following tree:
The tree reads as follows: if the arrest description falls into the text-mining-derived buckets of "Interference", "Probation Violation", or "Trespassing" (housed under the variable ClusterName), then the inmate is predicted to have a $0 bond. Otherwise, if the inmate has an arrest that falls into the "Misc" category, they are predicted to have a nonzero bond. Otherwise, we start getting into some odd factors - hair color, height, age, eye color, and race start to appear. Those don't inherently make sense, and are likely showing up due to the limited useful information we really have. It would probably be more useful to have information on the inmates' priors, for example. In decision trees, the most important variables show up the earliest, and in this tree the arrest description categories are doing the heavy lifting. We could decide to "prune" this tree and base the model entirely on the arrest description (and no, I'm not just being punny, we really call it pruning).
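To give a feel for how a tree picks its first split, here is a rough Python sketch of a single split chosen by Gini impurity. This is an assumption on my part about the splitting criterion, and it is much simpler than a real tree algorithm (it only tries "one category versus everything else" and stops after one split). The toy records, the "Theft" category, and the function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_category_split(rows, feature, target):
    """Try each category as an 'equals / does not equal' split and
    return (category, weighted_gini) for the purest split."""
    best = None
    for value in sorted({r[feature] for r in rows}):
        left = [r[target] for r in rows if r[feature] == value]
        right = [r[target] for r in rows if r[feature] != value]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if best is None or score < best[1]:
            best = (value, score)
    return best

# Hypothetical toy data, not the real inmate records
toy_inmates = [
    {"ClusterName": "Probation Violation", "has_bond": False},
    {"ClusterName": "Probation Violation", "has_bond": False},
    {"ClusterName": "Probation Violation", "has_bond": False},
    {"ClusterName": "Misc", "has_bond": True},
    {"ClusterName": "Misc", "has_bond": True},
    {"ClusterName": "Theft", "has_bond": True},
    {"ClusterName": "Theft", "has_bond": False},
]
value, score = best_category_split(toy_inmates, "ClusterName", "has_bond")
```

On this toy data the purest first split is on "Probation Violation", which mirrors the idea that the most informative variable shows up at the top of the tree.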
Let's see how this tree performs by comparing its predictions to reality:
               (real) FALSE   (real) TRUE
(pred) FALSE            190            40
(pred) TRUE              65           377
So, there are 40 instances of the model predicting a $0 bond but being wrong, and 65 instances of predicting a nonzero bond incorrectly. However, the tree is correct a whopping (190 + 377)/(190 + 40 + 65 + 377), or 84%, of the time! That's actually a lot better than I would have expected for such a simple model. This appears to be because whether or not an inmate is given a bond has a lot to do with the actual crime committed, which is what we would hope.
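The accuracy arithmetic above can be written as a small Python helper. This is just a sketch; the nested-dictionary layout (`confusion[predicted][real]`) and the names are my own choices for illustration.

```python
def accuracy(confusion):
    """Share of correct predictions, where confusion[pred][real] is a count."""
    correct = confusion[False][False] + confusion[True][True]
    total = sum(sum(row.values()) for row in confusion.values())
    return correct / total

# Counts from the training confusion matrix above
training_confusion = {
    False: {False: 190, True: 40},   # predicted $0 bond
    True: {False: 65, True: 377},    # predicted nonzero bond
}

print(round(accuracy(training_confusion), 3))  # 0.844
```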
In fact, if I create a new tree just using the raw arrest descriptions, I get the following:
This tree proves to be even more accurate at about 90%.
Now that I have a good method of understanding what makes an inmate have a $0 bond or not, I'm ready to model the dollar amount of the nonzero bonds using a linear model.
Much like with the decision tree, I first try a "kitchen sink" method, throwing all of the variables into the model. Doing so results in only the arrest categories being significant, and the errors still grow as the bond value increases, as seen below. Furthermore, other model diagnostics (R-squared and adjusted R-squared for my fellow geeks) indicate that this model does a bad job of explaining the bond data - it can account for less than 25% of the variation in bond values.
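For anyone who wants the definitions behind those two diagnostics, here is a minimal Python sketch using the standard formulas. The numbers in the example are invented purely to exercise the functions; they are not from the inmate data.

```python
def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalize R-squared for the number of predictors in the model."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Invented values, for illustration only
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]
r2 = r_squared(actual, predicted)
adj_r2 = adjusted_r_squared(r2, n_obs=len(actual), n_predictors=1)
```

The adjusted version matters most when a model has many predictors relative to its observations, since plain R-squared never goes down when a variable is added.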
Since only the arrest categories were significant, perhaps this model could benefit from using the raw arrest description the way that the decision tree did. I was hesitant to do this from the start because it would be like adding 190 variables to the model for only 700 or so observations (in the training set). However, as with the decision tree, doing just that creates a huge improvement in the model (up to 75% of variation explained), and greatly reduces the pattern from the error plot:
So which arrest types are associated with higher or lower bond amounts? There were 34 significant arrest types and I won't list them all, but the highest bond amounts are associated with first-degree murder (27 to 40 times higher than the baseline bond amount of $25,000). The next highest bond amounts are associated with harassment and probation violations at 7% of baseline. Finally, the lowest bond amounts are associated with providing false information and possession of drug paraphernalia (just 1.2% of the baseline). In my inexpert opinion, that seems to scale with the severity of the crime.
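Those "times the baseline" figures are how coefficients are usually read when a linear model is fit to the logarithm of a dollar amount: exponentiating a coefficient gives a multiplier on the baseline. The Python sketch below assumes that setup. The coefficients are back-calculated from the ratios quoted above (they are not the model's actual output), and the function names are hypothetical.

```python
import math

def multiplier(coefficient):
    """Convert a log-scale coefficient into a multiplier on the baseline."""
    return math.exp(coefficient)

def predicted_bond(baseline_bond, coefficient):
    """Baseline bond scaled by the arrest type's multiplier."""
    return baseline_bond * multiplier(coefficient)

baseline = 25000  # baseline bond amount quoted in the text

# Illustrative coefficients derived from the quoted ratios (assumption)
coefficients = {
    "27x baseline": math.log(27),
    "7% of baseline": math.log(0.07),
    "1.2% of baseline": math.log(0.012),
}

bonds = {name: predicted_bond(baseline, c) for name, c in coefficients.items()}
# Roughly $675,000, $1,750, and $300 respectively
```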
To this point, I have been running my diagnostics on my training data, which is a random 80% of the data that I use to create the model. That would be something like trying to guess the secret ingredient in a dish you cooked yourself. To really put a model through its paces, I should see how it performs on data that wasn't used to build it.
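A random 80/20 split like the one described above can be sketched in a few lines of Python. The function name, the fixed seed, and the stand-in records are my own assumptions for illustration.

```python
import random

def train_holdout_split(records, holdout_fraction=0.2, seed=42):
    """Shuffle a copy of the records and carve off a holdout sample."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    holdout = shuffled[:n_holdout]
    training = shuffled[n_holdout:]
    return training, holdout

# Stand-in records: just ID numbers here
records = list(range(100))
training, holdout = train_holdout_split(records)
```

The model is built only from `training`; `holdout` is set aside and used once, at the end, to check how well the model generalizes.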
So how does the decision tree perform on the 20% holdout sample?
               (real) FALSE   (real) TRUE
(pred) FALSE             43             4
(pred) TRUE              19           103
The results are consistent with what I found in training: the tree is correct (43 + 103)/(43 + 4 + 19 + 103), or 86%, of the time. This suggests that we haven't "overfit" the data, or created a model so specific to the training data that it can't be extended to new observations.
For the actual bond amount, I ran into an unforeseen problem - my validation data had raw arrest descriptions that did not occur in the training data. In hindsight, with about 190 unique arrest descriptions in the combined data sets, that isn't entirely surprising. To get around this, I had to ignore several arrest records and leave them without a prediction. After that was done, I compared the predicted values to the truth:
Both models pass validation.
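The workaround for arrest descriptions that never appeared in training (setting those records aside without a prediction) might look something like this in Python. It is only a sketch; the field name `arrest` and the toy records are hypothetical.

```python
def split_by_known_levels(training_rows, validation_rows, key):
    """Separate validation rows the model can score from those it cannot.

    A row is scorable only if its category was seen during training.
    """
    known = {row[key] for row in training_rows}
    scorable = [r for r in validation_rows if r[key] in known]
    skipped = [r for r in validation_rows if r[key] not in known]
    return scorable, skipped

# Hypothetical toy records
training_rows = [{"arrest": "THEFT"}, {"arrest": "HARASSMENT"}]
validation_rows = [
    {"arrest": "THEFT"},
    {"arrest": "ARSON"},        # never seen in training
    {"arrest": "HARASSMENT"},
]
scorable, skipped = split_by_known_levels(training_rows, validation_rows, "arrest")
```

Counting the rows in `skipped` is worth doing, since a large skipped share would mean the validation results only cover part of the holdout sample.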
If you're feeling like that was a lot of work, graphs, and statistics to prove a common-sense point, you're not wrong. Of the inmate data available on the county website, the specific crime is by far the most important factor in determining whether a bond is assigned and how much it is. This is, of course, exactly what we would hope and assume to be the case. But for a skeptic like myself it's always nice to have validation from a little data - hence the work, graphs, and statistics.
I had actually hoped to include inmate profiling in this post and make this the series finale, but it took on a life of its own and I thought I'd spare us all for this week. So next time, we'll take a look at some profiles of inmate "types" that I identified using machine learning techniques - e.g. "middle-aged white women arrested for probation violations".
What questions do you have about the world around us? Have you ever wondered how data could help you answer them? Let me know in the comments if there's anything you'd like to see me cover in a future post!