My last post was relatively analytics-heavy, which is not exactly what I'm aiming for here on the blog. While I was able to accurately predict an inmate's bond amount, it was probably not the most digestible information, and in reality represented just one piece of the complex view into the local inmate population.
What if we could take a step back and gain a different perspective?
Similar to how I used text mining techniques in part three to group the arrest descriptions into higher-level buckets of similar arrest types, I can identify naturally occurring, relatively homogeneous groups of inmates using the information at my disposal.
As with last time, I didn't know in advance how many groups would best describe the data, so I tried solutions ranging from 2 to 20 groups, looking for the point where the inmates within each group become very similar to one another without creating more groups than we can handle. Of course, the more groups created, the more homogeneous each group will be. After all, if each data point were in its own "group", then each "group" would be perfectly defined. In the trials that I ran, I found that a 5-group solution struck the right balance between similarity within groups and the number of groups created.
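For readers who want to see the mechanics of that trade-off, here is a minimal sketch of scanning several group counts and recording how tightly each solution fits. It is not the code behind this post: the k-modes-style routine (which counts mismatched attributes instead of measuring numeric distance), the helper names `mismatches` and `k_modes`, and the tiny made-up records are all assumptions chosen for illustration.

```python
import random
from collections import Counter

def mismatches(a, b):
    """Count the attributes on which two records differ."""
    return sum(x != y for x, y in zip(a, b))

def k_modes(records, k, seed=0, max_iter=20):
    """Simple k-modes-style clustering; returns (modes, labels, cost)."""
    rng = random.Random(seed)
    # Start from k distinct records so no two groups begin identical.
    modes = rng.sample(sorted(set(records)), k)
    labels = None
    for _ in range(max_iter):
        # Assign each record to the group whose profile it mismatches least.
        new_labels = [
            min(range(k), key=lambda j: mismatches(r, modes[j]))
            for r in records
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Redefine each profile as the most common value of every attribute.
        for j in range(k):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:
                modes[j] = tuple(
                    Counter(col).most_common(1)[0][0] for col in zip(*members)
                )
    # Total mismatches between records and their group's profile.
    cost = sum(mismatches(r, modes[lab]) for r, lab in zip(records, labels))
    return modes, labels, cost

# Made-up records: (age band, race, sex, arrest type). Not the real data.
records = (
    [("18-29", "white", "M", "misc")] * 6
    + [("18-29", "black", "M", "drugs")] * 5
    + [("30-44", "white", "F", "misc")] * 4
    + [("45-59", "black", "M", "misc")] * 2
    + [("60+", "other", "M", "traffic")] * 1
)

# Scan a range of group counts and record the fit of each solution.
costs = {k: k_modes(records, k)[2] for k in range(2, 6)}
for k, cost in costs.items():
    print(f"{k} groups -> total mismatches: {cost}")
```

On real data you would look for the group count after which the cost stops dropping meaningfully. With this toy data there are only five distinct records, so five groups fit perfectly, which is the "every point in its own group" extreme described above.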
The graphic below summarizes the inmate profiles that my process identified, as well as the number of inmates assigned to each. Please forgive my terrible Canva skills.
Interestingly, three out of five profiles were white, only one was female, and all but two fell under the "misc" arrest type. The two non-white profiles could basically be called "young black men" and "middle-aged black men", although the latter is a pretty small group, having just 32 inmates.
Essentially, we are seeing some of the most common combinations of age, race, sex, and arrest type, and those combinations are necessarily built from the attribute values that were most common on their own. For example, we had very few inmates over the age of 60, so we wouldn't expect to see a group that includes that age range in its definition unless we expanded to a large number of groups. This makes sense and is consistent with the results of my data exploration from part two.
This process does assign every inmate to one of the five groups, so those older inmates (as well as inmates of other races and inmates who committed crimes in other categories) are mixed in with the "closest" group, as defined by the algorithm. That is one of the imperfections of such a process, and a compromise that we often have to make in analytics: accuracy vs. usability. This analysis is less accurate than last week's, but much easier to understand.
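To make the "closest group" idea concrete, here is a small sketch of how a record that matches no profile exactly still lands in one. The three profiles, their names, and the mismatch-count notion of closeness are hypothetical stand-ins, not the five profiles or the algorithm from this post.

```python
def mismatches(a, b):
    """Count the attributes on which two records differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical profiles: (age band, race, sex, arrest type).
profiles = {
    "profile_a": ("18-29", "white", "M", "misc"),
    "profile_b": ("18-29", "black", "M", "drugs"),
    "profile_c": ("30-44", "white", "F", "misc"),
}

def closest_profile(record, profiles):
    """Return the name of the profile that differs from the record least."""
    return min(profiles, key=lambda name: mismatches(record, profiles[name]))

# An older inmate has no profile of his own, so he joins the nearest one.
older_inmate = ("60+", "white", "M", "misc")
assigned = closest_profile(older_inmate, profiles)
print(assigned, mismatches(older_inmate, profiles[assigned]))
```

The older inmate differs from his assigned profile only in age, so he is absorbed into it even though the profile's label says nothing about inmates over 60.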
This, at long last, concludes my month-long series on Polk County inmate data. Between web scraping, data exploration, text mining, binary and linear modeling, and now clustering, I certainly learned a lot about the local inmate population. I hope you learned something too.
Want to try something like this on your own data? Head over to the code page for the source code behind each step of this project!