Tool of Thought

APL for the Practical Man

On Categorical Data

December 6, 2023

Consider getting a useful first impression and understanding of a single column in a database table, (or a vector of values all of the same type). If there are only a few unique values in the column, say a dozen or less, then a frequency distribution is appropriate. We get an immediate, informative overview of the data, regardless of the type. This is easily displayed in a bar chart. Here we have the distribution of stints for major league baseball players in 2019. A stint is a period of time with a particular team. We can see that most players spent the entire season with one team, while 12 players played for 3 teams:

Created by Causeway SVG engine - SharpPlot v3.71.0 Paint the paper ===== Border ===== for X-axis labels 200 400 600 800 1,000 1,200 1,400 Heading, subheading and footnotes ===== Region ===== Y-axis labels 3 2 1 X-Axis Ticks ===== X-Axis tickmarks Y-Axis Ticks ===== Y-Axis tickmarks Start of Horizontal Barchart =========== Axes ===== Data value labels ... 12 146 1,410 Reset to original origin

However, as the number of unique values grows, a frequency distribution becomes less and less useful. When every value is unique, the distribution degenerates into the entire original column catenated with a vector of 1's. For quantitative or temporal data, this problem is easily solved by grouping into bins or buckets, reducing the number of categories. Here we have the number of games played per player for 2019:

Created by Causeway SVG engine - SharpPlot v3.71.0 Paint the paper ===== Border ===== for X-axis labels 50 100 150 200 250 300 350 400 450 500 550 600 Heading, subheading and footnotes ===== Region ===== Y-axis labels 160 to 179 140 to 159 120 to 139 100 to 119 80 to 99 60 to 79 40 to 59 20 to 39 0 to 19 X-Axis Ticks ===== X-Axis tickmarks Y-Axis Ticks ===== Y-Axis tickmarks Start of Horizontal Barchart =========== Axes ===== Data value labels ... 11 84 81 60 78 137 161 391 565 Reset to original origin

However, if the data is categorical, it is generally not possible to meaningfully group the data. One option is to produce a frequency distribution that shows only the top 10 (say) categories, grouping the remainder into an "other" category. This works well when there are many categories, and the categories are of varying sizes, keeping the "other" category relatively small:

Created by Causeway SVG engine - SharpPlot v3.71.0 Paint the paper ===== Border ===== for X-axis labels 500 1,000 1,500 2,000 2,500 3,000 3,500 4,000 4,500 Heading, subheading and footnotes ===== Region ===== Y-axis labels Other Moseley Crozet Suffolk Chesapeake Clarksville Chesterfield Midlothian Charlottesville Virginia Beach Richmond X-Axis Ticks ===== X-Axis tickmarks Y-Axis Ticks ===== Y-Axis tickmarks Start of Horizontal Barchart =========== Axes ===== Data value labels ... 4,500 100 200 200 200 200 400 500 500 1,100 2,100 Reset to original origin

If there are many categories and they are similar-sized, this breaks down. Here we have a distribution of the PlayerID column, which is mostly unique, except for players that have done multiple stints in the season:

Created by Causeway SVG engine - SharpPlot v3.71.0 Paint the paper ===== Border ===== for X-axis labels 200 400 600 800 1,000 1,200 1,400 1,600 Heading, subheading and footnotes ===== Region ===== Y-axis labels Other mejiaad01 maldoma01 josepco01 fontwi01 dullry01 darnatr01 broxtke01 biddlje01 austity01 altheaa01 X-Axis Ticks ===== X-Axis tickmarks Y-Axis Ticks ===== Y-Axis tickmarks Start of Horizontal Barchart =========== Axes ===== Data value labels ... 1,538 3 3 3 3 3 3 3 3 3 3 Reset to original origin

What, then, is to be done in the case of a categorical column with many evenly distributed unique values? If a frequency distribution is inadequate how about a frequency distribution of the frequency distribution? That is, a table displaying the number of values that occur once, the number of values that occur twice, the number of values that occur three times, etc.:

Value OccursUnique CountTotal Rows
Once1,2641,264
Twice134268
Three times1236
Four times00
Five +00
Total1,4101,568

This table is a much more useful first look at high-variance categorical data. For example, it is immediately apparent if the values are unique and suitable for a key column. It is easy to identify outliers, that is duplicate values or triplicate values. Let's call this a second-order frequency distribution.

By inspection we can tell whether a first-order or second-order distribution will be more useful, and come up with some back-of-the-envelope algorithm to make the choice, which may well be sufficient. But is there a way to actually compute the variance of a categorical column and use that measure to determine what exactly is "high-variance" categorical data? That question and some APL code will be explored in a future post.