Categorizing Non-Categorical Data
Problem
You need to perform a summary on a set of values that are mostly unique and do not categorize well.
Solution
Use an expression to group the values into categories.
Discussion
One important application for grouping by expression results is to
provide categories for values that are not particularly categorical.
This is useful because GROUP BY works best for columns with repetitive
values. For example, you might attempt to perform a population analysis
by grouping records in the states table using values in the pop column.
As it happens, that would not work very well, due to the high number of
distinct values in the column. In fact, they’re all distinct, as the
following query shows:
mysql> SELECT COUNT(pop), COUNT(DISTINCT pop) FROM states;
+------------+---------------------+
| COUNT(pop) | COUNT(DISTINCT pop) |
+------------+---------------------+
|         50 |                  50 |
+------------+---------------------+
In situations like this, where values do not group nicely into a small number of sets, you can use a transformation that forces them into categories. First, determine the population range:
mysql> SELECT MIN(pop), MAX(pop) FROM states;
+----------+----------+
| MIN(pop) | MAX(pop) |
+----------+----------+
|   453588 | 29760021 |
+----------+----------+
We can see from that result that if we divide the pop values by five million, they’ll group into six categories, a reasonable number. (The category ranges will be 1 to 5,000,000; 5,000,001 to 10,000,000; and ...
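The bucketing described above can be sketched as a query like the following. This is a minimal sketch under stated assumptions, not necessarily the query the recipe itself goes on to use: it rounds each quotient up with CEIL() so that the category boundaries match the ranges listed (1 to 5,000,000 falls in the first category, 5,000,001 to 10,000,000 in the second), and the aliases max_pop_millions and num_states are illustrative names that do not come from the original text.

```sql
-- Sketch: force pop values into 5-million-wide categories and count
-- the states in each one.
-- CEIL(pop / 5000000) maps 1..5,000,000 to 1, 5,000,001..10,000,000 to 2,
-- and so on; multiplying by 5 labels each category with its upper
-- bound in millions.
-- The aliases max_pop_millions and num_states are hypothetical names.
SELECT CEIL(pop / 5000000) * 5 AS max_pop_millions,
       COUNT(*) AS num_states
FROM states
GROUP BY max_pop_millions
ORDER BY max_pop_millions;
```

Changing the divisor changes the category width: a smaller value gives more, narrower categories, and a larger one gives fewer, wider categories.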