Chapter 26. Cleaning by Grouping Data
Data preparation would not be necessary if someone else always curated a perfect data set for us. In practice, we can (and sadly often have to) clean the data ourselves. As mentioned in Chapter 9, one of the most common challenges you’ll face in data prep is cleaning up string data: for example, standardizing string values enough that you can count instances of a value even when some entries contain typos. One technique comes in especially handy for this scenario: grouping. This chapter covers what grouping is and how to use the built-in grouping tools in Prep Builder.
What Does Grouping Mean?
Grouping means applying logic to (mostly) string data fields to recognize a common characteristic among them, such as their meaning or intended value. For example, we might expect the following data items to be grouped together:
- Edinburgh
- Edenburgh
- Edinborough
- 3d!nburgh
As humans, we can recognize that these different names probably all refer to Edinburgh, Scotland (especially if the column is called City Name). Data software does not see the data the same way: to it, each entry is a distinct sequence of characters, so we have to give it some direction on how to handle them.
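To make the idea concrete outside of Prep Builder, here is a minimal Python sketch of one way software can be given that direction. It is not how Prep Builder's grouping tools work internally; it simply scores each raw value against a list of intended values using the standard library's `difflib.SequenceMatcher` and maps close matches to the intended spelling. The function name `group_values`, the `canonical` list, and the 0.6 threshold are all assumptions chosen for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def group_values(values, canonical, threshold=0.6):
    """Map each raw value to its closest canonical value.

    Illustrative helper (not part of any library): a value is left
    unchanged when no canonical value scores at or above `threshold`.
    """
    mapping = {}
    for value in values:
        best, best_score = value, 0.0
        for target in canonical:
            # ratio() returns a similarity score between 0.0 and 1.0
            score = SequenceMatcher(None, value.lower(), target.lower()).ratio()
            if score > best_score:
                best, best_score = target, score
        mapping[value] = best if best_score >= threshold else value
    return mapping


cities = ["Edinburgh", "Edenburgh", "Edinborough", "3d!nburgh"]
grouped = group_values(cities, canonical=["Edinburgh"])

# Count instances after grouping instead of counting raw spellings
counts = Counter(grouped[city] for city in cities)
print(counts)
```

With this threshold, all four spellings are counted under "Edinburgh", while an unrelated value such as "Glasgow" would be left as it is. The threshold is the judgment call: set it too low and distinct values get merged, set it too high and obvious typos stay separate.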
Why Use Grouping
Grouping is a technique you need to learn for multiple reasons.
Improving Accuracy
When they hear the term “data,” most people seem to think about system-generated data being fed into databases. After a year of data analysis, most data workers would consider ...