Chapter 30. Deduplicating
Understanding the granularity of your data is key to preparing it well for analysis. When you investigate granularity, however, the answer is not always clear, and the cause is often duplicate records. This chapter covers how to recognize duplicates in your data set and what you can do about them.
How to Identify Duplicates
Unless you intentionally look for duplicates, you are relying on someone who “knows the numbers” to notice that something is wrong. It is therefore important to actively prevent duplicates in your data set and to know how to remove them when required. With duplicates removed, aggregation becomes easier because you can simply sum records to find totals, which in turn makes the resulting data set easier to analyze.
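To see why duplicates matter for totals, here is a minimal Python sketch. It is not a Prep Builder feature; the rows, Case IDs, and order values are invented for illustration. It assumes that a duplicate means an exactly repeated row.

```python
# Hypothetical order rows as (case_id, order_value); not taken from the book's data.
rows = [
    (101, 20.0),
    (102, 35.0),
    (102, 35.0),  # exact duplicate of the previous row
    (103, 15.0),
]

# Summing every row counts the duplicated order twice.
naive_total = sum(value for _, value in rows)

# dict.fromkeys keeps the first occurrence of each identical row, in order.
deduplicated = list(dict.fromkeys(rows))
clean_total = sum(value for _, value in deduplicated)

print(naive_total)  # 105.0
print(clean_total)  # 70.0
```

The inflated total still looks like a plausible number, which is why someone has to go looking for the duplicate to find it.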
Let’s look at an example where a system captures when orders come in to Chin & Beard Suds Co. When analyzing orders, we’d expect to have a data set where each order has its own row. When loading a data set into Prep Builder, you can easily determine:
- How many rows there are for each order (here represented as Case ID)
- Whether there is an even distribution of rows as shown in the Profile pane
By clicking on a single Case ID, you can see in the Data pane that there are multiple rows per ID, and different IDs have a different number of rows in the data set (Figure 30-1).
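The same two checks can be approximated outside Prep Builder. The following Python sketch counts rows per Case ID and tests whether the counts are evenly distributed. The Case ID values are hypothetical, and the code only mirrors the idea of the Profile pane, not how it works internally.

```python
from collections import Counter

# Hypothetical Case ID column; the values are invented for illustration.
case_ids = ["C-001", "C-002", "C-002", "C-003", "C-003", "C-003"]

# Number of rows for each Case ID.
rows_per_case = Counter(case_ids)

# Case IDs that appear on more than one row.
repeated = {case_id: n for case_id, n in rows_per_case.items() if n > 1}

# The distribution is even only if every Case ID has the same row count.
is_even = len(set(rows_per_case.values())) == 1

print(repeated)  # {'C-002': 2, 'C-003': 3}
print(is_even)   # False
```

An uneven distribution like this one suggests that the data set is not at the expected granularity of one row per order.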