Chapter 11. Profiling Data
The art of data preparation lies in understanding the data set well enough to determine what you need to do to prepare it for analysis. Understanding the profile of the data is key to forming that full view. Without profiling the data, you can easily miss an obvious preparation step or create unnecessary work. This chapter explores what profiling is, why profiling data is important, and how Prep profiles data.
What Is a Profile?
By profile, I mean the characteristics of the data set. As discussed in earlier chapters, understanding the types of data you have in the data set is essential to your analysis. Equally important is understanding how many categorical data fields the data set has and how much their values vary. Determining the data set’s level of granularity will help you identify how many unique records there are and whether there are duplicate records that you need to remove during data preparation. All of these characteristics form the foundation of the data set profile, which comprises the following factors:
- Minimum, maximum, and range of values: Does the range between the minimum and maximum values make sense?
- Data outside of limits: Are there natural limits in the data, like 100%, or current dates that cannot be exceeded but have been?
- Outliers: Do the values lie inside a certain range except for one or a few that sit outside of it?
- Irregular number of records: Is there a consistent number of rows for certain dimensions, and does this ...
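Outside of Prep, the checks listed so far can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of how Prep works: the function names, the sample rows, and the `completion_pct` and `region` fields are hypothetical, and the 1.5 × IQR rule is just one common way to flag outliers.

```python
from collections import Counter
from statistics import quantiles


def profile_numeric(values, lower_limit=None, upper_limit=None):
    """Summarize one numeric field: min, max, range, limit breaches, outliers."""
    lowest, highest = min(values), max(values)
    # Values that break a natural limit, such as a percentage above 100.
    outside = [
        v for v in values
        if (lower_limit is not None and v < lower_limit)
        or (upper_limit is not None and v > upper_limit)
    ]
    # Assumed outlier rule: more than 1.5 * IQR beyond the quartiles.
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
    return {
        "min": lowest,
        "max": highest,
        "range": highest - lowest,
        "outside_limits": outside,
        "outliers": outliers,
    }


def rows_per_category(rows, field):
    """Count rows for each value of a categorical field."""
    return Counter(row[field] for row in rows)


def duplicate_rows(rows):
    """Return each fully identical record that appears more than once."""
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return [dict(key) for key, n in counts.items() if n > 1]


# Hypothetical sample data, invented for this example.
rows = [
    {"region": "North", "completion_pct": 40},
    {"region": "North", "completion_pct": 55},
    {"region": "North", "completion_pct": 60},
    {"region": "South", "completion_pct": 45},
    {"region": "East", "completion_pct": 50},
    {"region": "East", "completion_pct": 250},
    {"region": "West", "completion_pct": 52},
    {"region": "West", "completion_pct": 52},
]

summary = profile_numeric(
    [row["completion_pct"] for row in rows], lower_limit=0, upper_limit=100
)
print(summary)
print(rows_per_category(rows, "region"))
print(duplicate_rows(rows))
```

In this sample, the value 250 is reported both as outside the 0 to 100 limit and as an outlier, the regions have uneven row counts, and the repeated West record is returned as a duplicate. Whether any of these is a genuine error still depends on what you know about the data.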