Chapter 12. Sampling Data Sets
To sample or not to sample? That is the question. In a world where data volumes are growing, storage solutions are getting cheaper, and data creation is easier than ever, data preppers must decide whether to use a sample subset and understand the implications of doing so. This chapter will look at why sampling should be used with caution, when you might need to sample, and what techniques you can use to sample data in Prep Builder.
One Simple Rule: Use It All If Possible
The reason we use data is to find the story, trends, and outliers that will help us make better decisions in our everyday and working lives. So why not aim to use all the data and information you can?
Using the full data set is not always possible, though, often because of its size. Preppin’ Data exists because data often needs to be prepared for analysis. To do that, we need to know what can be cleaned completely and what cannot. If a data set can’t be cleaned completely, it makes sense to remove the sections that can’t be cleaned. That is not what is meant by sampling, though. Sampling means using a subset of the full data set, not because the data can’t be cleaned but for other reasons.
Sampling to Work Around Technical Limitations
A sample allows you to take the data you need to clean and freeze it in time to deal with the two main technical challenges of data prep:
- Volume of data
- A sample lets you set up your analysis ...