Chapter 3. Data Profiling Patterns

 

Time looking at your data is always well spent.

 
 --Witten, et al.

In the previous chapter, you studied patterns for ingesting different types of data into the Hadoop ecosystem and egressing it again, so that the next steps in the analytics process can begin. In this chapter, we look at the most widely used design patterns for data profiling. The chapter takes a step-by-step approach to diagnosing whether your dataset has problems and, ultimately, to turning the dataset into usable information.

Data profiling is a necessary first step toward any meaningful insight into the data ingested by Hadoop. It establishes the content, context, structure, and condition of that data before any analysis is built on it.
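To make the idea of checking content, structure, and condition concrete, here is a minimal sketch in plain Python. It is an illustration only, not the chapter's own method: the function name `profile_column` and the sample `ages` column are hypothetical, and the rule that blank strings count as nulls is an assumption.

```python
from collections import Counter

def profile_column(values):
    """Summarize one column: row count, nulls, distinct values, inferred types, numeric range."""
    # Assumption: None and blank strings both count as missing values.
    non_null = [v for v in values if v is not None and str(v).strip() != ""]
    type_counts = Counter()
    numbers = []
    for v in non_null:
        try:
            numbers.append(float(v))      # value parses as a number
            type_counts["numeric"] += 1
        except (TypeError, ValueError):
            type_counts["text"] += 1      # anything else is treated as text
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "types": dict(type_counts),
        "min": min(numbers) if numbers else None,
        "max": max(numbers) if numbers else None,
    }

# Hypothetical sample column: ages ingested as text, with missing and malformed entries.
ages = ["34", "27", "", "n/a", "27", None, "51"]
report = profile_column(ages)
print(report)
```

Even on a tiny column like this, the summary surfaces the kinds of problems profiling is meant to catch: missing values, a non-numeric entry in a column that should be numeric, and duplicates.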

The data ...
