The string profiling pattern

This section describes the string profiling design pattern in which we use Pig scripts on textual data to know important statistics.


A majority of Big Data implementations deal with text data embedded in columns. To gain insight from these columns, they have to be integrated with other enterprise-structured data. This design pattern elaborates a few of the ways that help understand the quality of textual data.


The quality of textual data can be ascertained by applying basic statistical techniques on the values of the attributes. Finding the string length is the most important dimension in selecting the appropriate data types and sizes for the target system. You can use the maximum and minimum string ...

Get Pig Design Patterns now with the O’Reilly learning platform.

O’Reilly members experience live online training, plus books, videos, and digital content from nearly 200 publishers.