The unstructured text profiling pattern

This section describes the unstructured text profiling design pattern in which we use Pig scripts on free-form text data to know important statistics.

Background

Text mining is done on unstructured data ingested by Hadoop to extract interesting and non-trivial meaningful patterns, from blocks of meaningless data. Text mining is an interdisciplinary field, which draws on information retrieval, data mining, machine learning, statistics, and computational linguistics, to accomplish the extraction of meaningful patterns from text. Typically, the parallel-processing power of Hadoop is used to process massive amounts of textual data, to classify documents, cluster tweets, build ontologies, extract entities, perform ...

Get Pig Design Patterns now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.