The unstructured text data validation and cleansing design pattern

The unstructured text validation and cleansing pattern demonstrates ways to cleanse unstructured data by applying various data pre-processing techniques.
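As a rough illustration of what such pre-processing can look like, here is a minimal sketch in plain Python rather than Pig. The specific steps (Unicode normalization, control-character removal, lowercasing, punctuation stripping, whitespace collapsing) and the names `cleanse` and `is_valid` are assumptions chosen for this example, not the pattern's actual implementation.

```python
import re
import unicodedata


def cleanse(text: str) -> str:
    """Normalize one raw text record (illustrative steps only)."""
    # Fold compatibility characters (e.g. full-width letters) to canonical forms.
    text = unicodedata.normalize("NFKC", text)
    # Replace control and other "C*" category characters (tabs, newlines, etc.) with spaces.
    text = "".join(" " if unicodedata.category(ch).startswith("C") else ch for ch in text)
    # Lowercase so that "Hadoop" and "HADOOP" compare equal.
    text = text.lower()
    # Replace punctuation and symbols (anything not a word character or whitespace) with spaces.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


def is_valid(text: str, min_tokens: int = 1) -> bool:
    """Hypothetical validation rule: keep records with at least min_tokens words after cleansing."""
    return len(cleanse(text).split()) >= min_tokens
```

For instance, `cleanse("  Hello,\tWORLD!!\n")` yields `"hello world"`, while a record made up only of punctuation fails `is_valid` and could be routed to a reject file.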

Background

Cleaning huge amounts of unstructured data and making it ready for analysis is one of the more challenging tasks in Hadoop. Textual data, which includes documents, emails, text files, and chat files, is inherently unorganized and has no defined data model when Hadoop ingests it.

To open up unstructured data for analysis, we have to bring a semblance of structure to it. The foundation of organizing unstructured data is to integrate it with the structured data that already exists in the enterprise by performing ...
