Data standardization

Once information extraction is complete and any necessary cleanup is done, we need to decide how to save the result. For flat, tabular data, a simple CSV (comma-separated values) format is usually enough. If the output is more complex, for example nested or hierarchical, we can choose XML (Extensible Markup Language) or JSON (JavaScript Object Notation) instead.

These formats are widely used standards, and almost all current technologies can read and write them easily. To keep things simple at first, though, it's good to start with CSV.
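As a rough illustration of the choice between these formats, the following sketch writes the same cleaned records as CSV and as JSON using only the Python standard library. The sample records, the field names (id, name, city), and the helper names to_csv and to_json are assumptions made up for this example; they do not come from any particular extraction pipeline.

```python
import csv
import io
import json

# Hypothetical cleaned records produced by an extraction step.
records = [
    {"id": 1, "name": "Alice", "city": "Pune"},
    {"id": 2, "name": "Bob", "city": "Austin, TX"},
]


def to_csv(rows, fieldnames):
    """Serialize a list of flat dicts to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)  # values containing commas are quoted automatically
    return buffer.getvalue()


def to_json(rows):
    """Serialize the same records to JSON text, which also allows nesting."""
    return json.dumps(rows, indent=2)


csv_text = to_csv(records, ["id", "name", "city"])
json_text = to_json(records)
print(csv_text)
print(json_text)
```

CSV works well here because every record has the same flat set of fields. If a record later needed a nested value, such as a list of addresses per person, JSON or XML would likely be the more natural fit.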
