Chapter 9. News Dictionary and Real-Time Tagging System

While a hierarchical data warehouse stores data in files and folders, a typical Hadoop-based system relies on a flat architecture to store your data. Without proper data governance or a clear understanding of what your data contains, there is a real risk of turning data lakes into swamps, where an interesting dataset such as GDELT would be nothing more than a folder containing a vast number of unstructured text files. For that reason, data classification is probably one of the most widely used machine learning techniques in large-scale organizations, as it allows users to properly categorize and label their data, publish these categories as part of their metadata solutions, and ...

Get Mastering Spark for Data Science now with the O’Reilly learning platform.
