Chapter 8. Cataloging the Data Lake

Data lakes tend to suffer from a number of traits that make them difficult, if not impossible, to navigate. They contain a massive number of data sets. Field names are often cryptic, and some types of data sets (such as delimited files and unstructured data collected from online comments) may lack header lines altogether. Even well-labeled data sets may use inconsistent field names and follow different naming conventions. It is virtually impossible to guess what a particular attribute may be called in different files, and thus impossible to find all instances of that attribute.

As a result, data must either be documented as each new data set is ingested or created in the lake, or be examined extensively by hand. Neither alternative is scalable or manageable given the size and variety of data typically found in big data systems.

Data catalogs solve the problem by tagging fields and data sets with consistent business terms and providing a shopping-type interface. Users find data sets by describing what they are looking for in the business terms they already use, and they understand the data in those data sets through tags and descriptions written in the same terms. In this chapter we’ll explore some of the many uses of data catalogs and take a quick look at some of the data cataloging products on the market today.
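
To make the tagging idea concrete, here is a minimal sketch in Python of a toy in-memory catalog. It is an illustration only, not how any particular catalog product works: the Catalog class, its tag and find_data_sets methods, and the file and field names are all hypothetical. It shows cryptically named fields in different files being tagged with the same business term, and data sets then being found by that term rather than by field name.

```python
from collections import defaultdict


class Catalog:
    """Toy in-memory catalog (hypothetical): maps business terms to tagged fields."""

    def __init__(self):
        # business term (lowercased) -> set of (data_set, field) pairs tagged with it
        self._index = defaultdict(set)

    def tag(self, data_set, field, term):
        """Tag one field of a data set with a business term."""
        self._index[term.lower()].add((data_set, field))

    def find_data_sets(self, *terms):
        """Return the data sets that have a tagged field for every given term."""
        matches = None
        for term in terms:
            data_sets = {ds for ds, _ in self._index.get(term.lower(), set())}
            matches = data_sets if matches is None else matches & data_sets
        return sorted(matches or [])


# Hypothetical files whose field names are cryptic or inconsistent.
catalog = Catalog()
catalog.tag("crm_export.csv", "cust_nm", "Customer Name")
catalog.tag("crm_export.csv", "col_7", "Postal Code")
catalog.tag("orders_2019.tsv", "CNAME", "Customer Name")

print(catalog.find_data_sets("customer name"))
print(catalog.find_data_sets("customer name", "postal code"))
```

The first search returns both files even though one calls the field cust_nm and the other CNAME; the second narrows the result to the one file that carries both terms. A real catalog would add descriptions, persistence, and ways to assign tags at scale, none of which this sketch attempts.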

Organizing the Data

While the directory structure and naming conventions described in Chapter 7 can help analysts navigate ...
