The Enterprise Big Data Lake

by Alex Gorelik
March 2019
Beginner to intermediate
221 pages
English
O'Reilly Media, Inc.

Chapter 8. Cataloging the Data Lake

Data lakes tend to suffer from several traits that make them difficult, if not impossible, to navigate. They contain a massive number of data sets. Field names are often cryptic, and some types of data sets, such as delimited files and unstructured data collected from online comments, may lack header lines altogether. Even well-labeled data sets may use inconsistent names and different naming conventions. It is virtually impossible to guess what a particular attribute may be called in different files, and thus virtually impossible to find all instances of that attribute.

As a result, data must either be documented as new data sets are ingested or created in the lake, or go through extensive manual examination. Neither alternative is scalable or manageable for the size and variety typically found in big data systems.

Data catalogs solve the problem by tagging fields and data sets with consistent business terms and providing a shopping-type interface. Users find data sets by describing what they are looking for in the business terms they already use, and they understand the data in those data sets through tags and descriptions expressed in the same terms. In this chapter we’ll explore some of the many uses of data catalogs and take a quick look at some of the data cataloging products on the market today.
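
To make the tagging idea concrete, here is a minimal sketch of a catalog as an inverted index from business terms to physical fields. It is an illustration only, not how any particular cataloging product works: the `Catalog` class, its `tag`, `find`, and `describe` methods, and the file and field names are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class Catalog:
    """Hypothetical in-memory catalog: business terms tagged onto fields."""

    def __init__(self) -> None:
        # data set name -> field name -> business terms tagged on that field
        self._tags: Dict[str, Dict[str, Set[str]]] = {}
        # Inverted index: business term -> {(data set name, field name)}
        self._index: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)

    @staticmethod
    def _normalize(term: str) -> str:
        # Treat "Customer Name" and "customer name " as the same term.
        return term.strip().lower()

    def tag(self, data_set: str, field_name: str, term: str) -> None:
        """Attach a business term to one field of one data set."""
        key = self._normalize(term)
        fields = self._tags.setdefault(data_set, {})
        fields.setdefault(field_name, set()).add(key)
        self._index[key].add((data_set, field_name))

    def find(self, term: str) -> List[Tuple[str, str]]:
        """Return every (data set, field) pair tagged with the term."""
        return sorted(self._index.get(self._normalize(term), set()))

    def describe(self, data_set: str) -> Dict[str, List[str]]:
        """Return the business terms attached to each field of a data set."""
        return {f: sorted(t) for f, t in self._tags[data_set].items()}


# Hypothetical data sets: one with a cryptic name, one with no header line.
catalog = Catalog()
catalog.tag("crm_export.csv", "cust_nm", "Customer Name")
catalog.tag("orders_2019.dat", "col_3", "Customer Name")
catalog.tag("orders_2019.dat", "col_7", "Order Total")

matches = catalog.find("customer name")
print(matches)
```

Searching for the business term returns both the cryptically named `cust_nm` field and the headerless `col_3` column, which is the kind of lookup that field names alone cannot support. A real catalog would also need to persist its tags and decide who assigns them; this sketch leaves both out.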

Organizing the Data

While the directory structure and naming conventions described in Chapter 7 can help analysts navigate ...



ISBN: 9781491931547