15.5 Data Lakes

A data lake is a repository of raw data that is taken directly from data sources in real time and stored in its original form for possible later use. The term was coined in 2010 by James Dixon, founder of the business analytics company Pentaho. He described his use of a lake as an analogy in his blog (jamesdixon.wordpress.com), saying, “If you think of a datamart as a store of bottled water—cleansed and packaged and structured for easy consumption—the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
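To make "stored in its original form" concrete, the following is a minimal sketch in Python, not a description of any particular data lake product. It assumes a local temporary directory stands in for the lake. The function names `land_raw` and `read_as_json`, the folder layout, and the sample sensor record are all hypothetical, introduced only for illustration.

```python
import json
import tempfile
from pathlib import Path


def land_raw(lake_dir: Path, source: str, name: str, payload: bytes) -> Path:
    """Write the payload byte-for-byte under a per-source folder (no parsing, no cleansing)."""
    target = lake_dir / source / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


def read_as_json(path: Path) -> dict:
    """Impose structure only at read time, when a user comes to examine the data."""
    return json.loads(path.read_bytes())


# Hypothetical lake location and made-up sample record.
lake = Path(tempfile.mkdtemp())
raw = b'{"sensor": "t1", "temp_c": "21.5"}'
stored = land_raw(lake, "sensors", "reading-0001.json", raw)

# The lake holds exactly what the source sent.
assert stored.read_bytes() == raw

# Interpretation (here, treating temp_c as a number) is the reader's choice.
reading = read_as_json(stored)
temp = float(reading["temp_c"])
```

In this sketch the temperature stays a string in the lake because that is how the source sent it. Any conversion happens when a user samples the data, which loosely mirrors Dixon's contrast between the natural body of water and the bottled water of a datamart.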

FIGURE 15.3 shows the components of a data lake. Unlike a data warehouse, ...

Get Databases Illuminated, 4th Edition now with the O’Reilly learning platform.