Chapter Four: Why Build a Data Lake?

As we explored in the previous section, teams see their sources of information proliferate over time. As business operations scale, so does the number of places data flows in from. Or the number of input channels may remain unchanged while the size of those streams increases dramatically. This alone is challenging. On top of that, each source may run on a different system or be run by a different organization. Each source could have its own domain‐specific dialect of SQL or API calls, its own default time zone, its own set of permissions and system for managing them, its own owners, and its own limitations on how its data can be queried or visualized. The siloed source model breaks down into a mess of disparate data.

A new abstraction layer is needed to make sense of this new reality. Sooner or later, even if a team increases analyst headcount, it becomes impossible to sort and maintain every source separately. The key insight is that there should be one place where all data collects, rather than sources scattered across different silos. There should also be a unified way (or ways) of requesting data from that single place.
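
To make the idea concrete, here is a minimal sketch in Python of what collecting sources in one place can look like. Everything in it is an illustrative assumption rather than a recommended design: the two sources (`BILLING_ROWS`, `SUPPORT_ROWS`), their field names, the UTC-5 offset, and the `events` table are all hypothetical, and an in-memory SQLite database stands in for the single data store. The point is only that once differing field names and time zones are reconciled on the way in, one query can answer a question that spans both sources.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Hypothetical source 1: a billing system that reports local time (assumed UTC-5).
BILLING_ROWS = [
    {"customer": "acme", "amount_usd": 120.0, "paid_at": "2024-03-01 09:00:00"},
]

# Hypothetical source 2: a support tool that uses different field names and UTC.
SUPPORT_ROWS = [
    {"account": "acme", "opened": "2024-03-01T15:30:00+00:00"},
]


def local_to_utc(naive_text, offset_hours):
    """Attach a fixed UTC offset to a naive timestamp and convert it to UTC."""
    local = datetime.strptime(naive_text, "%Y-%m-%d %H:%M:%S")
    local = local.replace(tzinfo=timezone(timedelta(hours=offset_hours)))
    return local.astimezone(timezone.utc).isoformat()


def build_store(conn):
    """Copy both sources into one table with shared column names and UTC times."""
    conn.execute(
        "CREATE TABLE events "
        "(source TEXT, customer TEXT, event TEXT, occurred_at_utc TEXT)"
    )
    for row in BILLING_ROWS:
        conn.execute(
            "INSERT INTO events VALUES (?, ?, ?, ?)",
            ("billing", row["customer"], "payment",
             local_to_utc(row["paid_at"], -5)),
        )
    for row in SUPPORT_ROWS:
        opened_utc = datetime.fromisoformat(row["opened"]).astimezone(timezone.utc)
        conn.execute(
            "INSERT INTO events VALUES (?, ?, ?, ?)",
            ("support", row["account"], "ticket_opened", opened_utc.isoformat()),
        )
    conn.commit()


def customer_timeline(conn, customer):
    """One query, one dialect, regardless of where each row originated."""
    cursor = conn.execute(
        "SELECT source, event, occurred_at_utc FROM events "
        "WHERE customer = ? ORDER BY occurred_at_utc",
        (customer,),
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
build_store(conn)
timeline = customer_timeline(conn, "acme")
for entry in timeline:
    print(entry)
```

In this toy version the analyst asks for a customer's timeline once, instead of querying the billing system and the support tool separately and reconciling time zones by hand.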

Figure 4.1 A data lake containing multiple data sources.

Just as we recommended replicating production database systems, we encourage replicating all sources into a single data store. This collection of ...