Chapter 49. Mind the Gap: Your Data Lake Provides No ACID Guarantees

Einat Orr

The modern data lake architecture is based on object storage as the lake, utilizing streaming and replication technologies to pour the data into the lake, and a rich ecosystem of applications that consume data directly from the lake, or use the lake as their deep storage. This architecture is cost-effective and allows high throughput when ingesting or consuming data.

So why is it still extremely challenging to work with data? Here are some reasons:

  • We’re missing isolation. The only way to ensure isolation is by using permissions or copying the data. Restricting access through permissions undermines our ability to maximize the data’s value, which depends on opening it to anyone who may benefit from it. Copying quickly becomes unmanageable, because you lose track of what is where in your lake.

  • We have no atomicity; in other words, we can’t rely on transactions to be performed safely. For example, there is no native way to guarantee that no one will start reading a collection before it has been completely written.

  • We can’t ensure cross-collection consistency (and in some cases, consistency even for a single collection). Denormalizing data in a data lake is common, often for performance reasons. In such cases, we may write the same data in two formats, or index it differently, to optimize for two different applications or two different ...
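
To make the atomicity gap from the list above concrete, here is a minimal sketch in Python. It rests on several assumptions that are not from the chapter: a local temporary directory stands in for an object-store prefix, a collection is modeled as a handful of `part-*` objects, and the names `write_collection`, `read_collection`, and `demo` are hypothetical. The writer is a generator so that a read can be interleaved deterministically between two writes, which is what a concurrent consumer could do on a real lake.

```python
import tempfile
from pathlib import Path


def write_collection(root: Path, parts: dict[str, str]):
    """Write each part as its own object, yielding after every write.

    Yielding lets the caller slip a read in between two writes, standing in
    for a consumer that runs concurrently with the writer.
    """
    root.mkdir(parents=True, exist_ok=True)
    for name, body in parts.items():
        (root / name).write_text(body)
        yield name


def read_collection(root: Path) -> dict[str, str]:
    """Return whatever objects are visible under the prefix right now."""
    return {p.name: p.read_text() for p in sorted(root.glob("part-*"))}


def demo() -> tuple[int, int]:
    """Return (objects seen mid-write, objects seen after the write)."""
    parts = {f"part-{i:04d}": f"rows for partition {i}\n" for i in range(3)}
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "events"  # hypothetical collection prefix
        writer = write_collection(root, parts)
        next(writer)  # only the first object has landed so far
        seen_mid_write = len(read_collection(root))
        for _ in writer:  # let the writer finish the remaining objects
            pass
        seen_after = len(read_collection(root))
    return seen_mid_write, seen_after


if __name__ == "__main__":
    mid, done = demo()
    print(f"reader saw {mid} of 3 objects mid-write, {done} after the write")
```

Nothing in the storage layer tells the reader that its first listing was incomplete: a partial collection looks exactly like a small, complete one.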
