Chapter 9. Building Reliable Data Lakes with Apache Spark

In the previous chapters, you learned how to easily and effectively use Apache Spark to build scalable and performant data processing pipelines. In practice, however, expressing the processing logic solves only half of the end-to-end problem of building a pipeline. For a data engineer, data scientist, or data analyst, the ultimate goal of building pipelines is to query the processed data and get insights from it. The choice of storage solution determines the end-to-end (i.e., from raw data to insights) robustness and performance of the data pipeline.
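For example, here is a minimal sketch of the tail end of such a pipeline, where processed data is written to storage and then queried back for insights. The dataset paths and column names are hypothetical:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("PipelineToInsights").getOrCreate()

# The processing half of the problem: clean and aggregate the raw data.
processed = (spark.read.parquet("/data/raw_sales")   # hypothetical raw dataset
    .where(F.col("amount") > 0)
    .groupBy("region")
    .agg(F.sum("amount").alias("total_sales")))

# The storage half: the format and layout chosen here determine how
# robustly and quickly the results can be queried later.
processed.write.mode("overwrite").parquet("/data/sales_by_region")

# The ultimate goal: query the processed data and get insights from it.
spark.read.parquet("/data/sales_by_region").orderBy(F.desc("total_sales")).show()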

In this chapter, we will first discuss the key features to look for in a storage solution. Then we will discuss two broad classes of storage solutions, databases and data lakes, and how to use Apache Spark with them. Finally, we will introduce the next wave of storage solutions, called lakehouses, and explore some of the new open source processing engines in this space.

The Importance of an Optimal Storage Solution

Here are some of the properties that are desired in a storage solution:

Scalability and performance

The storage solution should be able to scale to the volume of data and provide the read/write throughput and latency that the workload requires.

Transaction support

Complex workloads often read and write data concurrently, so support for ACID transactions is essential to ensure the quality of the end results; a sketch of such a transactional write appears after this list.

Support for diverse data ...
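To make the transaction-support property concrete, here is a minimal sketch of an ACID write using Delta Lake, one of the open source systems in the lakehouse space. It assumes the delta-spark package is on the Spark classpath, and the paths are hypothetical:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("TransactionalWriteSketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate())

# Each append below is an atomic transaction: concurrent readers of the
# table see either all of the new rows or none of them, never a partial write.
(spark.read.json("/data/raw_events")          # hypothetical raw input
      .write.format("delta")
      .mode("append")
      .save("/data/events_delta"))            # hypothetical Delta table path

Running the same append from two jobs at once would serialize through Delta Lake's transaction log rather than corrupting the table.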
