Chapter 10. Spark with Big Data

As we mentioned in Chapter 8, Spark SQL, the big data compute stack doesn't work in isolation. Integration points across multiple stacks and technologies are essential. In this chapter, we will look at how Spark works with some of the big data technologies that are part of the Hadoop ecosystem. We will cover the following topics in this chapter:

  • Parquet: an efficient columnar storage format
  • HBase: the database of the Hadoop ecosystem

Parquet - an efficient and interoperable big data format

We explored the Parquet format in Chapter 7, Spark 2.0 Concepts. To recap, Parquet is an interoperable columnar storage format whose main goals are space efficiency and query efficiency. Parquet traces its origins to Google's ...
