Chapter 4. Spark SQL

Spark SQL is a Spark module for processing structured data. This chapter is divided into the following recipes:

  • Understanding the Catalyst optimizer
  • Creating HiveContext
  • Inferring schema using case classes
  • Programmatically specifying the schema
  • Loading and saving data using the Parquet format
  • Loading and saving data using the JSON format
  • Loading and saving data from relational databases
  • Loading and saving data from an arbitrary source


Spark can process data from various data sources, such as HDFS, Cassandra, HBase, and relational databases. Big data frameworks (unlike relational database systems) do not enforce a schema at write time. HDFS is a perfect example where any arbitrary file is welcome during the ...
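The idea of not enforcing a schema at write time can be illustrated with a minimal sketch. It uses plain Python rather than Spark, and a local temporary file stands in for HDFS. The record layout, the `schema` list, and the helper names `apply_schema` and `write_then_read` are all hypothetical and chosen only for illustration.

```python
import os
import tempfile

# Hypothetical records. The storage layer accepts every line as-is,
# including the malformed last one; nothing is validated on write.
raw_lines = ["1,Alice,34", "2,Bob,45", "not,a,valid-age"]

# Hypothetical schema, applied only when the data is read back.
schema = [("id", int), ("name", str), ("age", int)]


def apply_schema(line, schema):
    """Parse one comma-separated line; return None if it does not fit."""
    fields = line.split(",")
    if len(fields) != len(schema):
        return None
    try:
        return {name: cast(value) for (name, cast), value in zip(schema, fields)}
    except ValueError:
        return None


def write_then_read(lines, schema):
    """Write lines unchecked, then impose the schema while reading."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "people.txt")
        with open(path, "w") as f:
            f.write("\n".join(lines))  # no schema check here
        with open(path) as f:
            parsed = [apply_schema(line.rstrip("\n"), schema) for line in f]
    # Keep only the rows that matched the schema at read time.
    return [row for row in parsed if row is not None]


rows = write_then_read(raw_lines, schema)
print(rows)
```

In this sketch the malformed line is stored without complaint and is only rejected when the schema is applied on read, which is the opposite of how a relational database would treat it on insert.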
