Chapter 4. Spark SQL

Spark SQL is a Spark module for processing structured data. This chapter is divided into the following recipes:

Understanding the Catalyst optimizer
Creating HiveContext
Inferring schema using case classes
Programmatically specifying the schema
Loading and saving data using the Parquet format
Loading and saving data using the JSON format
Loading and saving data from relational databases
Loading and saving data from an arbitrary source

Introduction

Spark can process data from various data sources such as HDFS, Cassandra, HBase, and relational databases. Big data frameworks (unlike relational database systems) do not enforce a schema at write time. HDFS is a perfect example where any arbitrary file is welcome during the ...

Spark Cookbook by Rishi Yadav
