Learning Spark

by Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia
February 2015
Intermediate to advanced
276 pages
7h 18m
English
O'Reilly Media, Inc.

Chapter 9. Spark SQL

This chapter introduces Spark SQL, Spark’s interface for working with structured and semistructured data. Structured data is any data that has a schema—that is, a known set of fields for each record. When you have this type of data, Spark SQL makes it both easier and more efficient to load and query. In particular, Spark SQL provides three main capabilities (illustrated in Figure 9-1):

  1. It provides a DataFrame abstraction in Python, Java, and Scala that simplifies working with structured datasets. DataFrames are similar to tables in a relational database.

  2. It can read and write data in a variety of structured formats (e.g., JSON, Hive tables, and Parquet).

  3. It lets you query the data using SQL, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC), such as business intelligence tools like Tableau.

Under the hood, Spark SQL is based on an extension of the RDD model called a DataFrame. A DataFrame contains an RDD of Row objects, each representing a record. A DataFrame also knows the schema (i.e., data fields) of its rows. DataFrames store data in a more efficient manner than native RDDs, taking advantage of their schema. In addition, they provide new operations not available on RDDs, such as the ability to run SQL queries. DataFrames can be created from external data sources, from the results of queries, or from regular RDDs.
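To make the "rows plus schema" idea concrete, here is a minimal pure-Python sketch. It is not Spark's API: `ToyDataFrame` and its `where`, `select`, and `collect` methods are hypothetical names chosen for illustration, and a plain list stands in for the RDD of Row objects. It only shows how knowing the field names lets operations refer to data by name instead of by position.

```python
from collections import namedtuple


class ToyDataFrame:
    """Toy stand-in for a DataFrame: a collection of row objects plus a schema.

    Assumption: this is an illustration only, not Spark's implementation.
    """

    def __init__(self, schema, records):
        self.schema = list(schema)                    # the known field names
        self.Row = namedtuple("Row", self.schema)     # one Row type per schema
        self.rows = [self.Row(*r) for r in records]   # stands in for the RDD of Rows

    def where(self, predicate):
        """Keep only the rows for which predicate(row) is true."""
        kept = [r for r in self.rows if predicate(r)]
        return ToyDataFrame(self.schema, kept)

    def select(self, *fields):
        """Project onto the named fields; the schema lets us validate names."""
        missing = [f for f in fields if f not in self.schema]
        if missing:
            raise KeyError(f"unknown fields: {missing}")
        projected = [[getattr(r, f) for f in fields] for r in self.rows]
        return ToyDataFrame(fields, projected)

    def collect(self):
        """Return the rows as plain tuples."""
        return [tuple(r) for r in self.rows]


# Build a toy DataFrame from ordinary records, then query it by field name.
people = ToyDataFrame(["name", "age"], [("Ana", 34), ("Bo", 19), ("Cy", 52)])
over_30 = people.where(lambda row: row.age > 30).select("name")
print(over_30.schema)
print(over_30.collect())
```

Each operation returns a new `ToyDataFrame` that carries its own schema, which loosely mirrors how the result of a query is itself something you can query again. A real DataFrame additionally uses the schema to store and process data more efficiently, which this sketch does not attempt.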

Tip

DataFrames are an evolution of SchemaRDDs available ...

ISBN: 9781449359034