In this chapter, we look at the Apache Spark SQL API. The SQL API lets us write queries that conform to a subset of ANSI SQL:2003, a revision of the standard for the SQL database query language. With the SQL API, we can keep our data in files, typically in a data lake, and write SQL queries that access that data.
Before Apache Spark, Facebook created Apache Hive as a way to run SQL queries over data stored in Hadoop, specifically in the Hadoop Distributed File System (HDFS). Apache Hive is made up of a “metastore” that is a set of metadata about files that allows developers ...