© Ed Elliott 2021
E. Elliott, Introducing .NET for Apache Spark, https://doi.org/10.1007/978-1-4842-6992-3_6

6. Spark SQL and Hive Tables

Ed Elliott, Sussex, UK

In this chapter, we are going to look at the Apache Spark SQL API. The SQL API allows us to write queries that conform to a subset of ANSI SQL:2003, a revision of the standard for the SQL database query language. With the SQL API, we can store our data in files, probably in a data lake, and write SQL queries that access that data.
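As a rough illustration of querying files with SQL, the sketch below reads a Parquet file into a DataFrame, registers it as a temporary view, and runs a SQL query against that view. It is a minimal sketch rather than a definitive implementation: the file path, the view name `people`, and the `name` and `age` columns are all assumptions made for this example, and the program expects to be launched through `spark-submit` with the Microsoft.Spark package available.

```csharp
using Microsoft.Spark.Sql;

namespace SparkSqlSketch
{
    internal static class Program
    {
        private static void Main(string[] args)
        {
            // Hypothetical path: pass a real Parquet file or folder as the first argument.
            string path = args.Length > 0 ? args[0] : "/tmp/example/people.parquet";

            SparkSession spark = SparkSession
                .Builder()
                .AppName("spark-sql-sketch")
                .GetOrCreate();

            // Read the files into a DataFrame and give them a name that SQL can reference.
            DataFrame people = spark.Read().Parquet(path);
            people.CreateOrReplaceTempView("people");

            // Assumed columns: name and age. Adjust the query to match the real schema.
            DataFrame adults = spark.Sql(
                "SELECT name, age FROM people WHERE age >= 18 ORDER BY age");
            adults.Show();

            spark.Stop();
        }
    }
}
```

A temporary view only lasts for the lifetime of the `SparkSession`, which keeps the sketch free of side effects; nothing is written to a metastore.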

Before Apache Spark, Facebook created Apache Hive as a way to run SQL queries over data stored in Hadoop or even the Hadoop Distributed File System (HDFS). Apache Hive is made up of a “metastore,” which is a set of metadata about files that allows developers ...
