September 2018
Intermediate to advanced
398 pages
9h 43m
English
The answer is yes, and the module is called Spark SQL. Spark SQL sits on top of Spark Core and lets you manipulate structured data. Unlike the basic RDD API, the DataFrame API gives the Spark engine more information about the structure of the data and the operations applied to it. Using this information, the engine can change the execution plan and optimize it.
You can also use the module to execute SQL queries as you would with a relational database. This makes it easy for people comfortable with SQL to run queries on heterogeneous data sources. You can, for instance, join a table loaded from a CSV file with one stored as Parquet in a Hadoop filesystem and a third coming from a relational database.