Spark SQL
Spark SQL is a component built on top of Spark Core. It introduced a data abstraction originally called SchemaRDD (renamed DataFrame in Spark 1.3), which supports structured and semi-structured data. Spark SQL provides functions for manipulating large sets of distributed, structured data using the subset of SQL that Spark supports, as well as HiveQL. Through DataFrames and Datasets, Spark SQL simplifies the handling of structured data and does so with much better performance as part of the Tungsten initiative. Spark SQL also supports reading and writing data to and from various structured formats and data sources, including files (Parquet, ORC), relational databases, Hive, HDFS, and S3. Spark SQL provides a query optimization framework called Catalyst to optimize all operations to boost ...