Exploring data using Spark SQL
Spark SQL is a relational query engine built on top of Spark Core. Spark SQL uses a query optimizer called Catalyst.
Relational queries can be expressed using SQL or HiveQL and executed against JSON, CSV, and various databases. Spark SQL gives us the full expressiveness of declarative programming with Spark dataframes on top of functional programming with RDDs.
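To make the contrast between the two styles concrete, here is a minimal sketch that runs without a Spark installation. It is not Spark code: Python's built-in filter, map, and reduce stand in for RDD transformations, and the standard sqlite3 module stands in for a SQL engine such as Spark SQL. The sample rows and the column names (author, lang, retweets) are hypothetical, chosen only for illustration.

```python
import sqlite3
from functools import reduce

# Hypothetical sample rows: (author, lang, retweets)
tweets = [
    ("alice", "en", 12),
    ("bob", "fr", 3),
    ("carol", "en", 7),
    ("dave", "en", 0),
]

# Functional style (how we reason with RDDs): spell out each step
# as a chain of filter, map, and reduce.
english = filter(lambda t: t[1] == "en", tweets)
retweet_counts = map(lambda t: t[2], english)
functional_total = reduce(lambda a, b: a + b, retweet_counts, 0)

# Declarative style (how we reason with SQL): describe the result
# we want and let the engine decide how to compute it.
# sqlite3 is only a stand-in here, not Spark SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (author TEXT, lang TEXT, retweets INTEGER)")
conn.executemany("INSERT INTO tweets VALUES (?, ?, ?)", tweets)
(declarative_total,) = conn.execute(
    "SELECT SUM(retweets) FROM tweets WHERE lang = 'en'"
).fetchone()
conn.close()

print(functional_total, declarative_total)
```

Both approaches compute the same total. In the declarative version the query states only what is wanted, which is what leaves room for an optimizer such as Catalyst to choose the execution plan.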
Understanding Spark dataframes
Here's a tweet from @bigdata announcing Spark 1.3.0, the advent of Spark SQL and dataframes. It also highlights the various data sources in the lower part of the diagram. In the top part, we can see R as the new language that will be gradually supported alongside Scala, Java, and Python. Ultimately, the Data Frame philosophy is ...