Performing Advanced Analytics on Relational Data with Spark SQL
Date: This event took place live on April 29 2014
Presented by: Michael Armbrust
Duration: Approximately 60 minutes.
Questions? Please send email to
Hosted By: Ben Lorica
In this webcast, we'll examine Spark SQL, a new Alpha component that is part of the Apache Spark 1.0 release. Spark SQL lets developers natively query data stored in both existing RDDs and external sources such as Apache Hive. A key feature of Spark SQL is the ability to blur the lines between relational tables and RDDs, making it easy for developers to intermix SQL commands that query external data with complex analytics. In addition to Spark SQL, we'll explore the Catalyst optimizer framework, which allows Spark SQL to automatically rewrite query plans to execute more efficiently.
About Michael Armbrust
Michael Armbrust is the lead developer of the Spark SQL project at Databricks. He received his PhD from UC Berkeley in 2013, and was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage, and query optimization.
About Ben Lorica
Ben Lorica is the Chief Data Scientist and Director of Content Strategy for Data at O'Reilly Media, Inc.. He has applied Business Intelligence, Data Mining, Machine Learning and Statistical Analysis in a variety of settings including Direct Marketing, Consumer and Market Research, Targeted Advertising, Text Mining, and Financial Engineering. His background includes stints with an investment management company, internet startups, and financial services. He is an advisor to Databricks.