Carl Steinbach

The Best of Strata Santa Clara 2013: SQL on Hadoop

Defining the New Generation of Analytic Databases

Date: This event took place live on October 2, 2013

Presented by: Carl Steinbach

Duration: Approximately 60 minutes.

Cost: Free

Questions? Please send email to


Join us for an exclusive presentation by Carl Steinbach, recorded live at his Strata Santa Clara 2013 talk.

The analytics and data warehousing industries are in the midst of a major period of transformation and upheaval. Since the publication nearly a decade ago of Google's seminal MapReduce and GFS papers, we have witnessed the appearance of Apache Hadoop, followed closely by the arrival of batch-oriented SQL systems like Apache Hive, and the scramble by established SQL vendors to implement Hadoop connectors.

This talk addresses the recent emergence of a new generation of analytic databases inspired by Google Dremel. These databases have been designed with the goal of running real-time SQL natively on Hadoop in a manner that fully exploits the flexibility and performance of the underlying platform. Characterized by features including schema-on-read, support for semi-structured data, and pluggable storage engines, and defined by systems like Citus Data's CitusDB and Cloudera's Impala, these new systems share important architectural details that distinguish them from the previous generation of analytic databases.

In this talk we will discuss the unavoidable cost and performance limitations of the connector-based approach employed by many established vendors, and explain the long-term significance of Apache Hive's data model along with its influence on next-generation SQL-on-Hadoop databases. We will then unravel the novel architectural features common to next-generation analytic database systems like CitusDB and Impala that make real-time SQL-on-Hadoop feasible. Finally, we will conclude by reviewing several important database lessons learned over the previous decades that remain relevant today.

About Carl Steinbach

Carl Steinbach is a software engineer at Citus Data, as well as a committer and PMC member on the Apache Hive project. Previously, Carl worked at Cloudera, where he led the Hive team; at NetApp, where he developed storage encryption products; and at Oracle, where he was a member of the Server Technologies group. Carl holds B.S. and M.Eng. degrees in Computer Science from MIT.
