Shark: Real-time queries and analytics for big data
Shark is 100X faster than Hive for SQL, and 100X faster than Hadoop for machine-learning
Hadoop’s strength is in batch processing, MapReduce isn’t particularly suited for interactive/adhoc queries. Real-time1 SQL queries (on Hadoop data) are usually performed using custom connectors to MPP databases. In practice this means having connectors between separate Hadoop and database clusters. Over the last few months a number of systems that provide fast SQL access within Hadoop clusters have garnered attention. Connectors between Hadoop and fast MPP database clusters are not going away, but there is growing interest in moving many interactive SQL tasks into systems that coexist on the same cluster with Hadoop.
Having a Hadoop cluster support fast/interactive SQL queries dates back a few years to HadoopDB, an open source project out of Yale. The creators of HadoopDB have since started a commercial software company (Hadapt) to build a system that unites Hadoop/MapReduce and SQL. In Hadapt, a (Postgres) database is placed in nodes of a Hadoop cluster, resulting in a system2 that can use MapReduce, SQL, and search (Solr). Now on version 2.0, Hadapt is a fault-tolerant system that comes with analytic functions (HDK) that one can use via SQL.
The rest of this post covers two relatively new open source tools: Impala and Shark. Since its release at Strata NYC, the buzz generated by Cloudera’s Impala system highlights3 how much the big data community wants a real-time query system in Hadoop. There have been many good articles written about Impala since its release (see here & here), so I won’t go into its design details. I will highlight the impressive performance numbers put out by Cloudera.
For purely I/O bound queries, we typically see performance gains in the range of 3-4x. … For queries with at least one join, we have seen performance gains of 7-45X. … If the data accessed through the query comes out of the cache, the speedup will be more dramatic thanks to Impala’s superior efficiency. In those scenarios, we have seen speedups of 20x-90x over Hive even on simple aggregation queries.
Shark is a component of Spark, an open source, distributed and fault-tolerant, in-memory analytics system, that can be installed on the same cluster as Hadoop. In particular, Shark is fully compatible with Hive and supports HiveQL, Hive data formats, and user-defined functions. In addition Shark can be used to query data4 in HDFS, HBase, and Amazon S3.
The creators of Shark just released a paper where they systematically compare its performance against Hive, Hadoop, and MPP databases. They found Shark to be much faster than Hive on a wide variety of queries: roughly speaking Shark on disk is 5-10X faster, while Shark in-memory is 100X faster. Significantly, Shark’s performance gains are comparable to those observed in MPP databases!
At this stage users have at least two working open source systems for fast/interactive SQL in Hadoop. While Impala has attracted more attention, the Shark team has quietly put together a highly-scalable system that has compelling features5, including data co-partitioning, fault-tolerance, and integration of machine-learning into an analyst’s workflow.
In-memory column store and column compression
The best performance gains in using Impala are achieved by using the Trevni columnar storage format. In the case of Shark, their custom columnar store and compression reduced storage and query time by about 5X.
Control over data partitioning => Fast, distributed JOINS
Shark lets users partition tables using a specified key. In particular if tables will be “joined” frequently, then one can partition them using a common (“join”) key. Co-partitioning is a trick used by many MPP databases to speed up “joins” involving massive tables.
Shark recovers6 gracefully from node failures, and continues executing queries after it reconstructs lost (data) partitions. Initial tests on large data sets indicate recovery has a small performance impact (and is much faster than re-executing a query).
Shark has implemented a simple optimizer (partial DAG execution or PDE) that uses data statistics (heavy hitters, approximate histograms) to dynamically alter query plans when needed. As an example Shark’s PDE system uses data statistics to perform run-time optimizations for “joins”.
RDD’s are distributed objects that can be cached in-memory, across a cluster of compute nodes. They are the fundamental data objects used in Spark. Users can create RDD’s (using the sql2rdd command) and apply machine-learning functions to them, all from within Shark. Currently machine-learning and analytic functions can be written in Scala and Java, with support for Python coming soon. Not only do users get the benefit of performing simple SQL queries and complex computations from within the same7 framework, Shark is 100X faster than Hadoop:
Integration with BI tools
Impala integrates with Tableau and Qlikview. There are Shark users who have used it with tools like Tableau, but BI integration is a relatively “unexplored” area within Shark.
Impala and Shark are interactive SQL systems for Hadoop. A new paper shows Shark offers speedups that are comparable to those observed in MPP databases. In addition to being 100X faster than Hive for SQL, Shark is a framework that is 100X faster than Hadoop for (iterative) machine-learning algorithms.
- Seven reasons why I like Spark
- Spark 0.6 improves performance and accessibility
- HadoopDB: An Open Source Parallel Database
(1) In the Hadoop world real-time, means rapid response times necessary for interactive query-response tasks.
(2) SF startup DrawnToScale has a competing product, built on top of HBase, Hadoop, and Lucene.
(3) There was a lot of chatter about Impala at Strata NYC, and the talk given by its creators was packed!
(4) Shark supports many file formats including text, binary, sequence files, JSON, and XML.
(6) Impala is not fault-tolerant: in the event a node fails, queries must be restarted.
(7) One of the things that makes SAS software popular is this type of tight integration between data wrangling (via PROC SQL or the data step) and analysis (PROC GLM, LOGISTIC, …)