Chapter 3. Overview of Apache Drill
Apache Drill is a distributed, schema-on-read query engine loosely associated with the Hadoop ecosystem. This chapter unpacks that statement so that you understand what Drill is and how it works before we dive into the details of using, deploying, and extending Drill.
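To make "schema-on-read" concrete, here is a minimal sketch of the kind of query Drill runs directly against a raw data file through its dfs storage plugin; the file path and field names are hypothetical. The point is that Drill infers the structure of the data as it reads it, so no table definition or load step has to happen first.

-- A schema-on-read sketch; the path and field names are hypothetical.
-- Drill discovers the fields and their types from the JSON records at
-- read time; there is no CREATE TABLE or schema registration beforehand.
SELECT name, price
FROM dfs.`/data/donuts.json`
LIMIT 5;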
The Apache Hadoop Ecosystem
Many excellent books exist to describe Apache Hadoop and its components. Here we expand on the introduction in Chapter 1 with the key concepts needed to understand Drill.
Hadoop consists of the Hadoop Distributed File System (HDFS), the MapReduce compute engine, and the YARN resource manager. Drill is best thought of as an alternative to MapReduce for processing data stored in HDFS.
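As a rough sketch of what "alternative to MapReduce" means in practice, the hypothetical query below summarizes a CSV file that already sits in HDFS. Drill reads the file in place and returns results interactively, whereas MapReduce would require writing and submitting a custom job. The storage-plugin name (hdfs) and the file path are assumptions for illustration; Drill exposes header-less delimited files as a single array column named columns.

-- Hypothetical example: summarize a CSV file stored in HDFS, in place.
-- "hdfs" is an assumed storage-plugin name pointed at the cluster's HDFS.
SELECT columns[0] AS region, COUNT(*) AS orders
FROM hdfs.`/warehouse/orders.csv`
GROUP BY columns[0];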
The extended Hadoop ecosystem (sometimes called “Hadoop and Friends”) includes a wide variety of tools. For our purposes, these include the following:
- Alternative storage engines (e.g., MapR-FS and Amazon S3)
- Database-like storage engines (e.g., HBase and MapR-DB)
- Compute engines (e.g., MapReduce and Apache Spark)
- Query engines (e.g., Apache Hive, Drill, Presto, Apache Impala, and Spark SQL)
- Coordination tools (e.g., Apache ZooKeeper and etcd); Drill itself relies on ZooKeeper, as the configuration sketch after this list shows
- Cluster coordinators (e.g., manual deployment, YARN, Mesos, and Docker/Kubernetes)
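Drill's own reliance on ZooKeeper illustrates the coordination category above: each Drillbit registers itself in ZooKeeper so the cluster can discover its members and route queries. A minimal drill-override.conf sketch follows; the cluster ID and the ZooKeeper hostnames are placeholders.

# drill-override.conf (sketch): the cluster ID and ZooKeeper hosts are placeholders.
drill.exec: {
  cluster-id: "drillbits1",
  zk.connect: "zk-host1:2181,zk-host2:2181,zk-host3:2181"
}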
This list does not begin to touch on the many tools for coordinating workflows (e.g., Oozie and Airflow), data ingest (e.g., Sqoop and Kafka), or many other purposes.
Drill Is a Low-Latency Query Engine
With so many options, you might wonder: where does Drill fit into this ecosystem?