Chapter 6. Connecting Drill to Data Sources

In previous chapters you learned how to query individual files, but Apache Drill’s real power emerges when you connect it to multiple data sources. You have already seen how Drill natively queries data in a file-based system; it can also natively query the following data sources:

  • Cloud storage (Amazon Simple Storage Service/Microsoft Azure/Google Cloud Platform)

  • Hadoop

  • HBase

  • Hive

  • Kafka

  • Kudu

  • MapR

  • MongoDB

  • Open Time Series Database (OpenTSDB)

Additionally, Drill can query any system that provides a JDBC driver. In this chapter, you’ll learn how to configure Drill to access and query all these different data sources. Drill reaches each data source through an extension known as a storage plug-in, which you must enable and configure before you can query that source. This chapter assumes that you have a basic familiarity with the various data sources mentioned.

Querying Multiple Data Sources

Up to this point, you have seen only queries that use the dfs (distributed filesystem) storage plug-in. To query a different data source, you must configure a storage plug-in for that data source and then name it in the query. As you may recall from Chapter 3, the FROM clause in a Drill query is structured as follows:

FROM storage_plugin.workspace.table
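
To make the three parts concrete, here is a minimal sketch of a query against the familiar dfs plug-in. It assumes a workspace named tmp is defined in the dfs configuration and that a file called orders.csvh exists there; the file name is a placeholder chosen for illustration, not something from this chapter.

```sql
-- Illustrative only: orders.csvh is a hypothetical file name.
--   dfs           -> the storage plug-in
--   tmp           -> a workspace assumed to be defined in that plug-in
--   `orders.csvh` -> the "table" (here, a file); the backticks are needed
--                    because the name contains a dot
SELECT *
FROM dfs.tmp.`orders.csvh`
LIMIT 10;
```

Querying another data source follows the same pattern: only the plug-in name, and whatever that source treats as a workspace and table, change.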

As an example, let’s say you want to query a Hive cluster and you have created and configured the storage plug-in and given it the name hive. You could query ...
