Chapter 5. Analyzing Complex and Nested Data

As Chapter 4 demonstrated, Apache Drill is a powerful tool for analyzing data contained in delimited files. In this chapter, you will learn how to apply that power to complex and nested datasets and formats such as JavaScript Object Notation (JSON) and Parquet. Data in NoSQL stores such as MongoDB often has nested structures that make it difficult to query in a traditional SQL context. These formats often require specialized tools to analyze, but with Drill you can query them just as you would any other dataset, albeit with some additional complexities. Before you dive into these datasets, however, you must understand how Drill deals with complex data objects.

A Word About Parquet Format

Parquet is a self-describing, compressed columnar format that supports nested data. Many big data systems, including Hadoop, Hive, and Spark, support reading and writing Parquet files. Drill performs best when reading Parquet files, so if you plan to query large, complex datasets, we recommend converting the data to Parquet format first.

Arrays and Maps

In Chapter 4 you learned about all the different data types that exist in Drill, such as INTEGER, DOUBLE, and VARCHAR. These data types are common in most databases and programming languages, but unlike most databases, Drill also features two complex data types, array and map, that you’ll need to understand in order to analyze complex datasets. Both of these ...
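
To make the distinction between the two complex types concrete, the following minimal sketch uses plain Python rather than Drill. It assumes a hypothetical JSON record (the field names name, phones, and address are invented for illustration), and Python's list and dict stand in only as rough analogies for an array and a map.

```python
import json

# Hypothetical record; the field names are invented for illustration only.
raw = """
{
  "name": "Alice",
  "phones": ["555-0100", "555-0199"],
  "address": {"city": "Baltimore", "state": "MD"}
}
"""

record = json.loads(raw)

# "phones" behaves like an array: an ordered collection addressed by position.
first_phone = record["phones"][0]

# "address" behaves like a map: key-value pairs addressed by key.
city = record["address"]["city"]

print(first_phone)
print(city)
```

A flat, delimited file has no direct equivalent of either field: each cell holds a single value, whereas here one field holds a sequence of values and another holds named subfields.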
