Chapter 4. Querying Delimited Data

In this chapter we get started with Apache Drill by querying simple data. We are defining “simple data” as data contained in a delimited file such as a spreadsheet or comma-separated values (CSV) file, from a single source. If you have worked with SQL databases, you will find that querying simple data with Drill is not much different than querying data from a relational database. However, there are some differences that are important to understand in order to unleash the full power of Drill. We assume that you are familiar with basic SQL syntax, but if you are not, we recommend SQL Queries for Mere Mortals (Addison-Wesley) by John Viescas as a good SQL primer.

To follow along, start Drill in embedded mode as explained in Chapter 2 and download the example files, which are available in the GitHub repository for this book.

Ways of Querying Data with Drill

Drill is very flexible and provides you with several different ways of querying data. You’ve already seen that Drill has a command-line interface known as SQLLine. In addition to the command line, Drill has a web interface that you can access by opening a browser (after Drill is running) and navigating to http://localhost:8047, as shown in Figure 4-1.

Figure 4-1. The Drill Web Console

Other Interfaces

You can also query data with Drill by using Drill’s RESTful interface or by using a conventional tool such as Tableau and using Drill’s ODBC or JDBC interfaces. We cover these interfaces later in the book and demonstrate how to query Drill using your favorite scripting languages, such as Python or R.

Lastly, if you install the MapR ODBC driver for Drill, bundled with that is a tool called Drill Explorer, which is another GUI for querying Drill.

For the beginning examples, we recommend that you start out by using the Web Console because there are certain features that are most easily accessed through it.

Drill SQL Query Format

Drill uses the ANSI SQL standard with some extensions around database table and schema naming. So, if you are comfortable with relational databases most of Drill’s SQL syntax will look very familiar to you. Let’s begin with a basic query. Every SQL SELECT query essentially follows this format:

SELECT list_of_fields
FROM one_or_more_tables
WHERE some_logical_condition
GROUP BY some_field

As in standard SQL, the SELECT and FROM clauses are required, the WHERE and GROUP BY clauses are optional. Again, this should look relatively familiar to you, and this chapter will focus on Drill’s unique features in SQL.

Choosing a Data Source

In most ways, Drill behaves like a traditional relational database, with the major difference being that Drill treats files (or directories) as tables. Drill can also connect to relational databases or other structured data stores, such as Hive or HBase, and join them with other data sources, but for now, let’s look at querying a simple CSV file in Drill:

SELECT first_name, last_name, street_address, age
FROM dfs.user_data.`users.csv`
WHERE age > 21

This queries a sample data file called users.csv and returns the first name, last name, and street address of every user whose age is greater than 21. If you are familiar with relational databases, the only part that might look unusual is this line:

FROM dfs.user_data.`users.csv`

In a traditional database, the FROM clause contains the names of tables, or optionally the names of additional databases and associated tables. However, Drill treats files as tables, so in this example, you can see how to instruct Drill which file to query.

Drill accesses data sources using storage plug-ins, and the kind of system your data is stored in will dictate which storage plug-in you use. In Drill, the FROM clause has three components that do slightly different things depending on what kind of storage plug-in you are using. The components are the storage plug-in name, the workspace name, and the table name. In the previous example, dfs is one of the default storage plug-ins; “dfs” stands for distributed filesystem and is used to access data from sources such as HDFS, cloud storage, and files stored on the local system.1 The next component is optional; it’s the workspace, which is a shortcut to a file path on your system. If you didn’t want to use a workspace, you could enter the following:

FROM dfs.`full_file_path/users.csv`

Finally, there is the table, which this storage plug-in corresponds to a file or directory. In addition to the dfs plug-in, Drill ships with the cp or classpath plug-in enabled, which queries files stored in one of the directories in your Java classpath, typically used to access the example files included with Drill.

As illustrated in Figure 4-2, the Storage Configuration panel is where you can enable, disable, and configure the various storage plug-ins. You should resist the temptation to enable all the disabled plug-ins because you will get bad performance and strange error messages if you try to use Drill with an improperly configured storage plug-in enabled.

Figure 4-2. Drill storage plug-in configuration

Defining a Workspace

In Drill parlance, when you are using file-based systems, a workspace is a shortcut to a file path. To create a workspace, open the Drill web interface and then, at the top of the screen, click the Storage tab.

Next, click the Update button next to the dfs plug-in (see Figure 4-2) and you will see the configuration variables. You will see a section called workspaces, and there you will see the default workspaces, root and tmp, as shown here:

"workspaces": {
    "root": {
      "location": "/",
      "writable": false,
     "defaultInputFormat": null
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
     "defaultInputFormat": null
    }
 },

To create your own workspace, simply add another entry into the workspaces section with a path to a location on your local filesystem:

"drill_tutorial": {
  "location": "your_path",
  "writable": true,
  "defaultInputFormat": null
}

You must have read permission to be able to read data in your workspace, and it goes without saying that if you want to write to this path, you must have write permissions. You can configure workspaces as being writable or not, as well as setting a default file format. Often you will want to take the results of complex queries and create tables or views, and the workspace must be writable in order to store a view or new table. After you’ve created your workspace, at the bottom of the screen, click the Update button.

You can verify that your workspace was successfully created by executing a SHOW DATABASES statement; you should see your newly created workspace in the results.

View Your Configuration by Using the DESCRIBE Statement

In addition to the SHOW DATABASES statement, you can also view the plug-in configuration by using the DESCRIBE SCHEMA statement. For instance, you can view the configuration used in a notional workspace called test by using the statement DESCRIBE SCHEMA dfs.test.

Specifying a Default Data Source

The examples thus far use file paths in the FROM clause. In practice, these clauses can become long, and it is tedious to type them over and over in an interactive session. Drill has the idea of a default data source, which is used if you don’t specify a data source (storage plug-in and workspace, also called a namespace) in your query. You use the USE statement to select the data source to use as the default, as shown here for dfs.data: 

USE dfs.data;
SELECT *
FROM `users.csv`
WHERE age > 20

So, the preceding query is functionally equivalent to the query that follows.

SELECT *
FROM dfs.data.`users.csv`
WHERE age > 20

SQL Sessions

Drill is a SQL tool. SQL uses the idea of a session that is tied to a Drill connection. Session state, such as the USE statement, applies only to that one session.

JDBC, ODBC, and most BI tools create sessions; however, Drill’s Web Console does not, and therefore the USE operator will not work in the Web Console. The Web Console uses Drill’s RESTful interface. When you execute queries using the RESTful interface, each query is executed in its own session and the USE statement unhelpfully lasts only for the single REST request in which it was submitted.

Saving Your Settings

By default, in embedded mode Drill saves all settings in the /tmp directory, which often does not persist if the machine is rebooted. To save your settings while running in embedded mode, you will need to modify the drill-override.conf file, which is located in the DRILL_HOME/conf/ folder, by adding the following:

sys.store.provider.local.path="path to directory"

to the section drill.exec and supplying a nontemporary path, as shown here:

drill.exec: {
    cluster-id: "drillbits1",
    zk.connect: "localhost:2181",
    sys.store.provider.local.path="path_to_save_file"
 }

It is a better practice to set up a single-node Drill cluster with ZooKeeper to save state correctly. When running Drill in server mode, Drill saves state in ZooKeeper so that it can be shared by multiple Drillbits and can be persisted across server restarts.

After you have specified a namespace with the USE operator, you can still access other namespaces, but you will need to explicitly list them, as shown here:

USE dfs.data
SELECT *
FROM `users.csv` AS u
INNER JOIN dfs.logs.`logfile.txt` AS l ON l.id=u.id

You can see in this example that the query joins data from users.csv with another file called logfile.txt, which is in a completely different namespace.

Accessing Columns in a Query

Relational databases are considered schema-on-write in that the schema must be defined prior to writing the data on disk. More recently, tools such as Apache Spark use a technique called schema-on-read, which allows more flexibility because it allows users to modify the schema as the data is being ingested rather than being tied to a rigid, predefined schema.

Drill takes a different approach and relies on the data’s natural structure for its queries. Suppose that we have a file called customer_data.csv that contains three columns of data: first_name, last_name, and birthday. Drill infers the schema directly from the structure of the data and, as such, you do not need to define any kind of schema prior to querying your data with Drill.

As an example, Drill will allow you to query this data directly by simply executing the following query:

SELECT *
FROM dfs.drill_tutorial.`customer_data.csv`

However, when you execute this query you will notice that Drill returns a single column for each row of data. This one column is called columns and is an array of VARCHAR as shown in the following table:

columns
["Robert","Hernandez","5/3/67"]
["Steve","Smith","8/4/84"]
["Anne","Raps","9/13/91"]
["Alice","Muller","4/15/75"]

Drill infers a schema directly from your data. In this case, we have a CSV file without column headings. Since Drill doesn’t know the name or number of fields, it instead returns the fields as an array. (We will see shortly how column headings change the results.) This represents a major difference between Drill and relational databases: Drill supports nested data structures such as arrays and maps, whereas conventional relational databases do not. In Drill, just as in many other programming languages, arrays are zero-indexed, meaning that the first item in an array is the 0th item. Individual array items are accessed by using notation similar to most programming languages:

arrayName[n]

Where n is the numeric index of the item that you want to access. Although CSV uses a single array, other data sources (such as JSON) can contain a mix of nested columns and flat columns.

Getting back to our delimited data file, if you want to access the individual columns within this array, you will need to use the columns array in your query. Prior to Drill 1.15, the name columns must be in lowercase. The query that follows demonstrates how to access these columns within a query and filter data using one of these fields:

SELECT columns[0] AS first_name,
columns[1] AS last_name,
columns[2] AS birthday
FROM dfs.drill_tutorial.`customer_data.csv`
WHERE columns[1] = 'Smith'

This example first creates simple columns for some of the fields by using the AS clause. In SQL terminology, doing so projects the array elements to new columns. This query would return the data shown in the following table:

first_name last_name birthday
Robert Hernandez 5/3/67
Steve Smith 8/4/84
Anne Raps 9/13/91
Alice Muller 4/15/75

Delimited Data with Column Headers

If your delimited file contains headers with the column names in the first row, you can configure Drill to read those names, as well. To do this, you will need to open the web interface and go back to the screen where you configured the storage plug-in. In this case, click the button to edit the dfs plug-in and then scroll down until you see the section labeled formats. You should see some JSON that looks like this:

"formats": {
    "psv": {
      "type": "text",
      "extensions": [
        "tbl"
      ],
      "delimiter": "|"
    },
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "delimiter": ","
    }, ... 
    ,
    "csvh": {
      "type": "text",
      "extensions": [
        "csvh"
      ],
      "extractHeader": true,
      "delimiter": ","
    }

Drill includes a number of format plug-ins that parse data files and map data to Drill fields. The "text" format is the generic delimited data plug-in and provides you with a number of options to configure how Drill will read your files. Drill lets you give names to different collections of options, such as the "psv", "csv", and other formats in the previous example (see Table 4-2).

Table 4-2. Delimited data options
Options Description
comment Character used to start a comment line (optional)
escape Character used to escape the quote character in a field
delimiter Character used to delimit fields
quote Character used to enclose fields
skipFirstLine Skips the first line of a file if set to true (can be true or false)
extractHeader Reads the header from the CSV file if set to true (can be true or false)

If you would like Drill to extract the headers from the file, simply add the line "extractHeader": true, to the section in the configuration that contains the file type that you are working on. Drill also ships configured with the .csvh file type configured to accept CSV files with headers. To use this, simply rename your CSV files .csvh. If you want to try this out for yourself, execute this query:

SELECT *
FROM dfs.drill_tutorial.`customer_data-headers.csvh`

This yields the results shown in the following table:

first_name last_name birthday
Robert Hernandez 5/3/67
Steve Smith 8/4/84
Anne Raps 9/13/91
Alice Muller 4/15/75

Table Functions

You should use the configuration options listed in Table 4-2 to set the configuration for a given storage plug-in or directory. However, you can also change these options in a query for a specific file, by using a Drill table function as seen here:

SELECT *
FROM table(dfs.drill_tutorial.`orders.csv`(
    type => 'text', 
    extractHeader => true, 
    fieldDelimiter => ',')
)

When you use a table function, the query does not inherit any default values defined in the storage plug-in configuration, and thus you need to specify values for all the mandatory options for the format. In the previous example, the fieldDelimiter field is required.

Querying Directories

In addition to querying single files, you also can use Drill to query directories. All of the files in the directory must have the same structure, and Drill effectively performs a union on the files. This strategy of placing data into nested folders is known as partitioning and is very commonly used in big data systems in lieu of indexes.

Querying a directory is fundamentally no different than querying a single file, but instead of pointing Drill to an individual file, you point it to a directory. Suppose that you have a directory called logs that contains subdirectories for each year, which finally contain the actual log files. You could query the entire collection by using the following query:

SELECT *
FROM dfs.`/var/logs`

If your directory contains subdirectories, you can access the subdirectory names by putting the dirN into the query. Thus, dir0 would be the first-level directory, dir1 the second, and so on. For instance, suppose that you have log files stored in directories nested in the following format:

logs
|__ year
    |__month
       |__day

You could access the directory structure by using the following query:

SELECT 
    dir0 AS `year`,
    dir1 AS `month`,
    dir2 AS `day`
FROM dfs.`/var/logs/`

Additionally, you can use wildcards in the directory name. For instance, the following query uses wildcards to restrict Drill to opening only CSV files:

SELECT *
FROM dfs.`/user/data/*.csv`

Data Partitioning Depends on the Use Cases

The directory structure in this example is from an actual example from a client; however, this is not a good way to structure partitions to work with Drill because it makes querying date ranges very difficult as we discuss in Chapter 8. One better option might be to partition the data by date in ISO format. Ultimately, it depends on what you are trying to do with your data.

Directory functions

If you are using the dir variables with the MAX() or MIN() functions in the WHERE clause in a SQL query, it is best to use the directory functions rather than simple comparisons because it allows Drill to avoid having to make costly (and slow) full-directory scans. Drill has four directory functions:

  • MAXDIR()

  • MINDIR()

  • IMAXDIR()

  • IMINDIR()

These functions return the minimum or maximum values in a directory; IMAXDIR() and IMINDIR() return the minimum or maximum in case-insensitive order.

Directory Functions Have Limitations

The Drill directory functions perform a string comparison on the directory names. If you are trying to find the most recent directories, you cannot use these functions unless the most recent occurs first or last when compared as a string.

Using the previous example structure, the following query would find the most recent year’s directory in the logs directory and return results from that directory:

SELECT *
FROM dfs.`/var/logs`
WHERE dir0 = MAXDIR('dfs/tmp', '/var/logs')

Understanding Drill Data Types

One of the challenges in using Drill is that although it in many ways acts like a relational database, there are key areas in which it does not, and understanding these is key to getting the most out of Drill. This section will cover the fundamentals of analyzing delimited data with Drill. The examples in this section use a sample data file, baltimore_salaries_2015.csvh, which is included in the GitHub repository.2

Although Drill can infer the structure of data from the files, when you are querying delimited data Drill cannot implicitly interpret the data type of each field, and because there are no predefined schemas, you will need to explicitly cast data into a particular data type in your queries to use it for other functions. For instance, the following query fails because Drill cannot infer that the AnnualSalary field is a numeric field:

SELECT AnnualSalary,
FROM dfs.drill_tutorial.`csv/baltimore_salaries_2015.csvh`
WHERE AnnualSalary > 35000

This query fails with a NumberFormatException because the AnnualSalary field is a string. You can check the column types by using the typeof(<column>) function as shown in the following query:

SELECT AnnualSalary, typeof(AnnualSalary) AS column_type
FROM dfs.drill_tutorial.`baltimore_salaries_2015.csvh`
LIMIT 1

This query produces the following result:

GrossPay column_type

$53,626.04

VARCHAR

To fix this problem, you need to explicitly instruct Drill to convert the AnnualSalary field to a numeric data type. To use a comparison operator, both sides of the equation will need to be numeric, and to convert the field to a numeric data type, you will need to use the CAST(field AS data_type) function.

Table 4-3 lists the simple data types that Drill supports.

Table 4-3. Data types in Drill
Drill data types Description
BIGINT 8-byte signed integer in the range –9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
BINARY Variable-length byte string
BOOLEAN true or false
DATE Date
DECIMAL(p,s) 38-digit precision number; precision is p and scale is s
DOUBLE 8-byte floating-point number
FLOAT 4-byte floating-point number
INTEGER or INT 4-byte signed integer in the range –2,147,483,648 to 2,147,483,647
TIME Time
TIMESTAMP JDBC timestamp in year, month, date hour, minute, second, and optional milliseconds
VARCHAR UTF8-encoded variable-length string; the maximum character limit is 16,777,216

We discuss the date/time functions as well as complex data types later in the book, but it is important to understand these basic data types to do even the simplest calculations.

Suppose you would like to calculate some summary statistics about the various job roles present in this city. You might begin with a query like this:

SELECT JobTitle, AVG(AnnualSalary) AS avg_pay
FROM dfs.drillbook.`baltimore_salaries_2016.csvh`
GROUP BY JobTitle
ORDER BY avg_pay DESC

However, in Drill, this query will fail for the same reason as the earlier query: the GrossPay is not necessarily a number and thus must be converted into a numeric data type. The corrected query would be as follows:

SELECT JobTitle, AVG(CAST(AnnualSalary AS FLOAT))  AS avg_pay
FROM dfs.drillbook.`baltimore_salaries_2016.csvh`
GROUP BY JobTitle
ORDER BY avg_pay DESC

However, this query will unfortunately also fail because there are extraneous characters such as the $ in the AnnualSalary field. To successfully execute this query, you can either use the TO_NUMBER() function, as explained in the next section, or use one of Drill’s string manipulation functions to remove the $. A working query would be:

SELECT JobTitle, AVG(CAST(LTRIM(AnnualSalary, '\$') AS FLOAT)) AS avg_pay
FROM dfs.drillbook.`baltimore_salaries_2016.csvh`
GROUP BY JobTitle
ORDER BY avg_pay DESC

A cleaner option might be to break this up using a subquery, as shown here:

SELECT JobTitle, AVG(annual_salary) AS avg_pay
FROM
( SELECT JobTitle, 
   CAST(LTRIM(annual_salary, '\$') AS FLOAT) AS annual_salary  
   FROM dfs.drillbook.`baltimore_salaries_2016.csvh`
)
GROUP BY JobTitle
ORDER BY avg_pay DESC

Cleaning and Preparing Data Using String Manipulation Functions

When querying data in Drill, often you will want to extract a data artifact from a source column. For instance, you might want to extract the country code from a phone number. Drill includes a series of string manipulation functions that you can use to prepare data for analysis. Table 4-4 contains a listing of all the string manipulation functions in Drill.

Table 4-4. Drill string manipulation functions
Function Return type Description

BYTE_SUBSTR(string, start_index )

BINARY or VARCHAR

Returns in binary format a substring of a string

CHAR_LENGTH(string)

INT

Returns the number of characters in a string

CONCAT(list of strings)

VARCHAR

Concatenates strings

ILIKE(string, regex)

BOOLEAN

Returns true if string matches the regular expression (case insensitive)

INITCAP(string)

VARCHAR

Returns a string with the first letter of each word capitalized

LENGTH(string)

INT

Returns the length of a string

LOWER(string)

VARCHAR

Converts a string to lowercase

LPAD(string, length, fill text)

VARCHAR

Pads a string to the specified length by prepending the fill or a space

LTRIM(string 1, string 2)

VARCHAR

Removes characters in string 2 from the beginning of string 1

POSITION(needle IN haystack)

INT

Returns the index of the needle string in the haystack string

REGEXP_REPLACE(source, pattern, replacement)

VARCHAR

Replaces text in the source that matches the pattern with the replacement text

RPAD(string, length[, fill text])

VARCHAR

Pads a string to the specified length by appending the fill or a space

RTRIM(string 1, string 2)

VARCHAR

Removes characters in string 2 from the end of string 1

SPLIT(string, character)

Array of VARCHAR

Splits a string on the character argument

STRPOS(haystack, needle)

INT

Returns the index of the needle string in the haystack string

SUBSTR(string, x, y)

VARCHAR

Extracts a portion of a string from a position 1–x an optional y

TRIM([leading | trailing | both] [string1] from string2)

VARCHAR

Removes characters in string 1 from string 2

UPPER(string)

VARCHAR

Converts a string to uppercase

A use case for some of these functions might be trying to determine which domains are most common among a series of email addresses. The available data just has the email addresses; to access the domain it will be necessary to split each one on the @ sign. Assuming that we have a field called email, the function SPLIT(email, '@') would split the column and return an array containing the accounts and domains. Therefore, if we wanted only the domains, we could add SPLIT(email, '@')[1] to the end, which would perform the split and return only the last half. Thus, the following query will return the domain and the number of users using that domain:

SELECT SPLIT(email, '@')[1] AS domain, COUNT(*) AS user_count
FROM dfs.book.`contacts.csvh`
GROUP BY SPLIT(email, '@')[1]
ORDER BY user_count DESC

Complex Data Conversion Functions

If you have data that is relatively clean, the CAST() function works well; however, let’s look at a real-life situation in which it will fail and how to correct that problem with the TO_ methods.

Let’s look again at the baltimore_salaries_2015.csvh file. This data contains two columns with salary data: AnnualSalary and GrossPay. However, if you wanted to do any kind of filtering or analysis using these columns, you’ll notice that we have two problems that will prevent you from performing any kind of numeric operation on them:

  • Each salary begins with a dollar sign ($).

  • Each salary also has a comma separating the thousands field.

You already used string manipulation functions and CAST() to clean the data. This time, you will use the TO_NUMBER(field, format) function, which accomplishes both tasks in one step.

Table 4-5 lists the special characters that the format string accepts.

Table 4-5. Number formatting characters
Symbol Location Meaning

0

Number

Digit.

#

Number

Digit, zero shows as absent.

.

Number

Decimal separator or monetary decimal separator.

-

Number

Minus sign.

,

Number

Grouping separator.

E

Number

Separates mantissa and exponent in scientific notation. Does not need to be quoted in prefix or suffix.

;

Subpattern boundary

Separates positive and negative subpatterns.

%

Prefix or suffix

Multiplies by 100 and shows as percentage.

(030)

Prefix or suffix

Multiplies by 1,000 and shows as per-mille value.

¤ (0A4)

Prefix or suffix

Currency sign, replaced by currency symbol. If doubled, replaced by international currency symbol. If present in a pattern, the monetary decimal separator is used instead of the decimal separator.

'

Prefix or suffix

Used to quote special characters in a prefix or suffix; for example, "'#'#"" formats 123 to ""#123"". To create a single quote itself, use two in a row: "# o''clock".

Therefore, to access the salary columns in the Baltimore Salaries data, you could use the following query:

SELECT TO_NUMBER(AnnualSalary, '¤000,000') AS ActualPay
FROM dfs.drill_tutorial.`baltimore_salaries_2015.csvh`
WHERE TO_NUMBER(AnnualSalary, '¤000,000')  >= 50000

Reformatting numbers

You can also use the TO_CHAR() function to round numbers or otherwise reformat them into a more human-readable VARCHAR format. If you are using TO_CHAR() for this purpose, you’ll need to use the same formatting operators as with the TO_NUMBER() function.

Working with Dates and Times in Drill

Similar to numeric fields, if you have data that has dates or times that you would like to analyze, you first need to explicitly convert them into a format that Drill recognizes. For this section, we use a file called dates.csvh, which contains five columns of random dates in various formats (see Figure 4-3).

Figure 4-3. Results of a date/time query

There are several ways of converting dates into a usable format, but perhaps the most reliable way is to use the TO_DATE(expression, format) function. This function takes two arguments: an expression that usually will be a column, and a date format that specifies the formatting of the date that you are reading. Unlike many relational databases, Drill typically uses Joda—aka Java 8 Date Time or java.time format—formatting for the date strings for all date/time functions; however, there are several functions to convert dates that support ANSI format.3 See Appendix A for a complete list of Drill functions, and see Table B-2 for a list of Joda date/time formatting characters.

Converting Strings to Dates

To demonstrate how to convert some dates, in the file dates.csvh you will see that there are five columns with random dates in the following formats:

date1

ISO 8601 format

date2

Standard “American” date with month/day/year

date3

Written date format: three-letter month, date, year

date4

Full ISO format with time zone and day of week

date5

MySQL database format (yyyy-mm-dd)

The first one is easy. Drill’s native CAST() function can interpret ISO 8601–formatted dates, so the following query would work:

SELECT CAST(date1 AS DATE)
FROM dfs.drill_tutorial.`dates.csvh`

The second column is a little trickier. Here, the month is first and the day is second. To ingest dates in this format, you need to use the TO_DATE(expression, format_string) function, as shown here:

SELECT date2, TO_DATE(date2, 'MM/dd/yyyy')
FROM dfs.drill_tutorial.`dates.csvh`

The third column contains dates that are formatted more in “plain text”: Sep 18, 2016, for instance. For these dates you also need to use the TO_DATE() function, but with a slightly different format string:

SELECT date3, TO_DATE(date3, 'MMM dd, yyyy')
FROM dfs.drill_tutorial.`dates.csvh`

The fourth column introduces some additional complexity in that it has a time value as well as a time zone, as you can see here: Sun, 19 Mar 2017 00:15:28 –0700. It can be ingested using the TO_DATE() function; however, in doing so, you will lose the time portion of the timestamps. Drill has a function called TO_TIMESTAMP() that is similar to TO_DATE() but retains the time components of a date/time:

SELECT date4, TO_TIMESTAMP(date4, 'EEE, dd MMM yyyy HH:mm:ss Z')
FROM dfs.drill_tutorial.`dates.csvh`

The final column in this exercise, date5, has dates in the format of yyyy-MM-dd, which is very commonly found in databases. You can parse it easily using the TO_DATE() function, as shown in the following query:

SELECT date5, TO_DATE(date5, 'yyyy-MM-dd')
FROM dfs.drill_tutorial.`dates.csvh`

Reformatting Dates

If you have a date or a time that you would like to reformat for a report, you can use the TO_CHAR(date/time_expression, format_string) function to convert a Drill date/time into a formatted string. Like the TO_DATE() function, TO_CHAR() uses Joda date formatting. Table 4-6 presents the Drill date conversion functions.

Table 4-6. Drill date conversion functions
You have a... You want a... Use...

VARCHAR

DATE

TO_DATE() or CAST(field AS DATE)

VARCHAR

TIME

TO_TIMESTAMP()

DATE

VARCHAR

TO_CHAR(field, format string)

Date Arithmetic and Manipulation

Much like other databases, Drill has the ability to perform arithmetic operations on dates and times; however, Drill has a specific data type called an INTERVAL that is returned whenever you perform an operation on dates. When converted to a VARCHAR, an INTERVAL is represented in the ISO 8601 duration syntax as depicted here:

P [quantity] Y [quantity] M [quantity] D T [quantity] H [quantity] M [quantity] S
P [quantity] D T [quantity] H [quantity] M [quantity] S
P [quantity] Y [quantity] M

It is important to understand this format because it is returned by every function that makes some calculation on dates. To try this out, using the dates.csvh file discussed earlier, the following query calculates the difference between two date columns:

SELECT date2,
       date5,
       (TO_DATE(date2, 'MM/dd/yyyy') - TO_DATE(date5, 'yyyy-MM-dd')) as date_diff
FROM dfs.drill_tutorial.`dates.csvh`

The resulting date_diff column contains values such as these:

  • P249D

  • P-5D

  • P-312D

You can extract the various components of the INTERVAL by using the EXTRACT() function, the syntax for which is EXTRACT(part FROM field).

Date and Time Functions in Drill

Drill has a series of date and time functions that are extremely useful in ad hoc data analysis and exploratory data analysis, which you can see listed in Table 4-7.

Table 4-7. Drill date/time functions
Function Return type Description

AGE(timestamp)

INTERVAL

Returns the age of the given time expression.

CURRENT_DATE

DATE

Returns the current date.

CURRENT_TIME

TIME

Returns the current time.

CURRENT_TIMESTAMP

TIMESTAMP

Returns the current timestamp.

DATE_ADD(keyword, date, int or interval)

DATE or TIMESTAMP

Returns the sum of the date and the second argument. This function accepts various input formats.

DATE_PART(keyword, date)

DOUBLE

Returns a portion of a date expression, specified by the keyword.

DATE_SUB(keyword, date, int or interval)

DATE or TIMESTAMP

Returns the difference between the date and the second argument. This function accepts various input formats.

EXTRACT(field FROM expr)

DOUBLE

Extracts and returns a part of the given date expression or interval.

LOCALTIME

TIME

Returns the local time.

LOCALTIMESTAMP

TIMESTAMP

Returns the current local time.

NOW()

TIMESTAMP

Returns the current local time.

TIMEOFDAY()

VARCHAR

Returns the current time with the time zone.

UNIX_TIMESTAMP(optional date string)

BIGINT

Returns a Unix timestamp of the current time if a date string is not specified, or the Unix timestamp of the specified date.

Time Zone Limitation in Drill

At the time of writing, Drill does not support converting dates from one time zone to another. Drill uses the time zone from your operating system unless you override it, which can cause inconsistencies in your analysis if you are looking at data with timestamps from different time zones.

The best workaround at this point is to set up Drill to use UTC and to convert all your data to UTC, as well. To set Drill to use UTC you need to add the following line to the drill-env.sh file in the DRILL_PATH/conf/ folder.

export DRILL_JAVA_OPTS=\
"$DRILL_JAVA_OPTS -Duser.timezone=UTC"

After you’ve done that, restart Drill and any Drillbits that might be running and then execute the following query to verify that Drill is running in UTC time:

SELECT TIMEOFDAY() FROM (VALUES(1))

Creating Views

Like many relational databases, Drill supports the creation of views. A view is a stored query that can include many data sources, functions, and other transformations of the data. You can use views to simplify queries, control access to your data, or even mask portions of it. As with relational databases, creating a view does not store the actual data, only the SELECT statement to retrieve the data.

Here’s the basic syntax to create a view:

CREATE [ OR REPLACE] VIEW workspace.view_name AS query

You can also specify the columns after the view name. If you do not do this, the column names will be extrapolated from the query. After you have created a view, you can use the view name in the query as if it were a regular table.

For instance, let’s say that you have a CSV file called orders.csvh, and you want to create a view of this data with data types already specified. You could use the following query to do so:

CREATE VIEW dfs.customers.order_data AS
SELECT columns[1] customer__first_name,
    columns[2] AS customer_last_name,
    CAST(columns[3] AS INT) AS product_id
    columns[4] AS product_name
    CAST(columns[5] AS FLOAT) AS price
FROM dfs.data.`orders.csv` 

Let’s also say that you wanted to find each customer’s average purchase amount. You could find that with the following query:

SELECT customer_last_name, customer_first_name, AVG(price)
FROM dfs.customers.order_data
GROUP BY customer_last_name, customer_first_name

In this example, by creating a view, you can eliminate all the data type conversions and assign readable field names to the data.4

Data Analysis Using Drill

Now that you know how to get Drill to recognize dates as well as numeric fields, you are ready to begin analyzing data. Much like any database, Drill has a variety of mathematical functions that you can use in the SELECT statements to perform calculations on your data. You can apply one of these functions to a column or nest functions, and the result will be a new column. Table 4-8 shows the operations that Drill currently supports.

Table 4-8. Drill mathematical functions
Function Return type Description

ABS(x)

Same as input

Returns the absolute value of the input argument x.

CBRT(x)

DOUBLE

Returns the cubic root of x.

CEIL(x)

Same as input

Returns the smallest integer not less than x.

CEILING(x)

Same as input

Same as CEIL().

DEGREES(x)

DOUBLE

Converts x radians to degrees.

E()

DOUBLE

Returns 2.718281828459045.

EXP(x)

DOUBLE

Returns e (Euler’s number) to the power of x.

FLOOR(x)

Same as input

Returns the largest integer not greater than x.

LOG(x)

DOUBLE

Returns the natural log (base e) of x.

LOG(x, y)

DOUBLE

Returns log base x to the y power.

LOG10(x)

DOUBLE

Returns the common log of x.

LSHIFT(x, y)

Same as input

Shifts the binary x by y times to the left.

MOD(x, y)

DOUBLE

Returns the remainder of x divided by y.

NEGATIVE(x)

Same as input

Returns x as a negative number.

PI

DOUBLE

Returns pi.

POW(x, y)

DOUBLE

Returns the value of x to the y power.

RADIANS

DOUBLE

Converts x degrees to radians.

RAND

DOUBLE

Returns a random number from 0–1.

ROUND(x)

Same as input

Rounds to the nearest integer.

ROUND(x, y)

DECIMAL

Rounds x to y decimal places.

RSHIFT(x, y)

Same as input

Shifts the binary x by y times to the right.

SIGN(x)

INT

Returns the sign of x.

SQRT(x)

Same as input

Returns the square root of x.

TRUNC(x, y)

Same as input

Truncates x to y decimal places. Specifying y is optional; the default is 1.

TRUNC(x, y)

DECIMAL

Truncates x to y decimal places.

All of these functions require the inputs to be some numeric data type, so don’t forget to convert your data into a numeric format before using one of these columns. The following example demonstrates how to calculate the square root of a column using Drill’s SQRT() function:

SELECT x, SQRT(CAST(x AS FLOAT)) AS sqrt_x
FROM dfs.data.csv

Table 4-9 presents the results.

Table 4-9. SQRT() query results
x sqrt_x

4

2

9

3

16

4

Summarizing Data with Aggregate Functions

In addition to the standard mathematical functions, Drill also includes the aggregate functions that follow, which perform a calculation on an entire column or grouping within a column. These functions are extremely useful when summarizing a dataset. A complete explanation of grouping in SQL is available in Chapters 13 and 14 of Viesca’s SQL Queries for Mere Mortals (Addison-Wesley). However, the essential structure of a query that summarizes data is as follows:

SELECT unique field, aggregate function(field)
FROM data
WHERE logical condition (Optional)
GROUP BY unique field or fields
HAVING logical condition (Optional)

You can use aggregate functions to summarize a given column, and in these instances the aggregate function will return one row. Table 4-10 lists the aggregate functions included with Drill as of this writing.

Table 4-10. Drill aggregate functions
Function Return type Description

AVG(column)

Same as argument type

Returns the average of the expression

COUNT(*)

BIGINT

Returns the number of rows

COUNT(DISTINCT)

BIGINT

Returns the number of unique values and the number of times they occurred

MIN(column)

Same as argument type

Returns the smallest value in the expression

STDDEV(column)

DECIMAL

Returns the sample standard deviation of the column

STDDEV_POP(column)

DECIMAL

Returns the population standard deviation of the input column

STDDEV_SAMP(column)

DECIMAL

Returns the sample standard deviation of the input column

SUM(column)

Same as argument type

Returns the sum of the given column

VARIANCE(column)

DECIMAL

Returns the sample variance of the input values (sample standard deviation squared)

VAR_POP(column)

DECIMAL

Returns the population variance of the input values (the population standard deviation squared)

VAR_SAMP(column)

DECIMAL

Sample variance of input values (sample standard deviation squared)

As an example, this book’s GitHub repository contains a data file called orders.csvh which contains notional sales data for one month. If you wanted to calculate some monthly stats, you could use the following query to do so:

SELECT
    SUM(CAST(purchase_amount AS FLOAT)) AS monthly_total,
    MAX(CAST(purchase_amount AS FLOAT)) AS largest_order,
    MIN(CAST(purchase_amount AS FLOAT)) AS smallest_order,
    COUNT(order_id) AS order_count,
    AVG(CAST(purchase_amount  AS FLOAT)) AS average_order
FROM dfs.drill_tutorial.`orders.csvh`

Figure 4-4 shows the results.

Figure 4-4. Stats query result

As you can see, the query calculates the total monthly sales, the largest order, the smallest order, the number of orders, and the average order amount for the month.

If you wanted to summarize this data by day, all you would need to do is add a GROUP BY statement at the end of the query, as shown in the following example:

SELECT 
    EXTRACT(day FROM TO_DATE(order_date, 'yyyy-MM-dd')) AS day_of_month,
    SUM(CAST(purchase_amount AS FLOAT)) AS daily_total,
    MAX(CAST(purchase_amount AS FLOAT)) AS largest_order,
    MIN(CAST(purchase_amount AS FLOAT)) AS smallest_order,
    COUNT(order_id) AS order_count,
    AVG(CAST(purchase_amount AS FLOAT)) AS average_order
FROM dfs.drill_tutorial.`orders.csvh`
GROUP BY TO_DATE(order_date, 'yyyy-MM-dd')
ORDER BY EXTRACT(day FROM TO_DATE(order_date, 'yyyy-MM-dd'))

Figure 4-5 illustrates the results of this query.

Figure 4-5. Data aggregated by day

If you wanted to present this data, you could format the numbers with a currency symbol and round them to the nearest decimal place by using the TO_NUMBER() function.

One major difference between Drill and relational databases is that Drill does not support column aliases in the GROUP BY clause. Thus, the following query would fail:

SELECT YEAR(date_field) AS year_field, COUNT(*)
FROM dfs.drill_tutorial.`data.csv`
GROUP BY year_field

To correct this you would simply need to replace GROUP BY year_field with GROUP BY YEAR(date_field).

Beware of Incorrect Column Names

If you include a field in the SELECT clause of a query that does not exist in your data, Drill will create an empty nullable INT column instead of throwing an error.

Other analytic functions: Window functions

In addition to the simple aggregate functions, Drill has a series of analytic functions known as window functions, which you can use to split the data and perform some operations on those sections. Although the window functions are similar to the aggregate functions used in a GROUP BY query, there is a difference in that window functions do not alter the number of rows returned. In addition to aggregating data, these functions are very useful for calculating metrics that relate to other rows, such as percentiles, rankings, and row numbers that do not necessarily need aggregation.

The basic structure of a query using a window function is as follows:

SELECT fields, FUNCTION(field)
OVER(
   PARTITION BY field 
   ORDER BY field (Optional) 
)

The PARTITION BY clause might look unfamiliar to you, but this is where you define the window or segment of the data upon which the aggregate function will operate.

In addition to AVG(), COUNT(), MAX(), MIN(), and SUM(), Drill includes several other window functions, as shown in Table 4-11.

Table 4-11. Drill window functions
Function Return type Description

CUME_DIST()

DOUBLE

Calculates the relative rank of the current row within the window: rows preceding/total rows.

DENSE_RANK()

BIGINT

Returns the rank of the row within the window. Does not create gaps in the event of duplicate values.

FIRST_VALUE()

Same as input

Returns the first value in the window.

LAG()

Same as input

Returns the value after the current value in the window, NULL if none exists.

LAST_VALUE()

Same as input

Returns the last value in the window.

LEAD()

Same as input

Returns the value prior to the current value in the window, or NULL if none exists.

NTILE()

INT

Divides the rows for each window into a number of ranked groups. Requires the ORDER BY clause in the OVER() clause.

PERCENT_RANK()

DOUBLE

Calculates the percent rank of the current row within the window: (rank – 1) / (rows – 1).

RANK()

BIGINT

Similar to DENSE_RANK() but includes rows with equal values in the calculation.

ROW_NUMBER()

BIGINT

Returns the number of the row within the window.

Comparison of aggregate and window analytic functions

The GitHub repository for this book contains a file called student_data.csvh, which contains some notional test scores from 10 different students. If you wanted to find each student’s average test score you could use either GROUP BY or window functions, but the GROUP BY option would be a simpler query. The following query accomplishes this task:

SELECT studentID, 
    AVG(CAST(score AS FLOAT)) AS average_score,
    MIN(CAST(score AS FLOAT)) AS min_score,
    MAX(CAST(score AS FLOAT)) AS max_score
FROM dfs.drill_tutorial.`student_data.csvh`
GROUP BY studentID

Figure 4-6 demonstrates that this query produces the desired result.

Figure 4-6. Aggregate query result

You could accomplish the same thing using window functions by using the following query:

SELECT DISTINCT studentID, 
    AVG(CAST(score AS FLOAT)) 
    OVER(
      PARTITION BY studentID 
    ) AS avg_score,
    MIN(CAST(score AS FLOAT))  
    OVER(
      PARTITION BY studentID) AS min_score, 
    MAX(CAST(score AS FLOAT))  
    OVER(
      PARTITION BY studentID 
    ) AS max_score
FROM dfs.drill_tutorial.`student_data.csvh`

Note that you must use the DISTINCT keyword; otherwise, you will get duplicate rows. If you are working with large datasets, it is useful to be aware of the different methods of summarizing datasets because one might perform better on your data than the other.

Another example would be ranking each student’s score per test and calculating their quartile. In this example, it is not possible to do this using GROUP BY, so we must use the window functions RANK() and NTILE(). The following query demonstrates how to do this:

SELECT studentID, 
    testnumber, 
    score, 
    RANK() OVER(
      PARTITION BY testnumber 
      ORDER BY CAST(score  AS FLOAT) DESC 
    ) AS ranking, 
    NTILE(4) OVER(PARTITION BY testnumber) AS quartile
FROM dfs.drill_tutorial.`student_data.csvh`

Figure 4-7 presents the results of this query.

Figure 4-7. Window query results

You cannot use the results from window functions directly in a WHERE clause; thus, if you want to filter results from a window function, you must place the original query in a subquery, as shown here:

SELECT * 
FROM (
  SELECT 
   studentID, testnumber, score, 
   RANK() OVER(
    PARTITION BY testnumber 
    ORDER BY CAST(score  AS FLOAT) DESC) AS ranking, 
    NTILE(4) OVER(PARTITION BY testnumber) AS quartile
  FROM dfs.drill_tutorial.`student_data.csvh`) AS student_data
WHERE student_data.quartile <= 2

Figure 4-8 illustrates what this query returns.

Figure 4-8. Window query results with filter

Common Problems in Querying Delimited Data

Because Drill allows you to query raw delimited data, it is quite likely that you will encounter certain problems along the way. We’ll look at how Drill processes data in more detail in Chapter 8, but for now, a little understanding of Drill can help you quickly get around these problems.

Spaces in Column Names

If your data has spaces or other reserved characters in a column name, or the column name itself is a reserved word, this will present a problem for Drill when you try to use the column name in a query. The easiest solution for this problem is to enclose all the column names in backticks (`) throughout your query.

For instance, this query will throw an error:

SELECT customer name, dob
FROM dfs.`customers.csv`

This query will not:

SELECT `customer name`, `dob`
FROM dfs.`customers.csv`

Sometimes, columns can have spaces at the end of the field name. Enclosing the field name in backticks can solve this problem, as well—for example:

SELECT `field_1 `
FROM ...

Illegal Characters in Column Headers

We encountered this problem a while ago, and it took quite some time to diagnose. Essentially, we were querying a series of CSV files, and although we could query the files using SELECT * or most of the field names, the field we were interested in happened to be the last field in the row. When we tried to query that field specifically, we got an index out of bounds exception.

The reason for the error was that some of the CSV files were saved on a Windows machine, which encodes new lines with both a carriage return character (\r) and a newline character (\n), whereas Macs and Linux-based systems encode line endings with just the newline character (\n). In practice, what this meant was that the last field had an extra character in the field name; when we tried to query it without that character, Drill couldn’t find the field and threw an exception.

The easiest solution is to make sure that all your files are encoded using the Mac/Linux standard; that way, you won’t have any such problems.

Reserved Words in Column Names

A rather insidious problem that can bite you if you are not careful is when your data or column aliases are Drill reserved words.5 Often, you will have a data file that has a column called year or date. Because both of these are reserved words, Drill will throw errors if you attempt to access these columns in a query.

The solution to this problem is twofold. First, be aware of the Drill reserved words list, and second, as with the other situations, if you enclose a field name in backticks it will allow you to use reserved words in the query. We recommend that if you have a data file with reserved words as column names, you should either rename them in the file or in the query with an AS clause. A good rule of thumb is to always use backticks around all column names.

Column Aliases

Drill treats column aliases a little differently than a relational database. In general, Drill is more restrictive than most relational databases as to where you can use a column alias, with the biggest difference being that you cannot use column aliases in a GROUP BY clause.

Conclusion

In this chapter, you learned how to query basic delimited data and to use Drill’s vast array of analytic functions to clean, prepare, and summarize that data for further analysis. In Chapter 5, you’ll take your analysis to the next level and learn how to apply these skills to nested datasets in formats such as JSON and Parquet, and other advanced formats.

1 The label “dfs” is actually arbitrary, and you can rename it whatever you want it to be. We will cover this later in this chapter.

2 This data is also available on the US government’s open data website.

3 For more information, see Oracle’s documentation on the DateTimeFormatter class.

4 Complete documentation for views in Drill is available on the Drill website.

5 You can find the complete list of Drill reserved words at the Apache Drill website.

Get Learning Apache Drill now with the O’Reilly learning platform.

O’Reilly members experience live online training, plus books, videos, and digital content from nearly 200 publishers.