Apache Drill

Data Exploration with Apache Drill

Published by O'Reilly Media, Inc.

Content level: Beginner to intermediate

Quick and Easy Manipulation and Analysis of Multiple Data Formats (at Scale)

Join Charles Givre for a hands-on introduction to data exploration with Apache Drill. Becoming a data-driven business means using all the data you have available, but a common problem in many organizations is that data is not optimally arranged for ad hoc analysis.

Through a combination of lecture and hands-on exercises, you'll gain the ability to access previously inaccessible data sources and analyze them with ease. You’ll learn how to use Drill to query and analyze structured data, connect multiple data sources to Drill, and perform cross-silo queries.

Study after study shows that data scientists and analysts spend between 50% and 90% of their time preparing their data for analysis. Using Drill, you can dramatically reduce the time it takes to go from raw data to insight. This course will show you how.

What you’ll learn and how you can apply it

  • How to quickly and efficiently analyze data using Drill
  • Key differences between Drill and a relational database
  • Drill’s strengths and weaknesses

And you’ll be able to:

  • Quickly cleanse, manipulate, and analyze data using Drill
  • Access Drill programmatically using Python or R
  • Use Drill to connect to and query multiple data sources

This live event is for you because...

You are a data analyst with experience querying data in SQL and you need to access a wide variety of data sources.

You are a database administrator and you need to offer access to non-relational data to your analysts.

Prerequisites

  • You should be generally familiar with SQL as well as common data formats

Before the course begins, you'll need to:

  • Download and install VirtualBox
  • Download the virtual machine and verify that you can start it
  • Clone the course repository onto the virtual machine
  • Familiarize yourself with basic SQL statements

Setup Instructions:

Download VirtualBox at virtualbox.org (or use any other virtualization software). Download the virtual machine (VM) at: http://bit.ly/griffon1

Once you’ve downloaded the virtual machine, please start the VM and clone the class repository. To do this, open the terminal prompt and type:

git clone https://github.com/cgivre/data-exploration-with-apache-drill

Note: SQL is not the focus of this class, so we will not be executing complex queries, but it is important to understand the fundamentals of SQL.

Recommended Preparation

Getting Started with SQL

Learning SQL

Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Introduction: An Overview of Drill (20 Min)

  • What does Drill do?
  • How does Drill work?
  • Kinds of data which can be queried with Drill
  • Q&A

Installing & Configuring Drill (20 Min)

  • Comparison of embedded and distributed modes
  • Introducing and configuring workspaces
  • Demonstrate Drill’s various interfaces
  • Exercise: Create a workspace for the course materials.

Break (10 Min)

Querying Simple Delimited Data (20 Min)

  • A 10-minute crash course in SQL
  • Querying a simple CSV file
  • Arrays in Drill
  • Accessing columns in arrays
  • Exercise: Create a report containing an individual’s first name, last name, department and gross pay from the Baltimore salaries data set

Configuration Options (10 Min)

  • Extracting headers from CSV files
  • Changing delimiter characters
  • Specifying options in a query
  • Exercise: Rewrite the previous query so that Drill is extracting the column headers.
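
As a rough preview of the two topics above, the sketch below builds the two styles of query as Python strings: one that reads fields by position from Drill's `columns` array (how Drill exposes a delimited file when headers are not extracted), and one that asks Drill to extract headers through a table function in the query itself. The file path, the column names, and the helper function names are hypothetical and are not taken from the course materials.

```python
# Hypothetical file location; the course data set may live elsewhere.
CSV_PATH = "dfs.`/data/baltimore_salaries.csv`"


def query_by_position(path, positions):
    """Build a query that reads fields from Drill's `columns` array by index."""
    fields = ", ".join(f"columns[{i}] AS field_{i}" for i in positions)
    return f"SELECT {fields} FROM {path}"


def query_with_headers(path, names):
    """Build a query that extracts the header row via a table function."""
    # Column names are quoted with backticks, as Drill does for identifiers.
    fields = ", ".join(f"`{name}`" for name in names)
    return (
        f"SELECT {fields} "
        f"FROM table({path}(type => 'text', fieldDelimiter => ',', "
        f"extractHeader => true))"
    )


# Assumed column names, for illustration only.
print(query_by_position(CSV_PATH, [0, 1]))
print(query_with_headers(CSV_PATH, ["name", "jobtitle"]))
```

The same header extraction can also be enabled in the storage plugin's format configuration rather than per query; the table function form is shown here only because it keeps the option visible in the SQL.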

Understanding Data Types and Functions in Drill (20 Min)

  • Overview of Drill Data Types
  • Converting Strings to Numeric Data Types
  • Complex Conversions
  • Windowing functions
  • Exercise: Using the Baltimore Salaries dataset, calculate the average salary for each job category

Q&A (20 Min)

End of Day 1

TBD: Homework assignment

DAY TWO

Working with Dates and Times in Drill (20 Min)

  • Understanding dates and times in Drill
  • Converting strings to dates
  • Reformatting dates
  • Intervals and date/time arithmetic in Drill
  • Exercise: Using the file dates.csv, complete the worksheet on date conversions

Analyzing Nested Data with Drill (20 Min)

  • Issues querying nested data with Drill
  • Maps and Arrays in Drill
  • Querying deeply nested data in Drill
  • Exercise: Using the Baltimore data set in JSON format, recreate the queries we have already done

Other Data Types (20 Min)

  • Log files
  • HTTPD
  • Exercise: Cyber worksheet

Connecting Multiple Data Sources (20 Min)

  • MySQL
  • Hadoop
  • MongoDB
  • Issues when using other data sources

Programmatically Connecting to Drill (20 Min)

  • Python
  • R
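
One way to reach Drill from Python, sketched here with only the standard library, is to POST a query to Drill's REST endpoint (`/query.json`). The host, the port (8047 is Drill's default web port), and the function names are assumptions for illustration; the course may well use a dedicated client library instead.

```python
import json
import urllib.request


def build_drill_request(sql, host="localhost", port=8047):
    """Build (but do not send) a POST request for Drill's REST query endpoint."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}:{port}/query.json",
        data=payload,
        headers={"Content-Type": "application/json"},
    )


def run_query(sql, host="localhost", port=8047):
    """Send the query to a running Drillbit and return the result rows."""
    with urllib.request.urlopen(build_drill_request(sql, host, port)) as resp:
        return json.load(resp).get("rows", [])


# Building the request needs no running Drill instance.
req = build_drill_request("SELECT * FROM sys.version")
print(req.full_url)
```

Calling `run_query` requires a Drill instance listening on the given host and port, such as the one on the course virtual machine.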

Q&A (20 Min)

Your Instructor

  • Charles Givre

    Charles Givre is a lead data scientist in the Cybersecurity Technology and Controls Group at JPMorgan Chase, where he works at the intersection of cybersecurity and data science. Previously, he was a senior lead data scientist at Booz Allen Hamilton on one of the firm's largest analytic programs, where he led data science efforts and worked to expand the role of data science in the program, and worked as a counterterrorism analyst at the Central Intelligence Agency for five years. One of his research interests is increasing the productivity of data science and analytic teams; to that end, he’s been working extensively to promote the use of Apache Drill in security applications and has contributed to the codebase. He’s also a coauthor of Learning Apache Drill from O’Reilly.

    Charles is passionate about teaching others data science and analytic skills and has led data science classes all over the world for clients, universities, and conferences, including Black Hat and the Center for Research in Applied Cryptography and Cyber Security at Bar-Ilan University. A sought-after speaker, he’s also delivered presentations at major industry conferences such as Strata-Hadoop World, Open Data Science Conference, and others. He recently served as program chair of the Strategic Analytics Program at Brandeis University's Graduate School of Professional Studies and is currently a member of the advisory board. He holds a master’s degree in Middle Eastern studies from Brandeis University as well as both a bachelor of science in computer science and a bachelor of music from the University of Arizona. Charles speaks French reasonably well and plays trombone. He lives in Baltimore with his family and in his nonexistent spare time is restoring a classic British sports car.
