Data Exploration with Apache Drill
Quick and Easy Manipulation and Analysis of Multiple Data Formats (at Scale)
Join Charles Givre for a hands-on introduction to data exploration with Apache Drill. Becoming a data-driven business means using all the data you have available, but a common problem in many organizations is that data is not optimally arranged for ad hoc analysis.
Through a combination of lecture and hands-on exercises, you'll gain the ability to access previously inaccessible data sources and analyze them with ease. You’ll learn how to use Drill to query and analyze structured data, connect multiple data sources to Drill, and perform cross-silo queries.
Study after study shows that data scientists and analysts spend between 50% and 90% of their time preparing their data for analysis. Using Drill, you can dramatically reduce the time it takes to go from raw data to insight. This course will show you how.
What you'll learn and how you can apply it
By the end of this course, you'll understand:
- How to quickly and efficiently analyze data using Drill
- Key differences between Drill and a relational database
- Drill’s strengths and weaknesses
And you’ll be able to:
- Quickly cleanse, manipulate, and analyze data using Drill
- Access Drill programmatically using Python or R
- Use Drill to connect to and query multiple data sources
This training course is for you because...
You are a data analyst with experience querying data in SQL and you need to access a wide variety of data sources.
You are a database administrator and you need to offer access to non-relational data to your analysts.
Prerequisites:
- You should be generally familiar with SQL as well as common data formats
Before the course begins, you'll need to:
- Download and install VirtualBox
- Download the virtual machine and verify that you can start it
- Clone the course repository onto the virtual machine
- Familiarize yourself with basic SQL statements
Download VirtualBox (or any other virtualization software) at virtualbox.org. Download the virtual machine (VM) at: http://bit.ly/griffon1
Once you’ve downloaded the virtual machine, start the VM and clone the class repository. To do this, open a terminal and type:
git clone https://github.com/cgivre/data-exploration-with-apache-drill
Note: The focus of this class is not SQL, so we will not be executing complex queries. It is still important to understand the fundamentals of SQL.
About your instructor
Mr. Charles Givre has always been interested in solving problems in unique ways, and has worked to make a career of it as a data scientist at Booz Allen Hamilton. At Booz Allen, Mr. Givre worked as a technical leader on various large government projects. Mr. Givre enjoys sharing his passion for data science with others and has worked to develop comprehensive data science training programs at his firm. Prior to joining Booz Allen, Mr. Givre worked as a counterterrorism analyst at the Central Intelligence Agency for nearly five years.
Mr. Givre became interested in Apache Drill several years ago and is co-author of the first O’Reilly book about Drill. He has delivered numerous workshops about Drill and has contributed to the codebase. Mr. Givre is a sought-after speaker and has delivered training and talks at international conferences such as BlackHat, Strata + Hadoop World, Open Data Science Conference (ODSC), and others. Mr. Givre holds a Master of Arts in Middle Eastern Studies from Brandeis University, as well as a Bachelor of Science in Computer Science and a Bachelor of Music, both from the University of Arizona. Mr. Givre also holds a CISSP, Security+, and various other certifications. Mr. Givre blogs at thedataist.com. In his nonexistent spare time, he enjoys spending time with his family and restoring classic cars.
The timeframes are only estimates and may vary according to how the class is progressing.
Introduction: An Overview of Drill (20 Min)
- What does Drill do?
- How does Drill work?
- Kinds of data which can be queried with Drill
Installing & Configuring Drill (20 Min)
- Comparison of embedded and distributed modes
- Introducing and configuring workspaces
- Demonstrate Drill’s various interfaces
- Exercise: Create a workspace for the course materials.
Break (10 Min)
Querying Simple Delimited Data (20 Min)
- A 10-minute crash course in SQL
- Querying a simple CSV file
- Arrays in Drill
- Accessing columns in arrays
- Exercise: Create a report containing an individual’s first name, last name, department and gross pay from the Baltimore salaries data set
Configuration Options (10 Min)
- Extracting headers from CSV files
- Changing delimiter characters
- Specifying options in a query
- Exercise: Rewrite the previous query so that Drill is extracting the column headers.
Understanding Data Types and Functions in Drill (20 Min)
- Overview of Drill Data Types
- Converting Strings to Numeric Data Types
- Complex Conversions
- Windowing functions
- Exercise: Using the Baltimore Salaries dataset, calculate the average salary for each job category
Q&A (20 Min)
End of Day 1
TBD: Homework assignment
Working with Dates and Times in Drill (20 Min)
- Understanding dates and times in Drill
- Converting strings to dates
- Reformatting dates
- Intervals and date/time arithmetic in Drill
- Exercise: Using the file dates.csv, complete the worksheet on date conversions
Analyzing Nested Data with Drill (20 Min)
- Issues querying nested data with Drill
- Maps and Arrays in Drill
- Querying deeply nested data in Drill
- Exercise: Using the Baltimore data set in JSON format, recreate the queries we have already done
Other Data Types (20 Min)
- Log files
- Exercise: Cyber worksheet
Connecting Multiple Data Sources (20 Min)
- Issues when using other data sources
Programmatically Connecting to Drill (20 Min)
Q&A (20 Min)