Live Online Training

Scalable Data Science with Apache Hadoop and Spark

Learn How to Apply Hadoop and Spark Tools to Predict Airline Delays

Douglas Eadline

A complete data science investigation requires different tools and strategies. In this course we develop a data science model to predict airline delays using historical data. All programming will be done using Hadoop and Spark with the Zeppelin web notebook. The notebook will be made available for download so students can reproduce the examples. Processing will take place on a small four-node Hadoop/Spark cluster. The course leans heavily on skills and concepts covered in several previous courses by the instructor. Courses in this series include:

  1. Practical Linux Command Line for Data Engineers and Analysts
  2. Apache Hadoop, Spark and Big Data Foundations
  3. Hands-on Introduction to Apache Hadoop and Spark Programming
  4. Scalable Data Science with Apache Hadoop and Spark

What you'll learn and how you can apply it

  • Understand how a data science project is developed
  • Learn how to review raw data and develop a feature matrix (data munging)
  • Learn how to use the Zeppelin web notebook
  • Understand how to use Pig, Hive, and PySpark for data science
  • Learn how an interactive data science process works
  • Understand how to use Spark MLlib

This training course is for you because...

  • You want to learn how to use various analytics tools together. These include Apache Pig, PySpark, Hive, and the Zeppelin web notebook.


It is highly recommended that prospective students take the following Live Online Training classes as prerequisites, or have adequate background in these areas. These classes are designed to support the examples presented in this class. Search the O'Reilly platform for the latest offerings of the following courses:

  • Practical Linux Command Line for Data Engineers and Analysts
  • Apache Hadoop, Spark and Big Data Foundations
  • Hands-on Introduction to Apache Hadoop and Spark Programming

Recommended Preparation

Course Set-up

Students can try some of the class examples (using smaller data sets) with a Linux Hadoop Minimal Virtual Machine (VM). The VM is a full Linux installation that can run on your laptop or desktop using VirtualBox (freely available). Course descriptions and links are available here: https://www.clustermonkey.net/scalable-analytics/doku.php?id=start

If you wish to follow along, install and test the VM at least one day before the class.

About your instructor

  • Douglas Eadline, PhD, began his career as an analytical chemist with an interest in computer methods. Starting with the first Beowulf how-to document, Doug has written instructional documents covering many aspects of Linux HPC (High Performance Computing) and Hadoop computing. Currently, Doug serves as editor of the ClusterMonkey.net website. He was previously editor of ClusterWorld Magazine and senior HPC Editor for Linux Magazine. He is also an active writer and consultant to the HPC/Analytics industry. His recent video tutorials and books include the Hadoop and Spark Fundamentals LiveLessons video (Addison Wesley), Hadoop 2 Quick Start Guide (Addison Wesley), High Performance Computing for Dummies (Wiley), and Practical Data Science with Hadoop and Spark (Co-author, Addison Wesley).


The timeframes are only estimates and may vary according to how the class is progressing.

Segment 1: Introduction, Course Goals, Set-Up (30 mins)

  • How to get the most out of this course
  • The iterative data science project life cycle
  • Project Background and Attribution
  • Cluster Computing Environment
  • Alternative: Linux Hadoop Minimal VM


Segment 2: Exploring the Data Set (20 mins)

  • Using the Zeppelin Web Notebook
  • Data Source and HDFS
  • Visualizing and exploring raw data


Break (5 mins)

Segment 3: Pre-processing: using Hadoop to build a feature matrix (25 mins)

  • Selecting possible predictive variables
  • Using Pig for pre-processing and feature matrix generation
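
As a rough illustration of what "building a feature matrix" means in this segment, the sketch below turns a few raw flight records into numeric feature rows plus a delayed/not-delayed label. It uses plain Python rather than Pig, and the column names, sample records, and 15-minute delay threshold are assumptions made for illustration, not the course's actual data set or script.

```python
import csv
import io

# Hypothetical raw flight records; column names and values are made up.
RAW = """Year,Month,DayOfWeek,CRSDepTime,Distance,DepDelay
2008,1,4,0930,810,22
2008,1,5,1745,1235,-3
2008,1,6,0615,393,NA
"""

DELAY_THRESHOLD_MIN = 15  # assumed cutoff for calling a flight "delayed"


def build_feature_matrix(text):
    """Return (features, labels) from CSV text, skipping rows with no delay value."""
    features, labels = [], []
    for row in csv.DictReader(io.StringIO(text)):
        if row["DepDelay"] == "NA":
            continue  # no label available, so drop the record
        hour = int(row["CRSDepTime"]) // 100  # scheduled departure hour
        features.append(
            [int(row["Month"]), int(row["DayOfWeek"]), hour, int(row["Distance"])]
        )
        labels.append(1 if int(row["DepDelay"]) >= DELAY_THRESHOLD_MIN else 0)
    return features, labels


features, labels = build_feature_matrix(RAW)
print(features)
print(labels)
```

On a cluster the same filter-and-project logic would run over files in HDFS; the small in-memory sample here only shows the shape of the result.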


Segment 4: Iteration One: Building Logistic Regression and Random Forest Models (35 mins)

  • Background on Regression and Random Forest Models
  • Evaluating progress
  • Using a Logistic Regression Model
  • Using a Random Forest Model
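
Evaluating progress between iterations usually comes down to comparing predicted labels with actual ones. The following is a minimal plain-Python sketch of three common metrics; the label lists are invented for illustration, and the course itself may report different measures.

```python
def evaluate(actual, predicted):
    """Compute accuracy, precision, and recall for binary labels (1 = delayed)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up labels standing in for model output on a held-out test set.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = evaluate(actual, predicted)
print(metrics)
```

Computing the same metrics for each model makes it possible to tell whether a later iteration actually improved on an earlier one.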


Break (5 mins)

Segment 5: Iteration Two: Improving the model with "One Hot Encoding" (OHE) (20 mins)

  • Background on One Hot Encoding (OHE)
  • Enhancing the model with OHE
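
For readers new to the term, one hot encoding replaces a categorical value with a vector containing a single 1 in the position for that category. The sketch below shows the idea in plain Python; the carrier codes are example values, and in the course this step would presumably be done with Spark tooling rather than a hand-written function.

```python
def one_hot_encode(values):
    """Return (categories, encoded rows) for a list of categorical values."""
    categories = sorted(set(values))  # fixed, repeatable column order
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the column for this category
        encoded.append(row)
    return categories, encoded


# Example carrier codes, used only for illustration.
carriers = ["AA", "DL", "UA", "DL"]
categories, encoded = one_hot_encode(carriers)
print(categories)
print(encoded)
```

Encoding categories this way keeps a model from treating arbitrary numeric codes (for example, carrier 3 versus carrier 1) as if they had an order or magnitude.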


Segment 6: Iteration Three: Enriching the Model with More Data (25 mins)

  • Adding weather data
  • Augmenting the feature matrix
  • Final model results
  • Possible enhancements


Segment 7: Course Wrap-up and Next Steps (15 mins)

  • Additional Resources
  • Accessing the class cluster
  • Remaining questions