Live Online Training

Data Engineering at Scale with Apache Hadoop and Spark: Effective Data Pipelines

Learn how to transform raw data into an effective feature matrix for further processing

Topic: Data
Douglas Eadline

Data engineering is an important step in developing Big Data analytics projects. As part of the Effective Data Pipelines series, this course provides background and examples on data "munging," or transforming raw data into a form that can be used with analytical modeling libraries. Also referred to as data wrangling, transformation, or ETL, these techniques are often performed "at scale" using Hadoop and Spark.

A little-known fact about data analytics projects is that about 70%–80% of the time is spent identifying and remediating data quality problems and transforming the raw data into what is known as a feature matrix. Data analytics cannot start until a usable feature matrix is completed.

What you'll learn and how you can apply it

  • Understand the steps needed for evaluating raw data at scale (including visualization)
  • Learn how to use the Apache Zeppelin Web Notebook to perform data munging
  • Understand how to test data quality and methods to remediate issues
  • Understand the feature matrix and how to derive and aggregate features
  • Understand feature scaling, one-hot encoding, and overfitting/underfitting
  • Learn how to use scalable sampling techniques with Apache Hive and PySpark
  • Learn how to extract "named entity" features from text streams

This training course is for you because...

  • You want to understand the methods for cleaning and transforming data
  • You want to learn how to create an effective feature matrix for analytical modeling
  • You want worked examples, provided as Apache Zeppelin Web Notebooks, that serve as a reference guide after the course is complete
  • You want to learn how to apply data munging at scale using Apache Hadoop and Spark
  • You want to run the (simple) examples in a real analytics environment, using the Linux Hadoop virtual machine that is provided


Prerequisites

  • Beginning/Intermediate Linux Command Line for Data Engineers (Live Online Training; search the O’Reilly learning platform for upcoming dates)
  • Apache Hadoop, Spark, and Kafka Foundations (Live Online Training; search the O’Reilly learning platform for upcoming dates)
  • Hands-on Introduction to Apache Hadoop, Spark, and Kafka Programming (Live Online Training; search the O’Reilly learning platform for upcoming dates)
  • If you have not taken any of the prerequisite courses and have no equivalent experience, you may find this course difficult to follow. This course is also a prerequisite for the final course in the series: Scalable Analytic Modeling with Apache Hadoop, Spark, and Kafka.

Setup Instructions:

  • To run the class examples, a Linux Hadoop Minimal Virtual Machine (VM) is available. The VM is a full Linux installation that can run on your laptop/desktop using VirtualBox (freely available). The VM provides a functional Hadoop/Hive/Spark environment to learn data engineering. It also includes the Zeppelin Web Notebook that is used in the class.
  • Further information on the class, access to the class notes, and the Linux Hadoop Minimal VM can be found at https://www.clustermonkey.net/scalable-analytics.
  • If you wish to follow along during class, install and test the sandbox at least one day before the class starts.

About your instructor

  • Douglas Eadline, PhD, began his career as an analytical chemist with an interest in computer methods. Starting with the first Beowulf how-to document, Doug has written instructional documents covering many aspects of Linux HPC (High Performance Computing) and Hadoop computing. Currently, Doug serves as editor of the ClusterMonkey.net website. He was previously editor of ClusterWorld Magazine and senior HPC editor for Linux Magazine. He is also an active writer and consultant to the HPC/Analytics industry. His recent video tutorials and books include the Hadoop and Spark Fundamentals LiveLessons video (Addison-Wesley), Hadoop 2 Quick Start Guide (Addison-Wesley), High Performance Computing for Dummies (Wiley), and Practical Data Science with Hadoop and Spark (co-author, Addison-Wesley).


Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Segment 1: Introduction and Course Goals (10 mins)

  • Class Resources and web page
  • How to get the most out of this course
  • Required prerequisite skills

Segment 2: The Linux Hadoop Minimal Virtual Machine (10 mins)

  • Linux Hadoop Minimal resources
  • Starting the Virtual Machine with Oracle VirtualBox
  • Connecting to the VM using SSH

Segment 3: Apache Zeppelin Web Notebook (10 mins)

  • Tour of basic features
  • Downloading and installing the course notebook

Segment 4: Visualizing Data with R and Python (20 mins)

  • Types of charts
  • Example visualizations

Segment 5: Data Munging Concepts (20 mins)

  • Data quality
  • Dealing with data quality issues
  • Using data at scale
  • Break (10 mins)

Segment 6: The Feature Matrix (30 mins)

  • Simple Features
  • Derived and Aggregated Features
  • Scaling Features
  • One hot encoding
  • Reducing dimensions
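
As a rough illustration of two topics in this segment, the following plain-Python sketch applies min-max scaling to a numeric column and one-hot encoding to a categorical column, then combines them into a small feature matrix. The records, field names, and helper functions are hypothetical and are not taken from the course notebooks, which work with Hive and PySpark at scale.

```python
# Hypothetical raw records; field names and values are made up for illustration.
rows = [
    {"age": 20, "color": "red"},
    {"age": 30, "color": "green"},
    {"age": 40, "color": "red"},
]


def min_max_scale(values):
    """Rescale numbers to the range [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def one_hot(values):
    """Return (sorted categories, one list of 0/1 indicators per value)."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


scaled_age = min_max_scale([r["age"] for r in rows])
categories, color_flags = one_hot([r["color"] for r in rows])

# Each feature-matrix row: scaled age followed by one column per color.
feature_matrix = [[a] + flags for a, flags in zip(scaled_age, color_flags)]

for row in feature_matrix:
    print(row)
```

Scaling keeps a large-valued column from dominating distance-based models, and one-hot encoding avoids implying an order among categories.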

Segment 7: Sampling Techniques (15 mins)

  • Sampling with Apache Hive
  • Sampling with PySpark
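
The idea behind fraction-based sampling can be sketched in plain Python: each row is kept independently with a fixed probability, so the data can be sampled in a single pass without knowing the total row count. The function name and data below are hypothetical, and this sketch does not use the Hive or PySpark sampling interfaces covered in class.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    return [row for row in rows if rng.random() < fraction]


rows = list(range(10_000))  # stand-in for a large table
sample = bernoulli_sample(rows, fraction=0.1, seed=42)

# The sample size is close to 10% of the input, but not exactly 10%.
print(len(sample))
```

Because each row is decided independently, the work splits cleanly across partitions of a distributed data set. The trade-off is that the sample size is only approximately the requested fraction.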

Segment 8: Example: Deriving and Aggregating Features (20 mins)

  • Derived Feature with PySpark
  • Aggregating Features with Hive
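
To make the distinction between the two kinds of features concrete, here is a minimal plain-Python sketch. A derived feature is computed from other columns of the same row, while an aggregated feature summarizes many rows per group. The trip records and field names are hypothetical; the class examples use PySpark and Hive instead.

```python
from collections import defaultdict

# Hypothetical trip records; field names and values are made up.
trips = [
    {"driver": "a", "miles": 10.0, "hours": 0.5},
    {"driver": "a", "miles": 30.0, "hours": 1.0},
    {"driver": "b", "miles": 20.0, "hours": 0.5},
]

# Derived feature: a new column computed from other columns of the same row.
for t in trips:
    t["mph"] = t["miles"] / t["hours"]

# Aggregated feature: one summary value per group (here, per driver).
totals = defaultdict(lambda: {"miles": 0.0, "trips": 0})
for t in trips:
    totals[t["driver"]]["miles"] += t["miles"]
    totals[t["driver"]]["trips"] += 1

avg_miles = {d: v["miles"] / v["trips"] for d, v in totals.items()}

print([t["mph"] for t in trips])
print(avg_miles)
```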

Segment 9: Example: Finding Text Features (30 mins)

  • Named Entity Extraction
  • Word Vectorization
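
Word vectorization can be illustrated with a simple bag-of-words count, sketched below in plain Python. The documents, tokenizer, and variable names are hypothetical, and the course may use a different vectorization method. Named entity extraction normally relies on a trained NLP model and is not shown here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


# Hypothetical documents, made up for illustration.
docs = ["Hadoop stores data", "Spark processes data fast"]

# The vocabulary fixes the column order of the resulting vectors.
vocab = sorted({word for doc in docs for word in tokenize(doc)})

vectors = []
for doc in docs:
    counts = Counter(tokenize(doc))
    vectors.append([counts[word] for word in vocab])

print(vocab)
for vec in vectors:
    print(vec)
```

Each document becomes a fixed-length numeric row, so text can sit alongside other columns in a feature matrix.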

Segment 10: Course Wrap-up, Questions, and Additional Resources (5 mins)