Live Online Training

Hands-on Introduction to Apache Hadoop, Spark, and Kafka Programming: Effective Data Pipelines

A quick-start introduction to the important facets of big data analytics

Douglas Eadline

This live training course provides the "first touch" hands-on experience needed to start using essential tools in the Apache Hadoop and Spark ecosystem. Tools presented include the Hadoop Distributed File System (HDFS), Apache Pig, Hive, Sqoop, Spark, Kafka, and the Zeppelin web notebook. The topics are presented in a "soup-to-nuts" fashion with minimal assumptions about prior experience. Examples are run from two vantage points: the command line and the Zeppelin web notebook. As part of the course, students can download a small Hadoop/Spark virtual machine to run the course examples (including the Zeppelin notebook). After completing the course, attendees will have the skills needed to begin their own analytics projects.

What you'll learn and how you can apply it

  • Navigate and use the Hadoop Distributed File System (HDFS)
  • Learn how to run, monitor, inspect, and stop applications in a Hadoop environment
  • Learn how to start and run Apache Hive and Spark applications from the command line
  • Use Sqoop to import and export database tables to and from HDFS
  • Learn how to configure and use Kafka as a data broker
  • Start and use the Zeppelin Web GUI for Hive and Spark application development

This training course is for you because...

  • You are a beginning developer who wants to quickly learn how to navigate the Hadoop, Spark, and Kafka development environment.
  • You are an administrator tasked with providing and supporting a Hadoop/Spark/Kafka environment for your organization.
  • You are a data scientist who does not have experience with scalable tools such as Hadoop, Spark, Sqoop, Kafka, or the Zeppelin web notebook.
  • You want to be able to work with the code and class examples at your own pace after the class is complete.

Prerequisites

This course moves at a fast pace. It is highly recommended that you take the following courses (or have demonstrated competency in these areas) before attempting this one:

  1. Apache Hadoop, Spark, and Kafka Foundations (Live Online Training)
  2. Beginning Linux Command Line for Data Engineers and Analysts (Live Online Training)

The following video and book also provide background on some of these topics:

  • Video: https://www.safaribooksonline.com/library/view/hadoop-fundamentals-livelessons/9780134052489/
  • Book: https://www.safaribooksonline.com/library/view/hadoop-2-quick-start/9780134050119/

Setup Instructions:

To run the examples during and after the class, a Linux Hadoop Minimal Virtual Machine (VM) is available. The VM is a full Linux installation that can run on your laptop/desktop using Oracle VirtualBox (freely available). Further information on class resources, access to the class notes, and the Linux Hadoop Minimal VM can be found at https://www.clustermonkey.net/scalable-analytics

About your instructor

  • Douglas Eadline, PhD, began his career as an analytical chemist with an interest in computer methods. Starting with the first Beowulf how-to document, Doug has written instructional documents covering many aspects of Linux HPC (High Performance Computing) and Hadoop computing. Currently, Doug serves as editor of the ClusterMonkey.net website; he was previously editor of ClusterWorld Magazine and senior HPC editor for Linux Magazine. He is also an active writer and consultant to the HPC/analytics industry. His recent video tutorials and books include the Hadoop and Spark Fundamentals LiveLessons video (Addison-Wesley), the Hadoop 2 Quick-Start Guide (Addison-Wesley), High Performance Computing for Dummies (Wiley), and Practical Data Science with Hadoop and Spark (co-author, Addison-Wesley).

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

DAY 1

Segment 1: Introduction and Quick Overview of Hadoop and Spark (45 mins)

  • Instructor explains how the course will work
  • This section provides a brief background on Hadoop, Spark, and Kafka

Segment 2: Installing and Running the Linux Hadoop Minimal Virtual Machine (20 mins)

  • Virtual Machine background
  • Downloading and installing the VM
  • Starting the virtual machine on Windows, Mac, and Linux

Segment 3: Using the Hadoop Distributed File System (HDFS) (25 mins)

  • Demonstrate how to use basic HDFS commands (a sample session appears after this list)
  • Show how to use the HDFS web GUI
  • Contrast a real cluster with the Linux Hadoop Minimal Virtual Machine
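
For a flavor of this segment, a minimal HDFS session from the command line might look like the following (the directory and file names are illustrative, not part of the course materials):

    $ hdfs dfs -mkdir -p /user/hands-on/test
    $ hdfs dfs -put local-data.txt /user/hands-on/test
    $ hdfs dfs -ls /user/hands-on/test
    $ hdfs dfs -cat /user/hands-on/test/local-data.txt
    $ hdfs dfs -get /user/hands-on/test/local-data.txt local-copy.txt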

Break: 10 minutes

Segment 4: Running and Monitoring Hadoop Applications (40 mins)

  • Instructor demonstrates how to run Hadoop example applications and benchmarks (sample commands below)
  • A live tour of the YARN web GUI will be presented for a running application
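
As a preview, launching one of the bundled MapReduce examples and then monitoring and stopping it from the command line might look like this (the jar path assumes a standard installation with HADOOP_HOME set; the application ID is a placeholder):

    $ yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 8 100000
    $ yarn application -list
    $ yarn application -kill application_1234567890123_0001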

Segment 5: Running Apache Sqoop (40 mins)

  • Moving data from MySQL to Hadoop/HDFS and back to MySQL will be demonstrated
  • Various Sqoop options will be demonstrated, including using multiple mappers (see the sketch below)
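
A sketch of the kind of Sqoop commands demonstrated (the MySQL database, tables, and HDFS paths are illustrative; the -m option sets the number of mappers):

    $ sqoop import --connect jdbc:mysql://localhost/world \
        --username sqoop -P --table City \
        -m 4 --target-dir /user/hands-on/city
    $ sqoop export --connect jdbc:mysql://localhost/world \
        --username sqoop -P --table CityExport \
        --export-dir /user/hands-on/city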

DAY 2

Segment 6: Using Apache Hive (30 mins)

  • Instructor will demonstrate a simple interactive Hive SQL example (sketched below)
  • Running the same example from a script will also be presented
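
As a preview, an interactive session and the same query run from a script might look like this (the table and query are illustrative):

    $ hive
    hive> CREATE TABLE names (id INT, name STRING)
        > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    hive> SELECT COUNT(*) FROM names;
    hive> exit;
    $ hive -f count-names.sql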

Segment 7: Running Apache Spark (pySpark) (50 mins)

  • The interactive pySpark word count example will be explained to illustrate RDDs, mapping, reducing, filtering, and lambda functions (sketched after this list)
  • A procedure for importing CSV data into pySpark dataframes for SQL analysis will be presented
  • A stand-alone pi estimator program will be demonstrated
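
For a flavor of this segment, the classic word count in the interactive pyspark shell looks roughly like this (the input file is illustrative; the sc and spark objects are created automatically by the shell in Spark 2 and later):

    $ pyspark
    >>> lines = sc.textFile("/user/hands-on/war-and-peace.txt")
    >>> words = lines.flatMap(lambda line: line.split())
    >>> counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
    >>> counts.filter(lambda pair: pair[1] > 100).take(5)

In the same shell, a CSV file can be loaded into a dataframe and queried with SQL:

    >>> df = spark.read.csv("/user/hands-on/data.csv", header=True, inferSchema=True)
    >>> df.createOrReplaceTempView("data")
    >>> spark.sql("SELECT COUNT(*) FROM data").show()

A stand-alone program, such as the pi estimator that ships with Spark, is submitted from the command line (the path assumes SPARK_HOME is set):

    $ spark-submit $SPARK_HOME/examples/src/main/python/pi.py 10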

Break: 10 minutes

Segment 8: Running Apache Kafka (50 mins)

  • A simple Kafka design is presented
  • Sending messages with producers
  • Reading messages with consumers
  • Connecting PySpark to Kafka (see the sketch after this list)
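
As a preview, the Kafka console tools give a quick feel for the producer/consumer model (the topic name is illustrative; recent Kafka releases use --bootstrap-server, while older ones use --zookeeper and --broker-list):

    $ kafka-topics.sh --bootstrap-server localhost:9092 --create --topic test-topic
    $ kafka-console-producer.sh --bootstrap-server localhost:9092 --topic test-topic
    > hello kafka
    $ kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test-topic --from-beginning

Connecting PySpark to the same topic requires the Spark-Kafka connector package, with a version that matches your Spark build (the coordinate below is an example for Spark 3.3):

    $ pyspark --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0
    >>> df = spark.readStream.format("kafka") \
    ...     .option("kafka.bootstrap.servers", "localhost:9092") \
    ...     .option("subscribe", "test-topic").load()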

Segment 9: Example Analytics Application using Apache Zeppelin (30 mins)

  • Major features of the Zeppelin web notebook will be demonstrated
  • The Zeppelin notebook will be used to run some course examples (a sample paragraph is sketched below)
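
For a flavor of the notebook, each Zeppelin paragraph selects its interpreter with a directive on the first line. The dataframe example from Segment 7 could be run as two paragraphs like these (interpreter names depend on how Zeppelin is configured):

    %spark.pyspark
    df = spark.read.csv("/user/hands-on/data.csv", header=True, inferSchema=True)
    df.createOrReplaceTempView("data")

    %sql
    SELECT COUNT(*) FROM data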

Segment 10: Wrap-up/Where to Go Next (10 mins)

  • A brief summary of course takeaways
  • Next steps in learning scalable analytics