The Ultimate Hands-On Hadoop

Video description

Understanding Hadoop is a highly valuable skill for anyone whose company handles large amounts of data. Companies such as Amazon, eBay, Facebook, Google, LinkedIn, IBM, Spotify, Twitter, and Yahoo use Hadoop in some way to process huge volumes of data. This video course will familiarize you with the Hadoop ecosystem and help you understand how to apply Hadoop skills in the real world.

The course starts by taking you through installing Hadoop on your desktop. Next, you will manage big data on a cluster with Hadoop Distributed File System (HDFS) and MapReduce, and use Pig and Spark to analyze data on Hadoop. Moving along, you will learn how to store and query your data using tools such as Sqoop, Hive, MySQL, Phoenix, and MongoDB. You will then design real-world systems using the Hadoop ecosystem and learn how to manage clusters with Yet Another Resource Negotiator (YARN), Mesos, ZooKeeper, Oozie, Zeppelin, and Hue. Towards the end, you will learn techniques to handle and stream data in real time using Kafka, Flume, Spark Streaming, Flink, and Storm.

By the end of this course, you will be well versed in the Hadoop ecosystem and will have developed the skills required to store, analyze, and scale big data using Hadoop.

What You Will Learn

  • Become familiar with Hortonworks and the Ambari User Interface (UI)
  • Use Pig and Spark to create scripts to process data on a Hadoop cluster
  • Analyze non-relational data using HBase, Cassandra, and MongoDB
  • Query data interactively with Drill, Phoenix, and Presto
  • Publish data to your Hadoop cluster using Kafka, Sqoop, and Flume
  • Consume streaming data using Spark Streaming, Flink, and Storm


This video course is designed for people at every level, whether you are a software engineer or programmer who wants to understand the Hadoop ecosystem, a project manager who wants to become familiar with Hadoop's lingo, or a system architect who wants to understand the components available in the Hadoop system. A basic understanding of Python or Scala and basic familiarity with the Linux command line are recommended before starting this course.

About The Author

Frank Kane spent nine years at Amazon and IMDb, developing and managing the technology that automatically delivers product and movie recommendations to hundreds of millions of customers. He holds 17 issued patents in the fields of distributed computing, data mining, and machine learning. In 2012, Frank left to start his own successful company, Sundog Software, which focuses on virtual reality environment technology and on teaching others about big data analysis.

Table of contents

  1. Chapter 1 : Learning All the Buzzwords and Installing the Hortonworks Data Platform Sandbox
    1. Introduction and Installation of Hadoop
    2. The Hortonworks and Cloudera Merger and its Effects on the Course
    3. Hadoop Overview and History
    4. Overview of the Hadoop Ecosystem
  2. Chapter 2 : Using Hadoop's Core: Hadoop Distributed File System (HDFS) and MapReduce
    1. Hadoop Distributed File System (HDFS): What it is and How it Works
    2. Installing the MovieLens Dataset
    3. Activity - Installing the MovieLens Dataset into Hadoop's Distributed File System (HDFS) Using the Command Line
    4. MapReduce: What it is and How it Works
    5. How MapReduce Distributes Processing
    6. MapReduce Example: Breaking Down the Movie Ratings by Rating Score
    7. Activity - Installing Python, MRJob, and Nano
    8. Activity - Coding Up and Running the Ratings Histogram MapReduce Job
    9. Exercise – Ranking Movies by Their Popularity
    10. Activity - Checking Results
  3. Chapter 3 : Programming Hadoop with Pig
    1. Introducing Ambari
    2. Introducing Pig
    3. Example - Finding the Oldest Movie with a Five-Star Rating Using Pig
    4. Activity – Finding Old Five-Star Movies with Pig
    5. More Pig Latin
    6. Exercise - Finding the Most-Rated One-Star Movie
    7. Pig Challenge - Comparing Results
  4. Chapter 4 : Programming Hadoop with Spark
    1. Why Spark?
    2. Resilient Distributed Datasets (RDDs)
    3. Activity – Finding the Movie with the Lowest Average Rating with Resilient Distributed Datasets (RDDs)
    4. Datasets and Spark 2.0
    5. Activity – Finding the Movie with the Lowest Average Rating with DataFrames
    6. Activity – Recommending a Movie with Spark's Machine Learning Library (MLlib)
    7. Exercise – Filtering the Lowest-Rated Movies by Number of Ratings
    8. Activity - Checking Results
  5. Chapter 5 : Using Relational Datastores with Hadoop
    1. What is Hive?
    2. Activity – Using Hive to Find the Most Popular Movie
    3. How Hive Works
    4. Exercise – Using Hive to Find the Movie with the Highest Average Rating
    5. Comparing Solutions
    6. Integrating MySQL with Hadoop
    7. Activity – Installing MySQL and Importing Movie Data
    8. Activity - Using Sqoop to Import Data from MySQL to HDFS/Hive
    9. Activity – Using Sqoop to Export Data from Hadoop to MySQL
  6. Chapter 6 : Using Non-Relational Data Stores with Hadoop
    1. Why NoSQL?
    2. What is HBase?
    3. Activity – Importing Movie Ratings into HBase
    4. Activity – Using HBase with Pig to Import Data at Scale
    5. Cassandra – Overview
    6. Activity - Installing Cassandra
    7. Activity - Writing Spark Output into Cassandra
    8. MongoDB - Overview
    9. Activity - Installing MongoDB and Integrating Spark with MongoDB
    10. Activity - Using the MongoDB Shell
    11. Choosing Database Technology
    12. Exercise - Choosing a Database for a Given Problem
  7. Chapter 7 : Querying Data Interactively
    1. Overview of Drill
    2. Activity - Setting Up Drill
    3. Activity - Querying Across Multiple Databases with Drill
    4. Overview of Phoenix
    5. Activity - Installing Phoenix and Querying HBase
    6. Activity - Integrating Phoenix with Pig
    7. Overview of Presto
    8. Activity - Installing Presto and Querying Hive
    9. Activity - Querying Both Cassandra and Hive Using Presto
  8. Chapter 8 : Managing Your Cluster
    1. Yet Another Resource Negotiator (YARN)
    2. Tez
    3. Activity - Using Hive on Tez and Measuring the Performance Benefit
    4. Mesos
    5. ZooKeeper
    6. Activity - Simulating a Failing Master with ZooKeeper
    7. Oozie
    8. Activity – Setting Up a Simple Oozie Workflow
    9. Zeppelin - Overview
    10. Activity - Using Zeppelin to Analyze Movie Ratings - Part 1
    11. Activity - Using Zeppelin to Analyze Movie Ratings - Part 2
    12. Hue - Overview
    13. Other Technologies Worth Mentioning
  9. Chapter 9 : Feeding Data to Your Cluster
    1. Kafka
    2. Activity - Setting Up Kafka and Publishing Data
    3. Activity - Publishing Web Logs with Kafka
    4. Flume
    5. Activity - Setting Up Flume and Publishing Logs
    6. Activity – Setting Up Flume to Monitor a Directory and Store its Data in Hadoop Distributed File System (HDFS)
  10. Chapter 10 : Analyzing Streams of Data
    1. Spark Streaming: Introduction
    2. Activity - Analyzing Web Logs Published with Flume Using Spark Streaming
    3. Exercise - Monitor Flume-Published Logs for Errors in Real Time
    4. Exercise Solution: Aggregating the Hypertext Transfer Protocol (HTTP) Access Codes with Spark Streaming
    5. Apache Storm: Introduction
    6. Activity - Counting Words with Storm
    7. Flink: Overview
    8. Activity - Counting Words with Flink
  11. Chapter 11 : Designing Real-World Systems
    1. The Best of the Rest
    2. Review: How the Pieces Fit Together
    3. Understanding Your Requirements
    4. Sample Application: Consuming Web Server Logs and Keeping Track of Top-Sellers
    5. Sample Application: Serving Movie Recommendations to a Website
    6. Exercise - Designing a System to Report Web Sessions Per Day
    7. Exercise Solution: Designing a System to Count Daily Sessions
  12. Chapter 12 : Learning More
    1. Books and Online Resources

Product information

  • Title: The Ultimate Hands-On Hadoop
  • Author(s): Frank Kane
  • Release date: December 2020
  • Publisher(s): Packt Publishing
  • ISBN: 9781788478489