Master Big Data Ingestion and Analytics with Flume, Sqoop, Hive and Spark

Video description

In this course, you will start by learning about the Hadoop Distributed File System (HDFS) and the most common Hadoop commands required to work with HDFS. Next, you’ll be introduced to Sqoop Import, which will help you gain insights into the lifecycle of the Sqoop command and how to use the import command to migrate data from MySQL to HDFS, and from MySQL to Hive.

In addition to this, you will get up to speed with Sqoop Export for migrating data effectively, along with using Apache Flume to ingest data. As you progress, you will delve into Apache Hive, external and managed tables, working with different files, and Parquet and Avro. Toward the concluding section, you will focus on Spark DataFrames and Spark SQL.
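As a taste of the Flume material, an agent that tails a log file into HDFS is driven entirely by a short properties file. The agent, source, channel, and sink names below are illustrative, and the sketch assumes a running HDFS cluster:

```properties
# Illustrative Flume agent: exec source -> memory channel -> HDFS sink
agent1.sources  = tail-src
agent1.channels = mem-ch
agent1.sinks    = hdfs-sink

# Exec source runs a command and emits each output line as an event
agent1.sources.tail-src.type     = exec
agent1.sources.tail-src.command  = tail -F /var/log/app/app.log
agent1.sources.tail-src.channels = mem-ch

# In-memory channel buffers events between source and sink
agent1.channels.mem-ch.type     = memory
agent1.channels.mem-ch.capacity = 10000

# HDFS sink writes events into a date-bucketed directory
agent1.sinks.hdfs-sink.type                   = hdfs
agent1.sinks.hdfs-sink.channel                = mem-ch
agent1.sinks.hdfs-sink.hdfs.path              = /flume/events/%Y-%m-%d
agent1.sinks.hdfs-sink.hdfs.fileType          = DataStream
agent1.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
```

The course's Flume chapter builds variations on exactly this source/channel/sink wiring, swapping in Twitter and NetCat sources and adding interceptors.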

By the end of this course, you will have gained comprehensive insights into big data ingestion and analytics with Flume, Sqoop, Hive, and Spark.

What You Will Learn

  • Explore the Hadoop Distributed File System (HDFS) and commands
  • Get to grips with the lifecycle of the Sqoop command
  • Use the Sqoop Import command to migrate data from MySQL to HDFS and Hive
  • Understand split-by and boundary queries
  • Use the incremental mode to migrate data from MySQL to HDFS
  • Employ Sqoop Export to migrate data from HDFS to MySQL
  • Discover Spark DataFrames and gain insights into working with different file formats and compression
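The Sqoop items above boil down to a handful of CLI invocations. The host, database, table, and column names in this sketch are placeholders, and the commands assume a running Hadoop cluster and a reachable MySQL instance:

```shell
# Import a MySQL table into HDFS, splitting work across 4 mappers
# on an (assumed numeric) primary key column
sqoop import \
  --connect jdbc:mysql://dbhost/retail_db \
  --username dbuser -P \
  --table orders \
  --split-by id \
  --num-mappers 4 \
  --target-dir /user/hive/warehouse/orders

# Incremental append: only pull rows whose id exceeds the last value seen
sqoop import \
  --connect jdbc:mysql://dbhost/retail_db \
  --username dbuser -P \
  --table orders \
  --incremental append \
  --check-column id \
  --last-value 68883

# Export HDFS files back into a MySQL table
sqoop export \
  --connect jdbc:mysql://dbhost/retail_db \
  --username dbuser -P \
  --table order_summary \
  --export-dir /results/order_summary
```

Split-by and boundary queries control how the import range is partitioned across mappers; the incremental flags are what make repeated, delta-only migrations practical.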

Audience

This course is for anyone who wants to learn Sqoop and Flume, as well as those working toward CCA or HDP certification.

About The Author

Navdeep Kaur: Technical Trainer

Navdeep Kaur is a big data professional with 11 years of industry experience across different technologies and domains. She has a keen interest in delivering training in new technologies. She holds the CCA175 Hadoop and Spark Developer certification and the AWS Solutions Architect certification. She loves guiding people and helping them achieve new goals.

Publisher resources

Download Example Code

Table of contents

  1. Chapter 1 : Hadoop Introduction
    1. HDFS and Hadoop Commands
  2. Chapter 2 : Sqoop Import
    1. Sqoop Introduction
    2. Managing Target Directories
    3. Working with Different File Formats
    4. Working with Different Compressions
    5. Conditional Imports
    6. Split-by and Boundary Queries
    7. Field Delimiters
    8. Incremental Appends
    9. Sqoop Hive Import
    10. Sqoop List Tables/Database
    11. Sqoop Import Practice1
    12. Sqoop Import Practice2
    13. Sqoop Import Practice3
  3. Chapter 3 : Sqoop Export
    1. Export from HDFS to MySQL
    2. Export from Hive to MySQL
  4. Chapter 4 : Apache Flume
    1. Flume Introduction and Architecture
    2. Exec Source and Logger Sink
    3. Moving data from Twitter to HDFS
    4. Moving data from NetCat to HDFS
    5. Flume Interceptors
    6. Flume Interceptor Example
    7. Flume Multi-Agent Flow
    8. Flume Consolidation
  5. Chapter 5 : Apache Hive
    1. Hive Introduction
    2. Hive Database
    3. Hive Managed Tables
    4. Hive External Tables
    5. Hive Inserts
    6. Hive Analytics
    7. Working with Parquet
    8. Compressing Parquet
    9. Working with Fixed File Format
    10. Alter Command
    11. Hive String Functions
    12. Hive Date Functions
    13. Hive Partitioning
    14. Hive Bucketing
  6. Chapter 6 : Spark Introduction
    1. Spark Introduction
    2. Resilient Distributed Datasets
    3. Cluster Overview
    4. Directed Acyclic Graph (DAG) Stages
  7. Chapter 7 : Spark Transformations and Actions
    1. Map/FlatMap Transformation
    2. Filter/Intersection
    3. Union/Distinct Transformation
    4. GroupByKey/ Group people based on Birthday months
    5. ReduceByKey / Total Number of students in each Subject
    6. SortByKey / Sort students based on their rollno
    7. MapPartition / MapPartitionWithIndex
    8. Change number of Partitions
    9. Join / Join email address based on customer name
    10. Spark Actions
  8. Chapter 8 : Spark RDD Practice
    1. Scala Tuples
    2. Extract Error Logs from log files
    3. Frequency of word in Text File
    4. Population of each City
    5. Orders placed by Customers
    6. Movie Average Rating greater than 3
  9. Chapter 9 : Spark Dataframes and Spark SQL
    1. Dataframe Intro
    2. Dataframe from JSON Files
    3. Dataframe from Parquet Files
    4. Dataframe from CSV Files
    5. Dataframe from Avro/XML Files
    6. Working with Different Compressions
    7. DataFrame API Part1
    8. DataFrame API Part2
    9. Spark SQL
    10. Working with Hive Tables in Spark
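The managed-versus-external distinction in the Hive chapter comes down to who owns the data when a table is dropped. The table names and HDFS path in this HiveQL sketch are illustrative:

```sql
-- Managed table: Hive owns the files; DROP TABLE deletes the data too
CREATE TABLE orders_managed (
  id INT,
  order_date STRING,
  status STRING
)
STORED AS PARQUET;

-- External table: Hive tracks only metadata; DROP TABLE leaves the
-- files at the LOCATION untouched
CREATE EXTERNAL TABLE orders_external (
  id INT,
  order_date STRING,
  status STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/landing/orders';
```

External tables are the usual choice for data landed by ingestion tools such as Sqoop and Flume, since dropping and recreating the table definition never risks the underlying files.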

Product information

  • Title: Master Big Data Ingestion and Analytics with Flume, Sqoop, Hive and Spark
  • Author(s): Navdeep Kaur
  • Release date: July 2019
  • Publisher(s): Packt Publishing
  • ISBN: 9781839212734