Apache Hadoop 3 Quick Start Guide

Book description

A fast-paced guide that will help you learn about Apache Hadoop 3 and its ecosystem

Key Features

  • Set up, configure, and get started with Hadoop to gain useful insights from large datasets
  • Work with the different components of Hadoop such as MapReduce, HDFS and YARN
  • Learn about the new features introduced in Hadoop 3

Book Description

Apache Hadoop is a widely used distributed data platform. It enables large datasets to be processed efficiently across a cluster of machines, instead of relying on one large computer to store and process the data. This book will get you started with the Hadoop ecosystem and introduce you to the main technical topics, including MapReduce, YARN, and HDFS.

The book begins with an overview of big data and Apache Hadoop. Then, you will set up a pseudo-distributed Hadoop development environment and a multi-node enterprise Hadoop cluster. You will see how a parallel programming paradigm such as MapReduce can solve many complex data processing problems.

The book also covers the important aspects of the big data software development lifecycle, including quality assurance and control, performance, administration, and monitoring.

You will then learn about the Hadoop ecosystem and tools such as Kafka, Sqoop, Flume, Pig, Hive, and HBase. Finally, you will look at advanced topics, including real-time streaming using Apache Storm and data analytics using Apache Spark.

By the end of the book, you will be well versed in the different configurations of a Hadoop 3 cluster.

What you will learn

  • Store and analyze data at scale using HDFS, MapReduce and YARN
  • Install and configure Hadoop 3 in different modes
  • Use YARN effectively to run different applications on a Hadoop-based platform
  • Understand how a Hadoop cluster is managed and monitored
  • Consume streaming data using Storm, and then analyze it using Spark
  • Explore Apache Hadoop ecosystem components, such as Flume, Sqoop, HBase, Hive, and Kafka

Who this book is for

Aspiring big data professionals who want to learn the essentials of Hadoop 3 will find this book useful. Existing Hadoop users who want to get up to speed with the new features introduced in Hadoop 3 will also benefit from it. Knowledge of Java programming is an added advantage.

Table of contents

  1. Title Page
  2. Copyright and Credits
    1. Apache Hadoop 3 Quick Start Guide
  3. Dedication
  4. Packt Upsell
    1. Why subscribe?
    2. Packt.com
  5. Contributors
    1. About the author
    2. About the reviewer
    3. Packt is searching for authors like you
  6. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
      1. Download the example code files
      2. Code in action
      3. Conventions used
    4. Get in touch
      1. Reviews
  7. Hadoop 3.0 - Background and Introduction
    1. How it all started 
    2. What Hadoop is and why it is important
    3. How Apache Hadoop works 
      1. Resource Manager
      2. Node Manager
      3. YARN Timeline Service version 2
      4. NameNode
      5. DataNode
    4. Hadoop 3.0 releases and new features
    5. Choosing the right Hadoop distribution
      1. Cloudera Hadoop distribution
      2. Hortonworks Hadoop distribution
      3. MapR Hadoop distribution
    6. Summary
  8. Planning and Setting Up Hadoop Clusters
    1. Technical requirements
    2. Prerequisites for Hadoop setup
      1. Preparing hardware for Hadoop
      2. Readying your system
      3. Installing the prerequisites
      4. Working across nodes without passwords (SSH in keyless)
      5. Downloading Hadoop
    3. Running Hadoop in standalone mode
    4. Setting up a pseudo Hadoop cluster
    5. Planning and sizing clusters
      1. Initial load of data
      2. Organizational data growth
      3. Workload and computational requirements
      4. High availability and fault tolerance
      5. Velocity of data and other factors
    6. Setting up Hadoop in cluster mode
      1. Installing and configuring HDFS in cluster mode
      2. Setting up YARN in cluster mode
    7. Diagnosing the Hadoop cluster
      1. Working with log files
      2. Cluster debugging and tuning tools
        1. JPS (Java Virtual Machine Process Status)
        2. JStack
    8. Summary
  9. Deep Dive into the Hadoop Distributed File System
    1. Technical requirements
    2. How HDFS works
    3. Key features of HDFS
      1. Achieving multi tenancy in HDFS
      2. Snapshots of HDFS
      3. Safe mode
      4. Hot swapping
      5. Federation
      6. Intra-DataNode balancer
    4. Data flow patterns of HDFS
      1. HDFS as primary storage with cache
      2. HDFS as archival storage
      3. HDFS as historical storage
      4. HDFS as a backbone
    5. HDFS configuration files
    6. Hadoop filesystem CLIs
      1. Working with HDFS user commands
      2. Working with Hadoop shell commands
    7. Working with data structures in HDFS
      1. Understanding SequenceFile
      2. MapFile and its variants
    8. Summary
  10. Developing MapReduce Applications
    1. Technical requirements
    2. How MapReduce works
      1. What is MapReduce?
      2. An example of MapReduce
    3. Configuring a MapReduce environment
      1. Working with mapred-site.xml
      2. Working with Job history server
        1. RESTful APIs for Job history server
    4. Understanding Hadoop APIs and packages
    5. Setting up a MapReduce project
      1. Setting up an Eclipse project
    6. Deep diving into MapReduce APIs
      1. Configuring MapReduce jobs
      2. Understanding input formats
      3. Understanding output formats
      4. Working with Mapper APIs
      5. Working with the Reducer API
    7. Compiling and running MapReduce jobs
      1. Triggering the job remotely
      2. Using Tool and ToolRunner
      3. Unit testing of MapReduce jobs
      4. Failure handling in MapReduce
    8. Streaming in MapReduce programming
    9. Summary
  11. Building Rich YARN Applications
    1. Technical requirements
    2. Understanding YARN architecture
    3. Key features of YARN
      1. Resource models in YARN
      2. YARN federation
      3. RESTful APIs
    4. Configuring the YARN environment in a cluster
    5. Working with YARN distributed CLI
    6. Deep dive with YARN application framework
      1. Setting up YARN projects
      2. Writing your YARN application with YarnClient
      3. Writing a custom application master
    7. Building and monitoring a YARN application on a cluster
      1. Building a YARN application
      2. Monitoring your application
    8. Summary
  12. Monitoring and Administration of a Hadoop Cluster
    1. Roles and responsibilities of Hadoop administrators
    2. Planning your distributed cluster
      1. Hadoop applications, ports, and URLs
    3. Resource management in Hadoop
      1. Fair Scheduler
      2. Capacity Scheduler
    4. High availability of Hadoop
      1. High availability for NameNode
      2. High availability for Resource Manager
    5. Securing Hadoop clusters
      1. Securing your Hadoop application
      2. Securing your data in HDFS
    6. Performing routine tasks
      1. Working with safe mode
      2. Archiving in Hadoop
      3. Commissioning and decommissioning of nodes
      4. Working with Hadoop Metric
    7. Summary
  13. Demystifying Hadoop Ecosystem Components
    1. Technical requirements
    2. Understanding Hadoop's Ecosystem
    3. Working with Apache Kafka
    4. Writing Apache Pig scripts
      1. Pig Latin
      2. User-defined functions (UDFs)
    5. Transferring data with Sqoop
    6. Writing Flume jobs
    7. Understanding Hive
      1. Interacting with Hive – CLI, beeline, and web interface
      2. Hive as a transactional system
    8. Using HBase for NoSQL storage
    9. Summary
  14. Advanced Topics in Apache Hadoop
    1. Technical requirements
    2. Hadoop use cases in industries
      1. Healthcare
      2. Oil and Gas
      3. Finance 
      4. Government Institutions
      5. Telecommunications
      6. Retail
      7. Insurance
    3. Advanced Hadoop data storage file formats
      1. Parquet
      2. Apache ORC
      3. Avro 
    4. Real-time streaming with Apache Storm
    5. Data analytics with Apache Spark
    6. Summary
  15. Other Books You May Enjoy
    1. Leave a review - let other readers know what you think

Product information

  • Title: Apache Hadoop 3 Quick Start Guide
  • Author(s): Hrishikesh Vijay Karambelkar
  • Release date: October 2018
  • Publisher(s): Packt Publishing
  • ISBN: 9781788999830