Mastering Hadoop 3

Book description

A comprehensive guide to mastering the most advanced Hadoop 3 concepts

Key Features

  • Get to grips with the newly introduced features and capabilities of Hadoop 3
  • Crunch and process data using MapReduce, YARN, and a host of tools within the Hadoop ecosystem
  • Sharpen your Hadoop skills with real-world case studies and code

Book Description

Apache Hadoop is one of the most popular big data solutions for distributed storage and for processing large volumes of data. With Hadoop 3, Apache promises to provide a high-performance, more fault-tolerant, and highly efficient big data processing platform, with a focus on improved scalability.

With this guide, you'll understand advanced concepts of the Hadoop ecosystem tools. You'll learn how Hadoop works internally, study advanced concepts of different ecosystem tools, discover solutions to real-world use cases, and understand how to secure your cluster. The book walks you through HDFS, YARN, MapReduce, and the concepts introduced in Hadoop 3. You'll be able to address common challenges such as choosing the right file formats and compression techniques, designing low-latency and reliable data ingestion with tools such as Kafka and Flume, and handling high data volumes. As you advance, you'll discover how to address the major challenges of building an enterprise-grade data platform, and how to use stream processing engines such as Spark, Flink, and Storm alongside Hadoop to fulfil your enterprise goals.

By the end of this book, you'll have a complete understanding of how components in the Hadoop ecosystem are effectively integrated to implement a fast and reliable data pipeline, and you'll be equipped to tackle a range of real-world problems in data pipelines.

What you will learn

  • Gain an in-depth understanding of distributed computing using Hadoop 3
  • Develop enterprise-grade applications using Apache Spark, Flink, and more
  • Build scalable and high-performance Hadoop data pipelines with security, monitoring, and data governance
  • Explore batch data processing patterns and how to model data in Hadoop
  • Master best practices for enterprises using, or planning to use, Hadoop 3 as a data platform
  • Understand security aspects of Hadoop, including authorization and authentication

Who this book is for

If you want to become a big data professional by mastering the advanced concepts of Hadoop, this book is for you. You'll also find this book useful if you're a Hadoop professional looking to strengthen your knowledge of the Hadoop ecosystem. Fundamental knowledge of the Java programming language and the basics of Hadoop are necessary to get started with this book.

Table of contents

  1. Title Page
  2. Copyright and Credits
    1. Mastering Hadoop 3
  3. Dedication
  4. About Packt
    1. Why subscribe?
    2. Packt.com
  5. Foreword
  6. Contributors
    1. About the authors
    2. About the reviewer
    3. Packt is searching for authors like you
  7. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
      1. Download the example code files
      2. Download the color images
      3. Code in action
      4. Conventions used
    4. Get in touch
      1. Reviews
  8. Section 1: Introduction to Hadoop 3
  9. Journey to Hadoop 3
    1. Hadoop origins and timelines
      1. Origins
        1. MapReduce origin
      2. Timelines
    2. Overview of Hadoop 3 and its features
    3. Hadoop logical view
    4. Hadoop distributions
      1. On-premise distributions
      2. Cloud distributions
    5. Points to remember
    6. Summary
  10. Deep Dive into the Hadoop Distributed File System
    1. Technical requirements
    2. Defining HDFS
    3. Deep dive into the HDFS architecture
      1. HDFS logical architecture
        1. Concepts of the data group
          1. Blocks
          2. Replication
      2. HDFS communication architecture
    4. NameNode internals
      1. Data locality and rack awareness
    5. DataNode internals
    6. Quorum Journal Manager (QJM)
    7. HDFS high availability in Hadoop 3.x
    8. Data management
      1. Metadata management
        1. Checkpoint using a secondary NameNode
      2. Data integrity
      3. HDFS Snapshots
      4. Data rebalancing
      5. Best practices for using balancer 
    9. HDFS reads and writes
      1. Write workflows
      2. Read workflows
      3. Short circuit reads
    10. Managing disk-skewed data in Hadoop 3.x
    11. Lazy persist writes in HDFS
    12. Erasure coding in Hadoop 3.x
      1. Advantages of erasure coding
      2. Disadvantages of erasure coding
    13. HDFS common interfaces
      1. HDFS read 
      2. HDFS write 
        1. HDFSFileSystemWrite.java
      3. HDFS delete 
    14. HDFS command reference
      1. File System commands
      2. Distributed copy
      3. Admin commands
    15. Points to remember
    16. Summary
  11. YARN Resource Management in Hadoop
    1. Architecture
      1. Resource Manager component
      2. Node Manager core
    2. Introduction to YARN job scheduling
    3. FIFO scheduler
    4. Capacity scheduler
      1. Configuring capacity scheduler 
    5. Fair scheduler
      1. Scheduling queues
      2. Configuring fair scheduler 
    6. Resource Manager high availability
      1. Architecture of RM high availability
      2. Configuring Resource Manager high availability
    7. Node labels
      1. Configuring node labels
    8. YARN Timeline server in Hadoop 3.x
      1. Configuring YARN Timeline server
    9. Opportunistic containers in Hadoop 3.x
      1. Configuring opportunistic containers
    10. Docker containers in YARN
      1. Configuring Docker containers
        1. Running the Docker image 
        2. Running the container 
    11. YARN REST APIs
      1. Resource Manager API
      2. Node Manager REST API 
    12. YARN command reference
      1. User command
        1. Application commands
        2. Logs command
      2. Administration commands
    13. Summary
  12. Internals of MapReduce
    1. Technical requirements
    2. Deep dive into the Hadoop MapReduce framework
    3. YARN and MapReduce
    4. MapReduce workflow in the Hadoop framework
    5. Common MapReduce patterns
      1. Summarization patterns
        1. Word count example
          1. Mapper
          2. Reducer
          3. Combiner
        2. Minimum and maximum
      2. Filtering patterns
        1. Top-k MapReduce implementation  
      3. Join pattern 
        1. Reduce side join
        2. Map side join (replicated join)
      4. Composite join
        1. Sorting and partitioning
    6. MapReduce use case
      1. MovieRatingMapper
      2. MovieRatingReducer 
      3. MovieRatingDriver
    7. Optimizing MapReduce
      1. Hardware configuration
      2. Operating system tuning
      3. Optimization techniques  
      4. Runtime configuration
      5. File System optimization
    8. Summary
  13. Section 2: Hadoop Ecosystem
  14. SQL on Hadoop
    1. Technical requirements
    2. Presto – introduction
      1. Presto architecture
      2. Presto installation and basic query execution
      3. Functions
        1. Conversion functions 
        2. Mathematical functions
        3. String functions
      4. Presto connectors
        1. Hive connector
        2. Kafka connector
          1. Configuration properties
        3. MySQL connector
        4. Redshift connector
        5. MongoDB connector 
    3. Hive
      1. Apache Hive architecture
      2. Installing and running Hive
      3. Hive queries 
        1. Hive table creation 
        2. Loading data to a table 
        3. The select query
      4. Choosing file format 
        1. Splittable and non-splittable file formats
          1. Query performance 
          2. Disk usage and compression
          3. Schema change 
      5. Introduction to HCatalog
      6. Introduction to HiveServer2
      7. Hive UDF
      8. Understanding ACID in Hive
        1. Example
      9. Partitioning and bucketing
        1. Prerequisite 
        2. Partitioning 
        3. Bucketing
      10. Best practices
    4. Impala
      1. Impala architecture
      2. Understanding the Impala interface and queries
      3. Practicing Impala
        1. Loading data from CSV files
      4. Best practices
    5. Summary 
  15. Real-Time Processing Engines
    1. Technical requirements
    2. Spark
      1. Apache Spark internals
        1. Spark driver
        2. Spark workers
        3. Cluster manager
        4. Spark application job flow
      2. Deep dive into resilient distributed datasets
        1. RDD features
        2. RDD operations
      3. Installing and running our first Spark job
        1. Spark-shell 
        2. Spark submit command 
        3. Maven dependencies
      4. Accumulators and broadcast variables
      5. Understanding dataframe and dataset
        1. Dataframes 
        2. Dataset
      6. Spark cluster managers
      7. Best practices
    3. Apache Flink
      1. Flink architecture
      2. Apache Flink ecosystem components
      3. Dataset and data stream API
        1. Dataset API
          1. Transformation
          2. Data sinks 
        2. Data streams 
      4. Exploring the table API
      5. Best practices
    4. Storm/Heron
      1. Deep dive into the Storm/Heron architecture
        1. Concept of a Storm application
        2. Introduction to Apache Heron
        3. Heron architecture
      2. Understanding Storm Trident
      3. Storm integrations
      4. Best practices
    5. Summary
  16. Widely Used Hadoop Ecosystem Components
    1. Technical requirements
    2. Pig
      1. Apache Pig architecture
      2. Installing and running Pig
      3. Introducing Pig Latin and Grunt
      4. Writing UDF in Pig
        1. Eval function
        2. Filter function 
        3. How to use custom UDF in Pig
      5. Pig with Hive
      6. Best practices
    3. HBase
      1. HBase architecture and its concepts
      2. CAP theorem
      3. HBase operations and examples
        1. Put operation
        2. Get operation
        3. Delete operation 
        4. Batch operation 
      4. Installation
        1. Local mode installation
        2. Distributed mode installation
          1. Master node configuration 
          2. Slave node configuration 
      5. Best practices
    4. Kafka
      1. Apache Kafka architecture
      2. Installing and running Apache Kafka
        1. Local mode installation 
        2. Distributed mode 
      3. Internals of producer and consumer
        1. Producer
        2. Consumer
      4. Writing producer and consumer applications
      5. Kafka Connect for ETL
      6. Best practices
    5. Flume
      1. Apache Flume architecture
      2. Deep dive into source, channel, and sink
        1. Sources
          1. Pollable source
          2. Event-driven source
        2. Channels
          1. Memory channel
          2. File channel
          3. Kafka channel
        3. Sinks
      3. Flume interceptor
        1. Timestamp interceptor
        2. Universally Unique Identifier (UUID) interceptor 
        3. Regex filter interceptor 
        4. Writing a custom interceptor 
      4. Use case – Twitter data
      5. Best practices
    6. Summary
  17. Section 3: Hadoop in the Real World
  18. Designing Applications in Hadoop
    1. Technical requirements
    2. File formats
      1. Understanding file formats
        1. Row format and column format
        2. Schema evolution
        3. Splittable versus non-splittable
        4. Compression
      2. Text
      3. Sequence file
      4. Avro
      5. Optimized Row Columnar (ORC)
      6. Parquet
    3. Data compression
      1. Types of data compression in Hadoop
        1. Gzip
        2. BZip2
        3. Lempel-Ziv-Oberhumer
        4. Snappy
      2. Compression format consideration
    4. Serialization
    5. Data ingestion
      1. Batch ingestion 
      2. Macro batch ingestion
      3. Real-time ingestion
    6. Data processing
      1. Batch processing
      2. Micro batch processing
      3. Real-time processing
    7. Common batch processing patterns
      1. Slowly changing dimension
        1. Slowly changing dimensions – type 1
        2. Slowly changing dimensions – type 2
      2. Duplicate record and small files
      3. Real-time lookup
    8. Airflow for orchestration
    9. Data governance
      1. Data governance pillars
        1. Metadata management
        2. Data life cycle management
        3. Data classification
    10. Summary
  19. Real-Time Stream Processing in Hadoop
    1. Technical requirements
    2. What are streaming datasets?
    3. Stream data ingestion
      1. Flume event-based data ingestion
      2. Kafka
    4. Common stream data processing patterns
      1. Unbounded data batch processing
    5. Streaming design considerations
      1. Latency
      2. Data availability, integrity, and security
      3. Unbounded data sources
      4. Data lookups
      5. Data formats
      6. Serializing your data
      7. Parallel processing
      8. Out-of-order events
      9. Message delivery semantics
    6. Micro-batch processing case study
    7. Real-time processing case study
      1. Main code
      2. Executing the code
    8. Summary
  20. Machine Learning in Hadoop
    1. Technical requirements
    2. Machine learning steps
    3. Common machine learning challenges
    4. Spark machine learning
      1. Transformer function
      2. Estimator
      3. Spark ML pipeline
    5. Hadoop and R
    6. Mahout
    7. Machine learning case study in Spark
      1. Sentiment analysis using Spark ML
    8. Summary
  21. Hadoop in the Cloud
    1. Technical requirements
    2. Logical view of Hadoop in the cloud
    3. Network
      1. Regions and availability zones
      2. VPC and subnet
      3. Security groups/firewall rules
      4. Practical example using AWS
    4. Managing resources
      1. CloudWatch
    5. Data pipelines
      1. Amazon Data Pipeline
      2. Airflow
        1. Airflow components
      3. Sample data pipeline DAG example
    6. High availability (HA)
      1. Server failure
        1. Server instance high availability
        2. Region and zone failure 
      2. Cloud storage high availability
        1. Amazon S3 outage case history
    7. Summary
  22. Hadoop Cluster Profiling
    1. Introduction to benchmarking and profiling
    2. HDFS
      1. DFSIO
    3. NameNode
      1. NNBench
      2. NNThroughputBenchmark
      3. Synthetic load generator (SLG)
    4. YARN
      1. Scheduler Load Simulator (SLS)
    5. Hive
      1. TPC-DS
      2. TPC-H
    6. Mixed workloads
      1. Rumen
      2. Gridmix
    7. Summary
  23. Section 4: Securing Hadoop
  24. Who Can Do What in Hadoop
    1. Hadoop security pillars
    2. System security
    3. Kerberos authentication
      1. Kerberos advantages
      2. Kerberos authentication flows
        1. Service authentication
        2. User authentication
        3. Communication between the authenticated client and the authenticated Hadoop service
        4. Symmetric key-based communication in Hadoop
    4. User authorization
      1. Ranger
      2. Sentry
    5. List of security features that have been worked on in Hadoop 3.0
    6. Summary
  25. Network and Data Security
    1. Securing Hadoop networks
      1. Segregating different types of networks
      2. Network firewalls
      3. Tools for securing Hadoop services' network perimeter
    2. Encryption
      1. Data in transit encryption
      2. Data at rest encryption
    3. Masking
    4. Filtering
      1. Row-level filtering
      2. Column-level filtering
    5. Summary
  26. Monitoring Hadoop
    1. General monitoring
      1. HDFS metrics
        1. NameNode metrics
        2. DataNode metrics
      2. YARN metrics
      3. ZooKeeper metrics
      4. Apache Ambari 
    2. Security monitoring
      1. Security information and event management
      2. How does SIEM work?
      3. Intrusion detection system 
      4. Intrusion prevention system
    3. Summary
  27. Other Books You May Enjoy
    1. Leave a review - let other readers know what you think

Product information

  • Title: Mastering Hadoop 3
  • Author(s): Chanchal Singh, Manish Kumar
  • Release date: February 2019
  • Publisher(s): Packt Publishing
  • ISBN: 9781788620444