Hadoop Essentials

Book description

Delve into the key concepts of Hadoop and get a thorough understanding of the Hadoop ecosystem

In Detail

This book introduces the components and tools of the Hadoop ecosystem in a simplified manner, and gives you the skills to use them for faster, more effective development of Hadoop projects.

Starting with the concepts of Hadoop YARN, MapReduce, HDFS, and other Hadoop ecosystem components, you will soon learn about topics such as MapReduce patterns, data management, and real-time data analysis using Hadoop. You will also get acquainted with many Hadoop ecosystem tools such as Hive, HBase, Pig, Sqoop, Flume, Storm, and Spark.

By the end of the book, you will be confident enough to begin working with Hadoop straight away and to apply the knowledge gained to your real-world scenarios.

What You Will Learn

  • Get introduced to Hadoop, big data, and the pillars of Hadoop such as HDFS, MapReduce, and YARN
  • Understand different use cases of Hadoop along with big data analytics and real-time analysis in Hadoop
  • Explore the Hadoop ecosystem tools and effectively use them for faster development and maintenance of a Hadoop project
  • Demonstrate YARN's capacity for database processing
  • Work with Hive, HBase, and Pig on Hadoop to solve your big data problems
  • Gain insights into widely used tools such as Sqoop, Flume, Storm, and Spark using practical examples

Table of contents

  1. Hadoop Essentials
    1. Table of Contents
    2. Hadoop Essentials
    3. Credits
    4. About the Author
    5. Acknowledgments
    6. About the Reviewers
    7. www.PacktPub.com
      1. Support files, eBooks, discount offers, and more
        1. Why subscribe?
        2. Free access for Packt account holders
    8. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    9. 1. Introduction to Big Data and Hadoop
      1. V's of big data
        1. Volume
        2. Velocity
        3. Variety
      2. Understanding big data
        1. NoSQL
          1. Types of NoSQL databases
        2. Analytical database
      3. Who is creating big data?
        1. Big data use cases
      4. Big data use case patterns
        1. Big data as a storage pattern
        2. Big data as a data transformation pattern
        3. Big data for a data analysis pattern
        4. Big data for data in a real-time pattern
        5. Big data for a low latency caching pattern
      5. Hadoop
        1. Hadoop history
        2. Description
        3. Advantages of Hadoop
        4. Uses of Hadoop
        5. Hadoop ecosystem
        6. Apache Hadoop
        7. Hadoop distributions
      6. Pillars of Hadoop
      7. Data access components
      8. Data storage component
      9. Data ingestion in Hadoop
      10. Streaming and real-time analysis
      11. Summary
    10. 2. Hadoop Ecosystem
      1. Traditional systems
        1. Database trend
      2. The Hadoop use cases
      3. Hadoop's basic data flow
      4. Hadoop integration
      5. The Hadoop ecosystem
      6. Distributed filesystem
        1. HDFS
      7. Distributed programming
      8. NoSQL databases
        1. Apache HBase
      9. Data ingestion
      10. Service programming
        1. Apache YARN
        2. Apache Zookeeper
      11. Scheduling
      12. Data analytics and machine learning
      13. System management
        1. Apache Ambari
      14. Summary
    11. 3. Pillars of Hadoop – HDFS, MapReduce, and YARN
      1. HDFS
        1. Features of HDFS
        2. HDFS architecture
          1. NameNode
          2. DataNode
          3. Checkpoint NameNode or Secondary NameNode
          4. BackupNode
        3. Data storage in HDFS
          1. Read pipeline
          2. Write pipeline
        4. Rack awareness
          1. Advantages of rack awareness in HDFS
        5. HDFS federation
          1. Limitations of HDFS 1.0
          2. The benefit of HDFS federation
        6. HDFS ports
        7. HDFS commands
      2. MapReduce
        1. The MapReduce architecture
          1. JobTracker
          2. TaskTracker
        2. Serialization data types
          1. The Writable interface
          2. WritableComparable interface
        3. The MapReduce example
        4. The MapReduce process
          1. Mapper
          2. Shuffle and sorting
          3. Reducer
        5. Speculative execution
        6. FileFormats
          1. InputFormats
          2. RecordReader
          3. OutputFormats
          4. RecordWriter
        7. Writing a MapReduce program
          1. Mapper code
          2. Reducer code
          3. Driver code
        8. Auxiliary steps
          1. Combiner
          2. Partitioner
            1. Custom partitioner
      3. YARN
        1. YARN architecture
          1. ResourceManager
          2. NodeManager
          3. ApplicationMaster
        2. Applications powered by YARN
      4. Summary
    12. 4. Data Access Components – Hive and Pig
      1. Need of a data processing tool on Hadoop
      2. Pig
        1. Pig data types
        2. The Pig architecture
          1. The logical plan
          2. The physical plan
          3. The MapReduce plan
        3. Pig modes
        4. Grunt shell
          1. Input data
          2. Loading data
          3. Dump
          4. Store
            1. FOREACH generate
          5. Filter
          6. Group By
          7. Limit
          8. Aggregation
          9. Cogroup
          10. DESCRIBE
          11. EXPLAIN
          12. ILLUSTRATE
      3. Hive
        1. The Hive architecture
          1. Metastore
          2. The Query compiler
          3. The Execution engine
        2. Data types and schemas
        3. Installing Hive
        4. Starting Hive shell
        5. HiveQL
          1. DDL (Data Definition Language) operations
          2. DML (Data Manipulation Language) operations
          3. The SQL operation
            1. Joins
            2. Aggregations
          4. Built-in functions
          5. Custom UDF (User Defined Functions)
        6. Managing tables – external versus managed
        7. SerDe
        8. Partitioning
        9. Bucketing
      4. Summary
    13. 5. Storage Component – HBase
      1. An Overview of HBase
      2. Advantages of HBase
      3. The Architecture of HBase
        1. MasterServer
        2. RegionServer
          1. WAL
          2. BlockCache
            1. LRUBlockCache
            2. SlabCache
            3. BucketCache
          3. Regions
          4. MemStore
          5. Zookeeper
      4. The HBase data model
        1. Logical components of a data model
        2. ACID properties
        3. The CAP theorem
      5. The Schema design
      6. The Write pipeline
      7. The Read pipeline
      8. Compaction
        1. The Compaction policy
        2. Minor compaction
        3. Major compaction
      9. Splitting
        1. Pre-Splitting
        2. Auto Splitting
        3. Forced Splitting
      10. Commands
        1. help
        2. Create
        3. List
        4. Put
        5. Scan
        6. Get
        7. Disable
        8. Drop
      11. HBase Hive integration
      12. Performance tuning
        1. Compression
        2. Filters
        3. Counters
        4. HBase coprocessors
      13. Summary
    14. 6. Data Ingestion in Hadoop – Sqoop and Flume
      1. Data sources
      2. Challenges in data ingestion
      3. Sqoop
      4. Connectors and drivers
      5. Sqoop 1 architecture
        1. Limitation of Sqoop 1
      6. Sqoop 2 architecture
      7. Imports
      8. Exports
      9. Apache Flume
        1. Reliability
      10. Flume architecture
        1. Multitier topology
          1. Flume master
          2. Flume nodes
          3. Components in Agent
            1. Source
            2. Sink
          4. Channels
            1. Memory channel
            2. File Channel
            3. JDBC Channel
      11. Examples of configuring Flume
        1. The Single agent example
        2. Multiple flows in an agent
          1. Configuring a multiagent setup
      12. Summary
    15. 7. Streaming and Real-time Analysis – Storm and Spark
      1. An introduction to Storm
        1. Features of Storm
        2. Physical architecture of Storm
        3. Data architecture of Storm
      2. Storm topology
      3. Storm on YARN
      4. Topology configuration example
        1. Spouts
        2. Bolts
        3. Topology
      5. An introduction to Spark
        1. Features of Spark
      6. Spark framework
        1. Spark SQL
        2. GraphX
        3. MLlib
        4. Spark streaming
      7. Spark architecture
        1. Directed Acyclic Graph engine
        2. Resilient Distributed Dataset
        3. Physical architecture
      8. Operations in Spark
        1. Transformations
        2. Actions
      9. Spark example
      10. Summary
    16. Index

Product information

  • Title: Hadoop Essentials
  • Author(s): Shiva Achari
  • Release date: April 2015
  • Publisher(s): Packt Publishing
  • ISBN: 9781784396688