Big Data Made Easy: A Working Guide to the Complete Hadoop Toolset

Book description

Many corporations are finding that the size of their data sets is outgrowing the capability of their systems to store and process them. The data is becoming too big to manage and use with traditional tools. The solution: implementing a big data system.

As Big Data Made Easy: A Working Guide to the Complete Hadoop Toolset shows, Apache Hadoop offers a scalable, fault-tolerant system for storing and processing data in parallel. It has a very rich toolset that allows for storage (Hadoop), configuration (YARN and ZooKeeper), collection (Nutch and Solr), processing (Storm, Pig, and Map Reduce), scheduling (Oozie), moving (Sqoop and Avro), monitoring (Chukwa, Ambari, and Hue), testing (Bigtop), and analysis (Hive).

The problem is that the Internet offers IT pros wading into big data many versions of the truth and some outright falsehoods born of ignorance. What is needed is a book just like this one: a wide-ranging but easily understood set of instructions to explain where to get Hadoop tools, what they can do, how to install them, how to configure them, how to integrate them, and how to use them successfully. And you need an expert who has worked in this area for a decade—someone just like author and big data expert Mike Frampton.

Big Data Made Easy approaches the problem of managing massive data sets from a systems perspective, and it explains the roles for each project (architect and tester, for example) and shows how the Hadoop toolset can be used at each system stage. It explains, in an easily understood manner and through numerous examples, how to use each tool. The book also explains the sliding scale of tools available depending on data size, and when and how to use them. Big Data Made Easy shows developers and architects, as well as testers and project managers, how to:

  • Store big data
  • Configure big data
  • Process big data
  • Schedule processes
  • Move data among SQL and NoSQL systems
  • Monitor data
  • Perform big data analytics
  • Report on big data processes and projects
  • Test big data systems
The best part, as Big Data Made Easy explains, is that this toolset is free. Anyone can download it and, with the help of this book, start to use it within a day. With the skills this book teaches under your belt, you will add value to your company or client immediately, not to mention your career.

    Table of contents

    1. Cover
    2. Title
    3. Copyright
    4. Dedication
    5. Contents at a Glance
    6. Contents
    7. About the Author
    8. About the Technical Reviewer
    9. Acknowledgments
    10. Introduction
    11. Chapter 1: The Problem with Data
      1. A Definition of “Big Data”
      2. The Potentials and Difficulties of Big Data
        1. Requirements for a Big Data System
        2. How Hadoop Tools Can Help
        3. My Approach
      3. Overview of the Big Data System
        1. Big Data Flow and Storage
        2. Benefits of Big Data Systems
      4. What’s in This Book
        1. Storage: Chapter 2
        2. Data Collection: Chapter 3
        3. Processing: Chapter 4
        4. Scheduling: Chapter 5
        5. Data Movement: Chapter 6
        6. Monitoring: Chapter 7
        7. Cluster Management: Chapter 8
        8. Analysis: Chapter 9
        9. ETL: Chapter 10
        10. Reports: Chapter 11
      5. Summary
    12. Chapter 2: Storing and Configuring Data with Hadoop, YARN, and ZooKeeper
      1. An Overview of Hadoop
        1. The Hadoop V1 Architecture
        2. The Differences in Hadoop V2
        3. The Hadoop Stack
        4. Environment Management
      2. Hadoop V1 Installation
        1. Hadoop 1.2.1 Single-Node Installation
        2. Setting up the Cluster
        3. Running a Map Reduce Job Check
        4. Hadoop User Interfaces
      3. Hadoop V2 Installation
        1. ZooKeeper Installation
        2. Hadoop MRv2 and YARN
      4. Hadoop Commands
        1. Hadoop Shell Commands
        2. Hadoop User Commands
        3. Hadoop Administration Commands
      5. Summary
    13. Chapter 3: Collecting Data with Nutch and Solr
      1. The Environment
        1. Stopping the Servers
        2. Changing the Environment Scripts
        3. Starting the Servers
      2. Architecture 1: Nutch 1.x
        1. Nutch Installation
        2. Solr Installation
        3. Running Nutch 1.8 with Hadoop
      3. Architecture 2: Nutch 2.x
        1. Nutch and Solr Configuration
        2. HBase Installation
        3. Gora Configuration
        4. Running the Nutch Crawl
        5. Potential Errors
      4. A Brief Comparison
      5. Summary
    14. Chapter 4: Processing Data with Map Reduce
      1. An Overview of the Word-Count Algorithm
      2. Map Reduce Native
        1. Java Word-Count Example 1
        2. Java Word-Count Example 2
        3. Comparing the Examples
      3. Map Reduce with Pig
        1. Installing Pig
        2. Running Pig
        3. Pig User-Defined Functions
      4. Map Reduce with Hive
        1. Installing Hive
        2. Hive Word-Count Example
      5. Map Reduce with Perl
      6. Summary
    15. Chapter 5: Scheduling and Workflow
      1. An Overview of Scheduling
        1. The Capacity Scheduler
        2. The Fair Scheduler
      2. Scheduling in Hadoop V1
        1. V1 Capacity Scheduler
        2. V1 Fair Scheduler
      3. Scheduling in Hadoop V2
        1. V2 Capacity Scheduler
        2. V2 Fair Scheduler
      4. Using Oozie for Workflow
        1. Installing Oozie
        2. The Mechanics of the Oozie Workflow
        3. Creating an Oozie Workflow
        4. Running an Oozie Workflow
        5. Scheduling an Oozie Workflow
      5. Summary
    16. Chapter 6: Moving Data
      1. Moving File System Data
        1. The Cat Command
        2. The CopyFromLocal Command
        3. The CopyToLocal Command
        4. The Cp Command
        5. The Get Command
        6. The Put Command
        7. The Mv Command
        8. The Tail Command
      2. Moving Data with Sqoop
        1. Check the Database
        2. Install Sqoop
        3. Use Sqoop to Import Data to HDFS
        4. Use Sqoop to Import Data to Hive
      3. Moving Data with Flume
        1. Install Flume
        2. A Simple Agent
        3. Running the Agent
      4. Moving Data with Storm
        1. Install ZeroMQ
        2. Install JZMQ
        3. Install Storm
        4. Start and Check ZooKeeper
        5. Run Storm
        6. An Example of Storm Topology
      5. Summary
    17. Chapter 7: Monitoring Data
      1. The Hue Browser
        1. Installing Hue
        2. Starting Hue
        3. Potential Errors
        4. Running Hue
      2. Ganglia
        1. Installing Ganglia
        2. Potential Errors
        3. The Ganglia Interface
      3. Nagios
        1. Installing Nagios
        2. Potential Errors
        3. The Nagios Interface
      4. Summary
    18. Chapter 8: Cluster Management
      1. The Ambari Cluster Manager
        1. Ambari Installation
      2. The Cloudera Cluster Manager
        1. Installing Cloudera Cluster Manager
        2. Running Cloudera Cluster Manager
      3. Apache Bigtop
        1. Installing Bigtop
        2. Running Bigtop Smoke Tests
      4. Summary
    19. Chapter 9: Analytics with Hadoop
      1. Cloudera Impala
        1. Installation of Impala
        2. Impala User Interfaces
        3. Uses of Impala
      2. Apache Hive
        1. Database Creation
        2. External Table Creation
        3. Hive UDFs
        4. Table Creation
        5. The SELECT Statement
        6. The WHERE Clause
        7. The Subquery
        8. Table Joins
        9. The INSERT Statement
        10. Organization of Table Data
      3. Apache Spark
        1. Installation of Spark
        2. Uses of Spark
        3. Spark SQL
      4. Summary
    20. Chapter 10: ETL with Hadoop
      1. Pentaho Data Integrator
        1. Installing Pentaho
        2. Running the Data Integrator
        3. Creating ETL
        4. Potential Errors
      2. Talend Open Studio
        1. Installing Open Studio for Big Data
        2. Running Open Studio for Big Data
        3. Creating the ETL
        4. Potential Errors
      3. Summary
    21. Chapter 11: Reporting with Hadoop
      1. Hunk
        1. Installing Hunk
        2. Running Hunk
        3. Creating Reports and Dashboards
        4. Potential Errors
      2. Talend Reports
        1. Installing Talend
        2. Running Talend
        3. Generating Reports
        4. Potential Errors
      3. Summary
    22. Index

    Product information

    • Title: Big Data Made Easy: A Working Guide to the Complete Hadoop Toolset
    • Author(s): Michael Frampton
    • Release date: December 2014
    • Publisher(s): Apress
    • ISBN: 9781484200940