Sams Teach Yourself Hadoop in 24 Hours

Book Description

Apache Hadoop is the technology at the heart of the Big Data revolution, and Hadoop skills are in enormous demand. Now, in just 24 lessons of one hour or less, you can learn the skills and techniques you'll need to deploy each key component of a Hadoop platform in your local environment or in the cloud, building a fully functional Hadoop cluster and using it with real programs and datasets. Each short, easy lesson builds on all that's come before, helping you master Hadoop's essentials and extend Hadoop to meet your unique challenges. Sams Teach Yourself Hadoop in 24 Hours covers all this, and much more:

  • Understanding Hadoop and the Hadoop Distributed File System (HDFS)

  • Importing data into Hadoop and processing it there

  • Mastering basic MapReduce Java programming, and using advanced MapReduce API concepts

  • Making the most of Apache Pig and Apache Hive

  • Implementing and administering YARN

  • Taking advantage of the full Hadoop ecosystem

  • Managing Hadoop clusters with Apache Ambari

  • Working with the Hadoop User Environment (HUE)

  • Scaling, securing, and troubleshooting Hadoop environments

  • Integrating Hadoop into the enterprise

  • Deploying Hadoop in the cloud

  • Getting started with Apache Spark

Step-by-step instructions walk you through common questions, issues, and tasks; Q-and-As, Quizzes, and Exercises build and test your knowledge; "Did You Know?" tips offer insider advice and shortcuts; and "Watch Out!" alerts help you avoid pitfalls. By the time you're finished, you'll be comfortable using Apache Hadoop to solve a wide spectrum of Big Data problems.

    Table of Contents

    1. About This E-Book
    2. Title Page
    3. Copyright Page
    4. Contents at a Glance
    5. Table of Contents
    6. Preface
    7. About the Author
    8. Acknowledgments
    9. Part I: Getting Started with Hadoop
      1. Hour 1: Introducing Hadoop
        1. Hadoop and a Brief History of Big Data
        2. Hadoop Explained
        3. The Commercial Hadoop Landscape
        4. Typical Hadoop Use Cases
        5. Summary
        6. Q&A
        7. Workshop
      2. Hour 2: Understanding the Hadoop Cluster Architecture
        1. HDFS Cluster Processes
        2. YARN Cluster Processes
        3. Hadoop Cluster Architecture and Deployment Modes
        4. Summary
        5. Q&A
        6. Workshop
      3. Hour 3: Deploying Hadoop
        1. Installation Platforms and Prerequisites
        2. Installing Hadoop
        3. Deploying Hadoop in the Cloud
        4. Summary
        5. Q&A
        6. Workshop
      4. Hour 4: Understanding the Hadoop Distributed File System (HDFS)
        1. HDFS Overview
        2. Review of the HDFS Roles
        3. NameNode Metadata
        4. SecondaryNameNode Role
        5. Interacting with HDFS
        6. Summary
        7. Q&A
        8. Workshop
      5. Hour 5: Getting Data into Hadoop
        1. Data Ingestion Using Apache Flume
        2. Ingesting Data from a Database using Sqoop
        3. Data Ingestion Using HDFS RESTful Interfaces
        4. Data Ingestion Considerations
        5. Summary
        6. Q&A
        7. Workshop
      6. Hour 6: Understanding Data Processing in Hadoop
        1. Introduction to MapReduce
        2. MapReduce Explained
        3. Word Count: The “Hello, World” of MapReduce
        4. MapReduce in Hadoop
        5. Summary
        6. Q&A
        7. Workshop
    10. Part II: Using Hadoop
      1. Hour 7: Programming MapReduce Applications
        1. Introducing the Java MapReduce API
        2. Writing a MapReduce Program in Java
        3. Advanced MapReduce API Concepts
        4. Using the MapReduce Streaming API
        5. Summary
        6. Q&A
        7. Workshop
      2. Hour 8: Analyzing Data in HDFS Using Apache Pig
        1. Introducing Pig
        2. Pig Latin Basics
        3. Loading Data into Pig
        4. Filtering, Projecting, and Sorting Data using Pig
        5. Built-in Functions in Pig
        6. Summary
        7. Q&A
        8. Workshop
      3. Hour 9: Using Advanced Pig
        1. Grouping Data in Pig
        2. Multiple Dataset Processing in Pig
        3. User-Defined Functions in Pig
        4. Automating Pig Using Macros and Variables
        5. Summary
        6. Q&A
        7. Workshop
      4. Hour 10: Analyzing Data Using Apache Hive
        1. Introducing Hive
        2. Creating Hive Objects
        3. Analyzing Data with Hive
        4. Data Output with Hive
        5. Summary
        6. Q&A
        7. Workshop
      5. Hour 11: Using Advanced Hive
        1. Automating Hive
        2. Complex Datatypes in Hive
        3. Text Processing Using Hive
        4. Optimizing and Managing Queries in Hive
        5. Summary
        6. Q&A
        7. Workshop
      6. Hour 12: Using SQL-on-Hadoop Solutions
        1. What Is SQL on Hadoop?
        2. Columnar Storage in Hadoop
        3. Introduction to Impala
        4. Introduction to Tez
        5. Introduction to HAWQ and Drill
        6. Summary
        7. Q&A
        8. Workshop
      7. Hour 13: Introducing Apache Spark
        1. Introducing Spark
        2. Spark Architecture
        3. Resilient Distributed Datasets in Spark
        4. Transformations and Actions in Spark
        5. Extensions to Spark
        6. Summary
        7. Q&A
        8. Workshop
      8. Hour 14: Using the Hadoop User Environment (HUE)
        1. Introducing HUE
        2. Installing, Configuring and Using HUE
        3. Summary
        4. Q&A
        5. Workshop
      9. Hour 15: Introducing NoSQL
        1. Introduction to NoSQL
        2. Introducing HBase
        3. Introducing Apache Cassandra
        4. Other NoSQL Implementations and the Future of NoSQL
        5. Summary
        6. Q&A
        7. Workshop
    11. Part III: Managing Hadoop
      1. Hour 16: Managing YARN
        1. YARN Revisited
        2. Administering YARN
        3. Application Scheduling in YARN
        4. Summary
        5. Q&A
        6. Workshop
      2. Hour 17: Working with the Hadoop Ecosystem
        1. Hadoop Ecosystem Overview
        2. Introduction to Oozie
        3. Stream Processing and Messaging in Hadoop
        4. Infrastructure and Security Projects
        5. Machine Learning, Visualization, and More Data Analysis Tools
        6. Summary
        7. Q&A
        8. Workshop
      3. Hour 18: Using Cluster Management Utilities
        1. Cluster Management Overview
        2. Deploying Clusters and Services Using Management Tools
        3. Configuration and Service Management Using Management Tools
        4. Monitoring, Troubleshooting, and Securing Hadoop Clusters Using Cluster Management Utilities
        5. Getting Started with the Cluster Management Utilities
        6. Summary
        7. Q&A
        8. Workshop
      4. Hour 19: Scaling Hadoop
        1. Linear Scalability with Hadoop
        2. Adding Nodes to your Hadoop Cluster
        3. Decommissioning Nodes from your Cluster
        4. Rebalancing a Hadoop Cluster
        5. Benchmarking Hadoop
        6. Summary
        7. Q&A
        8. Workshop
      5. Hour 20: Understanding Cluster Configuration
        1. Configuration in Hadoop
        2. HDFS Configuration Parameters
        3. YARN Configuration Parameters
        4. Ecosystem Component Configuration
        5. Summary
        6. Q&A
        7. Workshop
      6. Hour 21: Understanding Advanced HDFS
        1. HDFS Rack Awareness
        2. HDFS High Availability
        3. HDFS Federation
        4. HDFS Caching, Snapshotting, and Archiving
        5. Summary
        6. Q&A
        7. Workshop
      7. Hour 22: Securing Hadoop
        1. Hadoop Security Basics
        2. Securing Hadoop with Kerberos
        3. Perimeter Security Using Apache Knox
        4. Role-Based Access Control Using Ranger and Sentry
        5. Summary
        6. Q&A
        7. Workshop
      8. Hour 23: Administering, Monitoring and Troubleshooting Hadoop
        1. Administering Hadoop
        2. Troubleshooting Hadoop
        3. System and Application Monitoring in Hadoop
        4. Best Practices and Other Information Sources
        5. Summary
        6. Q&A
        7. Workshop
      9. Hour 24: Integrating Hadoop into the Enterprise
        1. Hadoop and the Data Center
        2. Use Case: Data Warehouse/ETL Offload
        3. Use Case: Event Storage and Processing
        4. Use Case: Predictive Analytics
        5. Summary
        6. Q&A
        7. Workshop
    12. Index
    13. Code Snippets