Book description
Design and implement data processing, lifecycle management, and analytic workflows with the cutting-edge toolbox of Hadoop 2
In Detail
This book introduces you to the world of building data-processing applications with the wide variety of tools supported by Hadoop 2. Starting with the core components of the framework, HDFS and YARN, this book will guide you through how to build applications using a variety of approaches.
You will learn how YARN completely changes the relationship between MapReduce and Hadoop and allows the latter to support more varied processing approaches and a broader array of applications. These include real-time processing with Apache Samza and iterative computation with Apache Spark. Next up, we discuss Apache Pig and the dataflow data model it provides. You will discover how to use Pig to analyze a Twitter dataset.
With this book, you will learn how tools such as Apache Hive, Apache Oozie, Hadoop Streaming, Apache Crunch, and Kite SDK can make development easier. The last part of this book discusses the likely future direction of major Hadoop components and how to get involved with the Hadoop community.
What You Will Learn
- Write distributed applications using the MapReduce framework
- Go beyond MapReduce and process data in real time with Samza and iteratively with Spark
- Familiarize yourself with data mining approaches that work with very large datasets
- Prototype applications on a VM and deploy them to a local cluster or to a cloud infrastructure (Amazon Web Services)
- Conduct batch and real-time data analysis using SQL-like tools
- Build data processing flows using Apache Pig and see how it enables the easy incorporation of custom functionality
- Define and orchestrate complex workflows and pipelines with Apache Oozie
- Manage your data lifecycle and changes over time
Table of contents
- Learning Hadoop 2
- Table of Contents
- Learning Hadoop 2
- Credits
- About the Authors
- About the Reviewers
- www.PacktPub.com
- Preface
- 1. Introduction
- 2. Storage
- 3. Processing – MapReduce and Beyond
- MapReduce
- Java API to MapReduce
- Writing MapReduce programs
- Walking through a run of a MapReduce job
- Startup
- Splitting the input
- Task assignment
- Task startup
- Ongoing JobTracker monitoring
- Mapper input
- Mapper execution
- Mapper output and reducer input
- Reducer input
- Reducer execution
- Reducer output
- Shutdown
- Input/Output
- InputFormat and RecordReader
- Hadoop-provided InputFormat
- Hadoop-provided RecordReader
- OutputFormat and RecordWriter
- Hadoop-provided OutputFormat
- Sequence files
- YARN
- YARN in the real world – Computation beyond MapReduce
- Summary
- 4. Real-time Computation with Samza
- Stream processing with Samza
- How Samza works
- Samza high-level architecture
- Samza's best friend – Apache Kafka
- YARN integration
- An independent model
- Hello Samza!
- Building a tweet parsing job
- The configuration file
- Getting Twitter data into Kafka
- Running a Samza job
- Samza and HDFS
- Windowing functions
- Multijob workflows
- Tweet sentiment analysis
- Stateful tasks
- Summary
- 5. Iterative Computation with Spark
- 6. Data Analysis with Apache Pig
- 7. Hadoop and SQL
- 8. Data Lifecycle Management
- 9. Making Development Easier
- 10. Running a Hadoop Cluster
- 11. Where to Go Next
- Index
Product information
- Title: Learning Hadoop 2
- Author(s):
- Release date: February 2015
- Publisher(s): Packt Publishing
- ISBN: 9781783285518