Book description
Discover how Apache Hadoop can unleash the power of your data. This comprehensive resource shows you how to build and maintain reliable, scalable, distributed systems with the Hadoop framework -- an open source implementation of MapReduce, the algorithm on which Google built its empire. Programmers will find details for analyzing datasets of any size, and administrators will learn how to set up and run Hadoop clusters.
This revised edition covers recent changes to Hadoop, including new features such as Hive, Sqoop, and Avro. It also provides illuminating case studies that illustrate how Hadoop is used to solve specific problems. Looking to get the most out of your data? This is your book.
- Use the Hadoop Distributed File System (HDFS) for storing large datasets, then run distributed computations over those datasets with MapReduce
- Become familiar with Hadoop’s data and I/O building blocks for compression, data integrity, serialization, and persistence
- Discover common pitfalls and advanced features for writing real-world MapReduce programs
- Design, build, and administer a dedicated Hadoop cluster, or run Hadoop in the cloud
- Use Pig, a high-level query language for large-scale data processing
- Analyze datasets with Hive, Hadoop’s data warehousing system
- Take advantage of HBase, Hadoop’s database for structured and semi-structured data
- Learn ZooKeeper, a toolkit of coordination primitives for building distributed systems
"Now you have the opportunity to learn about Hadoop from a master -- not only of the technology, but also of common sense and plain talk."
--Doug Cutting, Cloudera
Table of contents
- Hadoop: The Definitive Guide
- Dedication
- A Note Regarding Supplemental Files
- Foreword
- Preface
- 1. Meet Hadoop
- 2. MapReduce
- 3. The Hadoop Distributed Filesystem
- 4. Hadoop I/O
- 5. Developing a MapReduce Application
- 6. How MapReduce Works
- 7. MapReduce Types and Formats
- 8. MapReduce Features
- 9. Setting Up a Hadoop Cluster
  - Cluster Specification
  - Cluster Setup and Installation
  - SSH Configuration
  - Hadoop Configuration
  - Security
  - Benchmarking a Hadoop Cluster
  - Hadoop in the Cloud
- 10. Administering Hadoop
- 11. Pig
- 12. Hive
- 13. HBase
- 14. ZooKeeper
- 15. Sqoop
- 16. Case Studies
  - Hadoop Usage at Last.fm
  - Hadoop and Hive at Facebook
  - Nutch Search Engine
  - Log Processing at Rackspace
  - Cascading
  - TeraByte Sort on Apache Hadoop
  - Using Pig and Wukong to Explore Billion-edge Network Graphs
- A. Installing Apache Hadoop
- B. Cloudera’s Distribution for Hadoop
- C. Preparing the NCDC Weather Data
- Index
- About the Author
- Colophon
- Copyright
Product information
- Title: Hadoop: The Definitive Guide, 2nd Edition
- Author(s):
- Release date: October 2010
- Publisher(s): O'Reilly Media, Inc.
- ISBN: 9781449389734