Book description
Finding patterns in massive event streams can be difficult, but learning how to find them doesn’t have to be. This unique hands-on guide shows you how to solve this and many other problems in large-scale data processing with simple, fun, and elegant tools that leverage Apache Hadoop. You’ll gain a practical, actionable view of big data by working with real data and real problems.
Perfect for beginners, this book’s approach will also appeal to experienced practitioners who want to brush up on their skills. Part I explains how Hadoop and MapReduce work, while Part II covers many analytic patterns you can use to process any data. As you work through several exercises, you’ll also learn how to use Apache Pig to process data.
- Learn the necessary mechanics of working with Hadoop, including how data and computation move around the cluster
- Dive into map/reduce mechanics and build your first map/reduce job in Python
- Understand how to run chains of map/reduce jobs in the form of Pig scripts
- Use a real-world dataset—baseball performance statistics—throughout the book
- Work with examples of several analytic patterns, and learn when and where you might use them
Table of contents
- Preface
- What This Book Covers
- Who This Book Is For
- Who This Book Is Not For
- What This Book Does Not Cover
- Theory: Chimpanzee and Elephant
- Practice: Hadoop
- Example Code
- A Note on Python and MrJob
- Helpful Reading
- Feedback
- Conventions Used in This Book
- Using Code Examples
- Safari® Books Online
- How to Contact Us
- I. Introduction: Theory and Tools
- 1. Hadoop Basics
- 2. MapReduce
- 3. A Quick Look into Baseball
- 4. Introduction to Pig
- II. Tactics: Analytic Patterns
- 5. Map-Only Operations
- Pattern in Use
- Eliminating Data
- Selecting Records That Satisfy a Condition: FILTER and Friends
- Project Only Chosen Columns by Name
- Transforming Records
- Transforming Records Individually Using FOREACH
- A Nested FOREACH Allows Intermediate Expressions
- Formatting a String According to a Template
- Assembling Literals with Complex Types
- Manipulating the Type of a Field
- Ints and Floats and Rounding, Oh My!
- Calling a User-Defined Function from an External Package
- Operations That Break One Table into Many
- Operations That Treat the Union of Several Tables as One
- Wrapping Up
- 6. Grouping Operations
- Grouping Records into a Bag by Key
- Group and Aggregate
- Calculating the Distribution of Numeric Values with a Histogram
- Pattern in Use
- Binning Data for a Histogram
- Choosing a Bin Size
- Interpreting Histograms and Quantiles
- Binning Data into Exponentially Sized Buckets
- Creating Pig Macros for Common Stanzas
- Distribution of Games Played
- Extreme Populations and Confounding Factors
- Don’t Trust Distributions at the Tails
- Calculating a Relative Distribution Histogram
- Reinjecting Global Values
- Calculating a Histogram Within a Group
- Dumping Readable Results
- The Summing Trick
- Wrapping Up
- References
- 7. Joining Tables
- Matching Records Between Tables (Inner Join)
- How a Join Works
- Enumerating a Many-to-Many Relationship
- Joining a Table with Itself (Self-Join)
- Joining Records Without Discarding Nonmatches (Outer Join)
- Selecting Only Records That Lack a Match in Another Table (Anti-Join)
- Selecting Only Records That Possess a Match in Another Table (Semi-Join)
- Wrapping Up
- 8. Ordering Operations
- 9. Duplicate and Unique Records
- Index
Product information
- Title: Big Data for Chimps
- Author(s):
- Release date: September 2015
- Publisher(s): O'Reilly Media, Inc.
- ISBN: 9781491923900