Book description
Finding patterns in massive event streams can be difficult, but learning how to find them doesn’t have to be. This unique hands-on guide shows you how to solve this and many other problems in large-scale data processing with simple, fun, and elegant tools that leverage Apache Hadoop. You’ll gain a practical, actionable view of big data by working with real data and real problems.
Perfect for beginners, this book’s approach will also appeal to experienced practitioners who want to brush up on their skills. Part I explains how Hadoop and MapReduce work, while Part II covers many analytic patterns you can use to process any data. As you work through several exercises, you’ll also learn how to use Apache Pig to process data.
- Learn the necessary mechanics of working with Hadoop, including how data and computation move around the cluster
- Dive into map/reduce mechanics and build your first map/reduce job in Python
- Understand how to run chains of map/reduce jobs in the form of Pig scripts
- Use a real-world dataset—baseball performance statistics—throughout the book
- Work with examples of several analytic patterns, and learn when and where you might use them
Table of contents
- Preface
- What This Book Covers
- Who This Book Is For
- Who This Book Is Not For
- What This Book Does Not Cover
- Theory: Chimpanzee and Elephant
- Practice: Hadoop
- Example Code
- A Note on Python and MrJob
- Helpful Reading
- Feedback
- Conventions Used in This Book
- Using Code Examples
- Safari® Books Online
- How to Contact Us
- I. Introduction: Theory and Tools
- 1. Hadoop Basics
- 2. MapReduce
- 3. A Quick Look into Baseball
- 4. Introduction to Pig
- II. Tactics: Analytic Patterns
- 5. Map-Only Operations
- Pattern in Use
- Eliminating Data
- Selecting Records That Satisfy a Condition: FILTER and Friends
- Project Only Chosen Columns by Name
- Transforming Records
- Transforming Records Individually Using FOREACH
- A Nested FOREACH Allows Intermediate Expressions
- Formatting a String According to a Template
- Assembling Literals with Complex Types
- Manipulating the Type of a Field
- Ints and Floats and Rounding, Oh My!
- Calling a User-Defined Function from an External Package
- Operations That Break One Table into Many
- Operations That Treat the Union of Several Tables as One
- Wrapping Up
- 6. Grouping Operations
- Grouping Records into a Bag by Key
- Group and Aggregate
- Calculating the Distribution of Numeric Values with a Histogram
- Pattern in Use
- Binning Data for a Histogram
- Choosing a Bin Size
- Interpreting Histograms and Quantiles
- Binning Data into Exponentially Sized Buckets
- Creating Pig Macros for Common Stanzas
- Distribution of Games Played
- Extreme Populations and Confounding Factors
- Don’t Trust Distributions at the Tails
- Calculating a Relative Distribution Histogram
- Reinjecting Global Values
- Calculating a Histogram Within a Group
- Dumping Readable Results
- The Summing Trick
- Wrapping Up
- References
- 7. Joining Tables
- Matching Records Between Tables (Inner Join)
- How a Join Works
- Enumerating a Many-to-Many Relationship
- Joining a Table with Itself (Self-Join)
- Joining Records Without Discarding Nonmatches (Outer Join)
- Selecting Only Records That Lack a Match in Another Table (Anti-Join)
- Selecting Only Records That Possess a Match in Another Table (Semi-Join)
- Wrapping Up
- 8. Ordering Operations
- 9. Duplicate and Unique Records
- Index
Product information
- Title: Big Data for Chimps
- Author(s):
- Release date: September 2015
- Publisher(s): O'Reilly Media, Inc.
- ISBN: 9781491923900