MapReduce Design Patterns

by Donald Miner, Adam Shook

December 2012

Intermediate to advanced

247 pages

6h 48m

English

O'Reilly Media, Inc.

Dedication
Preface
Intended Audience; Pattern Format; The Examples in This Book; Conventions Used in This Book; Using Code Examples; Safari® Books Online; How to Contact Us; Acknowledgments
1. Design Patterns and MapReduce
Design Patterns; MapReduce History; MapReduce and Hadoop Refresher; Hadoop Example: Word Count; Pig and Hive
2. Summarization Patterns
Numerical Summarizations; Pattern Description; Intent; Motivation; Applicability; Structure; Consequences; Known uses; Resemblances; Performance analysis; Numerical Summarization Examples; Minimum, maximum, and count example; MinMaxCountTuple code; Mapper code; Reducer code; Combiner optimization; Data flow diagram; Average example; Mapper code; Reducer code; Combiner optimization; Data flow diagram; Median and standard deviation; Mapper code; Reducer code; Combiner optimization; Memory-conscious median and standard deviation; Mapper code; Reducer code; Combiner optimization; Data flow diagram
Inverted Index Summarizations; Pattern Description; Intent; Motivation; Applicability; Structure; Consequences; Performance analysis; Inverted Index Example; Wikipedia reference inverted index; Mapper code; Reducer code; Combiner optimization
Counting with Counters; Pattern Description; Intent; Motivation; Applicability; Structure; Consequences; Known uses; Performance analysis; Counting with Counters Example; Number of users per state; Mapper code; Driver code
3. Filtering Patterns
Filtering; Pattern Description; Intent; Motivation; Applicability; Structure; Consequences; Known uses; Resemblances; Performance analysis; Filtering Examples; Distributed grep; Mapper code; Simple Random Sampling; Mapper Code
Bloom Filtering; Pattern Description; Intent; Motivation; Applicability; Structure; Consequences; Known uses; Resemblances; Performance analysis; Bloom Filtering Examples; Hot list; Bloom filter training; Mapper code; HBase Query using a Bloom filter; Mapper Code
Top Ten; Pattern Description; Intent; Motivation; Applicability; Structure; Consequences; Known uses; Resemblances; Performance analysis; Top Ten Examples; Top ten users by reputation; Mapper code; Reducer code
Distinct; Pattern Description; Intent; Motivation; Applicability; Structure; Consequences; Known uses; Resemblances; Performance analysis; Distinct Examples; Distinct user IDs; Mapper code; Reducer code; Combiner optimization
4. Data Organization Patterns
Structured to Hierarchical; Pattern Description; Intent; Motivation; Applicability; Structure; Consequences; Known uses; Resemblances; Performance analysis; Structured to Hierarchical Examples; Post/comment building on StackOverflow; Driver code; Mapper code; Reducer code; Question/answer building on StackOverflow; Mapper code; Reducer code
Partitioning; Pattern Description; Intent; Motivation; Applicability; Structure; Consequences; Known uses; Resemblances; Performance analysis; Partitioning Examples; Partitioning users by last access date; Driver code; Mapper code; Partitioner code; Reducer code
Binning; Pattern Description; Intent; Motivation; Structure; Consequences; Resemblances; Performance analysis; Binning Examples; Binning by Hadoop-related tags; Driver code; Mapper code
Total Order Sorting; Pattern Description; Intent; Motivation; Applicability; Structure; Consequences; Resemblances; Performance analysis; Total Order Sorting Examples; Sort users by last visit; Driver code; Analyze mapper code; Order mapper code; Order reducer code
Shuffling; Pattern Description; Intent; Motivation; Structure; Consequences; Resemblances; Performance analysis; Shuffle Examples; Anonymizing StackOverflow comments; Mapper code; Reducer code
5. Join Patterns
A Refresher on Joins
Reduce Side Join; Pattern Description; Intent; Motivation; Applicability; Structure; Consequences; Resemblances; Performance analysis; Reduce Side Join Example; User and comment join; Driver code; User mapper code; Comment mapper code; Reducer code; Combiner optimization; Reduce Side Join with Bloom Filter; Reputable user and comment join; User mapper code; Comment mapper code
Replicated Join; Pattern Description; Intent; Motivation; Applicability; Structure; Consequences; Resemblances; Performance analysis; Replicated Join Examples; Replicated user comment example; Mapper code
Composite Join; Pattern Description; Intent; Motivation; Applicability; Structure; Consequences; Performance analysis; Composite Join Examples; Composite user comment join; Driver code; Mapper code; Reducer and combiner
Cartesian Product; Pattern Description; Intent; Motivation; Applicability; Structure; Consequences; Resemblances; Performance Analysis; Cartesian Product Examples; Comment Comparison; Input format code; Driver code; Record reader code; Mapper code
6. Metapatterns
Job Chaining; With the Driver; Job Chaining Examples; Basic job chaining; Job one mapper; Job one reducer; Job two mapper; Driver code; Parallel job chaining; Mapper code; Reducer code; Driver code; With Shell Scripting; Bash example; Bash script; Sample run; With JobControl; Job control example; Main method; Helper methods
Chain Folding; The ChainMapper and ChainReducer Approach; Chain Folding Example; Bin users by reputation; Parsing mapper code; Replicated join mapper code; Reducer code; Binning mapper code; Driver code
Job Merging; Job Merging Examples; Anonymous comments and distinct users; TaggedText WritableComparable; Merged mapper code; Merged reducer code; Driver code
7. Input and Output Patterns
Customizing Input and Output in Hadoop; InputFormat; RecordReader; OutputFormat; RecordWriter
Generating Data; Pattern Description; Intent; Motivation; Structure; Consequences; Resemblances; Performance analysis; Generating Data Examples; Generating random StackOverflow comments; Driver code; InputSplit code; InputFormat code; RecordReader code
External Source Output; Pattern Description; Intent; Motivation; Structure; Consequences; Performance analysis; External Source Output Example; Writing to Redis instances; OutputFormat code; RecordWriter code; Mapper Code; Driver Code
External Source Input; Pattern Description; Intent; Motivation; Structure; Consequences; Performance analysis; External Source Input Example; Reading from Redis Instances; InputSplit code; InputFormat code; RecordReader code; Driver code
Partition Pruning; Pattern Description; Intent; Motivation; Structure; Consequences; Resemblances; Performance analysis; Partition Pruning Examples; Partitioning by last access date to Redis instances; Custom WritableComparable code; OutputFormat code; RecordWriter code; Mapper code; Driver code; Querying for user reputation by last access date; InputSplit code; InputFormat code; RecordReader code; Driver code
8. Final Thoughts and the Future of Design Patterns
Trends in the Nature of Data; Images, Audio, and Video; Streaming Data; The Effects of YARN; Patterns as a Library or Component; How You Can Help

A. Bloom Filters
Overview; Use Cases; Representing a Data Set; Reduce Queries to External Database; Google BigTable; Downsides; Tweaking Your Bloom Filter
Index
About the Authors
Colophon
Copyright

Content preview from MapReduce Design Patterns

Chapter 1. Design Patterns and MapReduce

MapReduce is a computing paradigm for processing data that resides on hundreds of computers, and it has been popularized recently by Google, Hadoop, and many others. The paradigm is extraordinarily powerful, but it does not provide a general solution to what many are calling “big data”: it works particularly well on some problems, while others are more challenging. This book will teach you which problems are amenable to the MapReduce paradigm, as well as how to use it effectively.

At first glance, many people do not realize that MapReduce is more of a framework than a tool. You have to fit your solution into the framework of map and reduce, which in some situations might be challenging. MapReduce is not a feature, but rather a constraint.

This makes problem solving both easier and harder. It provides clear boundaries for what you can and cannot do, leaving you fewer options to consider than you may be used to. At the same time, figuring out how to solve a problem within those constraints requires cleverness and a change in thinking.
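To make the constraint concrete, here is a minimal sketch of the classic word count expressed as a map step, a grouping step, and a reduce step. It is plain Python that runs in a single process, not Hadoop code, and the function names (map_words, shuffle, reduce_counts, word_count) are hypothetical names chosen for illustration. The shuffle function only stands in for the grouping that a MapReduce framework would perform between the two phases.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce step: collapse all counts for one word into a single total."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_words(line))
    grouped = shuffle(pairs)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog"]))
```

The point of the sketch is the shape of the solution: the map step sees one record at a time, and the reduce step sees one key with all of its values, so any logic that needs a wider view has to be restructured to fit.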

Learning MapReduce is a lot like learning recursion for the first time: it is challenging to find the recursive solution to the problem, but when it comes to you, it is clear, concise, and elegant. In many situations you have to be conscious of the system resources used by the MapReduce job, especially network utilization within the cluster. The tradeoff of being confined to the MapReduce framework is the ability ...

ISBN: 9781449341954
