Hadoop Application Architectures

by Mark Grover, Ted Malaska, Jonathan Seidman, Gwen Shapira

July 2015

Intermediate to advanced

250 pages

10h 47m

English

O'Reilly Media, Inc.

Foreword
Preface
A Note About the Code Examples; Who Should Read This Book; Why We Wrote This Book; Navigating This Book; Conventions Used in This Book; Using Code Examples; Safari® Books Online; How to Contact Us; Acknowledgments
I. Architectural Considerations for Hadoop Applications
1. Data Modeling in Hadoop
Data Storage Options; Standard File Formats; Hadoop File Types; Serialization Formats; Columnar Formats; Compression; HDFS Schema Design; Location of HDFS Files; Advanced HDFS Schema Design; HDFS Schema Design Summary; HBase Schema Design; Row Key; Timestamp; Hops; Tables and Regions; Using Columns; Using Column Families; Time-to-Live; Managing Metadata; What Is Metadata?; Why Care About Metadata?; Where to Store Metadata?; Examples of Managing Metadata; Limitations of the Hive Metastore and HCatalog; Other Ways of Storing Metadata; Conclusion
2. Data Movement
Data Ingestion Considerations; Timeliness of Data Ingestion; Incremental Updates; Access Patterns; Original Source System and Data Structure; Transformations; Network Bottlenecks; Network Security; Push or Pull; Failure Handling; Level of Complexity; Data Ingestion Options; File Transfers; Considerations for File Transfers versus Other Ingest Methods; Sqoop: Batch Transfer Between Hadoop and Relational Databases; Flume: Event-Based Data Collection and Processing; Kafka; Data Extraction; Conclusion
3. Processing Data in Hadoop
MapReduce; MapReduce Overview; Example for MapReduce; When to Use MapReduce; Spark; Spark Overview; Overview of Spark Components; Basic Spark Concepts; Benefits of Using Spark; Spark Example; When to Use Spark; Abstractions; Pig; Pig Example; When to Use Pig; Crunch; Crunch Example; When to Use Crunch; Cascading; Cascading Example; When to Use Cascading; Hive; Hive Overview; Example of Hive Code; When to Use Hive; Impala; Impala Overview; Speed-Oriented Design; Impala Example; When to Use Impala; Conclusion
4. Common Hadoop Processing Patterns
Pattern: Removing Duplicate Records by Primary Key; Data Generation for Deduplication Example; Code Example: Spark Deduplication in Scala; Code Example: Deduplication in SQL; Pattern: Windowing Analysis; Data Generation for Windowing Analysis Example; Code Example: Peaks and Valleys in Spark; Code Example: Peaks and Valleys in SQL; Pattern: Time Series Modifications; Use HBase and Versioning; Use HBase with a RowKey of RecordKey and StartTime; Use HDFS and Rewrite the Whole Table; Use Partitions on HDFS for Current and Historical Records; Data Generation for Time Series Example; Code Example: Time Series in Spark; Code Example: Time Series in SQL; Conclusion
5. Graph Processing on Hadoop
What Is a Graph?; What Is Graph Processing?; How Do You Process a Graph in a Distributed System?; The Bulk Synchronous Parallel Model; BSP by Example; Giraph; Read and Partition the Data; Batch Process the Graph with BSP; Write the Graph Back to Disk; Putting It All Together; When Should You Use Giraph?; GraphX; Just Another RDD; GraphX Pregel Interface; vprog(); sendMessage(); mergeMessage(); Which Tool to Use?; Conclusion
6. Orchestration
Why We Need Workflow Orchestration; The Limits of Scripting; The Enterprise Job Scheduler and Hadoop; Orchestration Frameworks in the Hadoop Ecosystem; Oozie Terminology; Oozie Overview; Oozie Workflow; Workflow Patterns; Point-to-Point Workflow; Fan-Out Workflow; Capture-and-Decide Workflow; Parameterizing Workflows; Classpath Definition; Scheduling Patterns; Frequency Scheduling; Time and Data Triggers; Executing Workflows; Conclusion
7. Near-Real-Time Processing with Hadoop
Stream Processing; Apache Storm; Storm High-Level Architecture; Storm Topologies; Tuples and Streams; Spouts and Bolts; Stream Groupings; Reliability of Storm Applications; Exactly-Once Processing; Fault Tolerance; Integrating Storm with HDFS; Integrating Storm with HBase; Storm Example: Simple Moving Average; Evaluating Storm; Trident; Trident Example: Simple Moving Average; Evaluating Trident; Spark Streaming; Overview of Spark Streaming; Spark Streaming Example: Simple Count; Spark Streaming Example: Multiple Inputs; Spark Streaming Example: Maintaining State; Spark Streaming Example: Windowing; Spark Streaming Example: Streaming versus ETL Code; Evaluating Spark Streaming; Flume Interceptors; Which Tool to Use?; Low-Latency Enrichment, Validation, Alerting, and Ingestion; NRT Counting, Rolling Averages, and Iterative Processing; Complex Data Pipelines; Conclusion

II. Case Studies
8. Clickstream Analysis
Defining the Use Case; Using Hadoop for Clickstream Analysis; Design Overview; Storage; Ingestion; The Client Tier; The Collector Tier; Processing; Data Deduplication; Sessionization; Analyzing; Orchestration; Conclusion
9. Fraud Detection
Continuous Improvement; Taking Action; Architectural Requirements of Fraud Detection Systems; Introducing Our Use Case; High-Level Design; Client Architecture; Profile Storage and Retrieval; Caching; HBase Data Definition; Delivering Transaction Status: Approved or Denied?; Ingest; Path Between the Client and Flume; Near-Real-Time and Exploratory Analytics; Near-Real-Time Processing; Exploratory Analytics; What About Other Architectures?; Flume Interceptors; Kafka to Storm or Spark Streaming; External Business Rules Engine; Conclusion
10. Data Warehouse
Using Hadoop for Data Warehousing; Defining the Use Case; OLTP Schema; Data Warehouse: Introduction and Terminology; Data Warehousing with Hadoop; High-Level Design; Data Modeling and Storage; Ingestion; Data Processing and Access; Aggregations; Data Export; Orchestration; Conclusion
A. Joins in Impala
Broadcast Joins; Partitioned Hash Join
Index

Content preview from Hadoop Application Architectures

Appendix A. Joins in Impala

We provided an overview of Impala and how it works in Chapter 3. In this appendix, we look at how Impala plans and executes a distributed join query. At the time of this writing, Impala has two join strategies: broadcast joins and partitioned hash joins.

Broadcast Joins

The broadcast join is Impala's first and default join pattern. In a broadcast join, Impala takes the smaller data set and distributes it to all the Impala daemons involved with the query plan. Once it is distributed, each participating Impala daemon stores the data set as an in-memory hash table. Each daemon then reads the parts of the larger data set that are local to its node and uses the in-memory hash table to find the rows that match between the two tables (i.e., it performs a hash join). There is no need to read the entire large data set into memory, so Impala uses a 1 GB buffer to read the large table and perform the join part by part.
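The build-and-probe flow described above can be sketched in a few lines of Python. This is only an illustrative simulation, not Impala's implementation: the function names (`build_hash_table`, `probe`, `broadcast_join`), the sample tables, and the `country_id` join column are all hypothetical, and each "daemon" is modeled as one list of node-local rows.

```python
from collections import defaultdict


def build_hash_table(small_rows, key):
    """Build an in-memory hash table from the smaller data set, keyed on the join column."""
    table = defaultdict(list)
    for row in small_rows:
        table[row[key]].append(row)
    return table


def probe(local_big_rows, hash_table, key):
    """Stream the node-local part of the larger data set and yield the joined matches."""
    for big_row in local_big_rows:
        for small_row in hash_table.get(big_row[key], []):
            yield {**small_row, **big_row}


def broadcast_join(small_rows, big_partitions, key):
    """Simulate a broadcast join: every 'daemon' holds its own copy of the
    small table's hash table and probes only its local partition."""
    results = []
    for local_rows in big_partitions:
        hash_table = build_hash_table(small_rows, key)  # one copy per daemon
        results.extend(probe(local_rows, hash_table, key))
    return results


# Hypothetical sample data: a small dimension table and a larger table
# split across two simulated nodes.
countries = [
    {"country_id": 1, "country": "US"},
    {"country_id": 2, "country": "CA"},
]
orders_by_node = [
    [{"order_id": 10, "country_id": 1}, {"order_id": 11, "country_id": 3}],
    [{"order_id": 12, "country_id": 2}],
]

joined = broadcast_join(countries, orders_by_node, "country_id")
for row in joined:
    print(row)
```

In this sketch the join is an inner join, so order 11, whose `country_id` has no match in the small table, is dropped. Because the hash table is rebuilt once per simulated daemon, the memory cost of the small table is paid on every node, while the large table is only ever streamed.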

Figures A-1 and A-2 show how this works. Figure A-1 shows how each daemon caches the smaller data set. While this join strategy is simple, it requires that at least one of the tables in the join be small.

It’s important to note that:

The smaller data set is now taking up memory on every node. So if you have three nodes with 50 GB of Impala memory, the smaller data set in a broadcast ...

Publisher Resources

ISBN: 9781491910313

Figure A-1. Smaller table caching in broadcast joins
