Chapter 6

Automating Data Processing with Oozie

WHAT’S IN THIS CHAPTER?

  • Understanding Oozie fundamentals
  • Getting to know the main Oozie components and programming for them
  • Understanding the overall Oozie execution model
  • Understanding Oozie support for Service Level Agreements (SLAs)

As you have learned in previous chapters, MapReduce jobs constitute the main execution engine of the Hadoop ecosystem. Over the years of applying Hadoop to complex projects, solution architects have learned that using MapReduce jobs without a higher-level framework to orchestrate and control their execution leads to complexity and potential pitfalls, for the following reasons:

  • Many data processing algorithms require several MapReduce jobs to be executed in a specific sequence. (For specific examples of this, see Chapter 3.) For simple tasks, the sequence might be known in advance; often, however, it depends on the intermediate results of the jobs themselves. Without a higher-level framework controlling sequence execution, managing these jobs becomes quite difficult. (A minimal workflow sketch follows this list.)
  • It is often advantageous to trigger a collection of MapReduce jobs based on time, on specific events, or on the availability of certain resources (for example, HDFS files). Using MapReduce alone typically requires executing jobs manually, and the more jobs you have, the more complex this becomes. (See the coordinator sketch after the workflow example below.)
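
As a preview of the first point, here is a minimal sketch of an Oozie workflow definition that chains two MapReduce jobs, where the second job runs only if the first succeeds. The application name, the com.example.* mapper and reducer classes, and the parameterized directories are hypothetical placeholders; workflow definitions are covered in detail later in this chapter.

    <workflow-app name="chained-mr" xmlns="uri:oozie:workflow:0.4">
        <start to="first-pass"/>

        <!-- First MapReduce job in the sequence -->
        <action name="first-pass">
            <map-reduce>
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <configuration>
                    <property>
                        <name>mapred.mapper.class</name>
                        <value>com.example.FirstMapper</value>
                    </property>
                    <property>
                        <name>mapred.reducer.class</name>
                        <value>com.example.FirstReducer</value>
                    </property>
                    <property>
                        <name>mapred.input.dir</name>
                        <value>${inputDir}</value>
                    </property>
                    <property>
                        <name>mapred.output.dir</name>
                        <value>${intermediateDir}</value>
                    </property>
                </configuration>
            </map-reduce>
            <!-- Only on success does control pass to the second job -->
            <ok to="second-pass"/>
            <error to="fail"/>
        </action>

        <!-- Second (map-only) job consumes the output of the first -->
        <action name="second-pass">
            <map-reduce>
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <configuration>
                    <property>
                        <name>mapred.mapper.class</name>
                        <value>com.example.SecondMapper</value>
                    </property>
                    <property>
                        <name>mapred.input.dir</name>
                        <value>${intermediateDir}</value>
                    </property>
                    <property>
                        <name>mapred.output.dir</name>
                        <value>${outputDir}</value>
                    </property>
                </configuration>
            </map-reduce>
            <ok to="done"/>
            <error to="fail"/>
        </action>

        <kill name="fail">
            <message>Workflow failed, error [${wf:errorMessage(wf:lastErrorNode())}]</message>
        </kill>
        <end name="done"/>
    </workflow-app>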
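
Addressing the second point, an Oozie coordinator can trigger a workflow based on time and data availability. The following sketch, again using hypothetical names and HDFS paths, runs the preceding workflow once a day, but only after that day's input directory appears in HDFS:

    <coordinator-app name="daily-chained-mr" frequency="${coord:days(1)}"
                     start="2013-01-01T00:00Z" end="2013-12-31T00:00Z"
                     timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
        <datasets>
            <!-- One directory of input data is expected per day -->
            <dataset name="daily-logs" frequency="${coord:days(1)}"
                     initial-instance="2013-01-01T00:00Z" timezone="UTC">
                <uri-template>hdfs:///data/logs/${YEAR}/${MONTH}/${DAY}</uri-template>
            </dataset>
        </datasets>
        <input-events>
            <!-- The workflow is not launched until the current day's
                 dataset instance is available -->
            <data-in name="input" dataset="daily-logs">
                <instance>${coord:current(0)}</instance>
            </data-in>
        </input-events>
        <action>
            <workflow>
                <!-- HDFS path where the workflow application is deployed -->
                <app-path>hdfs:///apps/chained-mr</app-path>
            </workflow>
        </action>
    </coordinator-app>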

You can alleviate these potential difficulties by leveraging the Apache Oozie workflow scheduler for Hadoop.
