Apache Oozie

by Mohammad Kamrul Islam, Aravind Srinivasan

May 2015

Beginner to intermediate

272 pages

7h 22m

English

O'Reilly Media, Inc.

Foreword
Preface
Contents of This Book; Conventions Used in This Book; Using Code Examples; Safari® Books Online; How to Contact Us; Acknowledgments
1. Introduction to Oozie
Big Data Processing; A Recurrent Problem; A Common Solution: Oozie; A Simple Oozie Job; Oozie Releases; Some Oozie Usage Numbers
2. Oozie Concepts
Oozie Applications; Oozie Workflows; Oozie Coordinators; Oozie Bundles; Parameters, Variables, and Functions; Application Deployment Model; Oozie Architecture
3. Setting Up Oozie
Oozie Deployment; Basic Installations; Requirements; Build Oozie; Install Oozie Server; Hadoop Cluster; Start and Verify the Oozie Server; Advanced Oozie Installations; Configuring Kerberos Security; DB Setup; Shared Library Installation; Oozie Client Installations
4. Oozie Workflow Actions
Workflow; Actions; Action Execution Model; Action Definition; Action Types; MapReduce Action; Java Action; Pig Action; FS Action; Sub-Workflow Action; Hive Action; DistCp Action; Email Action; Shell Action; SSH Action; Sqoop Action; Synchronous Versus Asynchronous Actions
5. Workflow Applications
Outline of a Basic Workflow; Control Nodes; <start> and <end>; <fork> and <join>; <decision>; <kill>; <OK> and <ERROR>; Job Configuration; Global Configuration; Job XML; Inline Configuration; Launcher Configuration; Parameterization; EL Variables; EL Functions; EL Expressions; The job.properties File; Command-Line Option; The config-default.xml File; The <parameters> Section; Configuration and Parameterization Examples; Lifecycle of a Workflow; Action States
6. Oozie Coordinator
Coordinator Concept; Triggering Mechanism; Time Trigger; Data Availability Trigger; Coordinator Application and Job; Coordinator Action; Our First Coordinator Job; Coordinator Submission; Oozie Web Interface for Coordinator Jobs; Coordinator Job Lifecycle; Coordinator Action Lifecycle; Parameterization of the Coordinator; EL Functions for Frequency; Day-Based Frequency; Month-Based Frequency; Execution Controls; An Improved Coordinator
7. Data Trigger Coordinator
Expressing Data Dependency; Dataset; Example: Rollup; Parameterization of Dataset Instances; current(n); latest(n); Parameter Passing to Workflow; dataIn(eventName); dataOut(eventName); nominalTime(); actualTime(); dateOffset(baseTimeStamp, skipInstance, timeUnit); formatTime(timeStamp, formatString); A Complete Coordinator Application
8. Oozie Bundles
Bundle Basics; Bundle Definition; Why Do We Need Bundles?; Bundle Specification; Execution Controls; Bundle State Transitions

9. Advanced Topics
Managing Libraries in Oozie; Origin of JARs in Oozie; Design Challenges; Managing Action JARs; Supporting the User’s JAR; JAR Precedence in classpath; Oozie Security; Oozie Security Overview; Oozie to Hadoop; Oozie Client to Server; Supporting Custom Credentials; Supporting New API in MapReduce Action; Supporting Uber JAR; Cron Scheduling; A Simple Cron-Based Coordinator; Oozie Cron Specification; Emulate Asynchronous Data Processing; HCatalog-Based Data Dependency
10. Developer Topics
Developing Custom EL Functions; Requirements for a New EL Function; Implementing a New EL Function; Supporting Custom Action Types; Creating a Custom Synchronous Action; Overriding an Asynchronous Action Type; Implementing the New ActionMain Class; Testing the New Main Class; Creating a New Asynchronous Action; Writing an Asynchronous Action Executor; Writing the ActionMain Class; Writing Action’s Schema; Deploying the New Action Type; Using the New Action Type
11. Oozie Operations
Oozie CLI Tool; CLI Subcommands; Useful CLI Commands; Oozie REST API; Oozie Java Client; The oozie-site.xml File; The Oozie Purge Service; Job Monitoring; JMS-Based Monitoring; Oozie Instrumentation and Metrics; Reprocessing; Workflow Reprocessing; Coordinator Reprocessing; Bundle Reprocessing; Server Tuning; JVM Tuning; Service Settings; Oozie High Availability; Debugging in Oozie; Oozie Logs; Developing and Testing Oozie Applications; Application Deployment Tips; Common Errors and Debugging; MiniOozie and LocalOozie; The Competition
Index

Content preview from Apache Oozie

Chapter 4. Oozie Workflow Actions

The previous chapter took us through the Oozie installation in detail. In this chapter, we start building full-fledged Oozie applications. The first step is to learn about Oozie workflows: many users still use Oozie primarily as a workflow manager, and Oozie’s advanced features (e.g., the coordinator) are built on top of the workflow. This chapter delves into how to define and deploy the individual action nodes that make up Oozie workflows. These action nodes are the heart and soul of a workflow because they do the actual processing, so we will cover them in detail.

Workflow

As explained earlier in “A Recurrent Problem”, most Hadoop projects start simple, but quickly become complex. Let’s look at how a Hadoop data pipeline typically evolves in an enterprise. The first step in many big data analytic platforms is usually data ingestion from some upstream data source into Hadoop. This could be a weblog collection system or some data store in the cloud (e.g., Amazon S3). Hadoop DistCp, for example, is a common tool used to pull data from S3. Once the data is available, the next step is to run a simple analytic query, perhaps in the form of a Hive query, to get answers to some business question. This system will grow over time with more queries and different kinds of jobs. At some point soon, there will be a need to make this a recurring pipeline, typically a daily pipeline. The first ...

Publisher Resources

ISBN: 9781449369910
