Enterprise Data Workflows with Cascading

Name: Enterprise Data Workflows with Cascading
Author: Paco Nathan
ISBN: 9781449359607

by Paco Nathan

July 2013

Intermediate to advanced

170 pages

4h 7m

English

O'Reilly Media, Inc.


Enterprise Data Workflows with Cascading
Preface
  Requirements
  Enterprise Data Workflows
  Complexity, More So Than Bigness
  Origins of the Cascading API
  Using Code Examples
  Safari® Books Online
  How to Contact Us
  Kudos
1. Getting Started
  Programming Environment Setup
  Example 1: Simplest Possible App in Cascading
  Build and Run
  Cascading Taxonomy
  Example 2: The Ubiquitous Word Count
  Flow Diagrams
  Predictability at Scale
2. Extending Pipe Assemblies
  Example 3: Customized Operations
  Scrubbing Tokens
  Example 4: Replicated Joins
  Stop Words and Replicated Joins
  Comparing with Apache Pig
  Comparing with Apache Hive
3. Test-Driven Development
  Example 5: TF-IDF Implementation
  Example 6: TF-IDF with Testing
  A Word or Two About Testing
4. Scalding—A Scala DSL for Cascading
  Why Use Scalding?
  Getting Started with Scalding
  Example 3 in Scalding: Word Count with Customized Operations
  A Word or Two about Functional Programming
  Example 4 in Scalding: Replicated Joins
  Build Scalding Apps with Gradle
  Running on Amazon AWS
5. Cascalog—A Clojure DSL for Cascading
  Why Use Cascalog?
  Getting Started with Cascalog
  Example 1 in Cascalog: Simplest Possible App
  Example 4 in Cascalog: Replicated Joins
  Example 6 in Cascalog: TF-IDF with Testing
  Cascalog Technology and Uses
6. Beyond MapReduce
  Applications and Organizations
  Lingual, a DSL for ANSI SQL
  Using the SQL Command Shell
  Using the JDBC Driver
  Integrating with Desktop Tools
  Pattern, a DSL for Predictive Model Markup Language
  Getting Started with Pattern
  Predefined App for PMML
  Integrating Pattern into Cascading Apps
  Customer Experiments
  Technology Roadmap for Pattern
7. The Workflow Abstraction
  Key Insights
  Pattern Language
  Literate Programming
  Separation of Concerns
  Functional Relational Programming
  Enterprise vs. Start-Ups
8. Case Study: City of Palo Alto Open Data
  Why Open Data?
  City of Palo Alto
  Moving from Raw Sources to Data Products
  Calibrating Metrics for the Recommender
  Spatial Indexing
  Personalization
  Recommendations
  Build and Run
  Key Points of the Recommender Workflow

A. Troubleshooting Workflows
  Build and Runtime Problems
  Anti-Patterns
  Workflow Bottlenecks
  Other Resources
Index
About the Author
Colophon
Copyright

Content preview from Enterprise Data Workflows with Cascading

Chapter 1. Getting Started

Programming Environment Setup

The following code examples show how to write apps in Cascading. The apps are intended to run on a laptop running Linux or Unix (including Mac OS X), using Apache Hadoop in standalone mode. If you are using a Windows-based laptop, many of these examples will not work, and generally speaking Hadoop does not behave well under Cygwin. However, you could run Linux in a virtual machine. These examples are also not intended to show how to set up and run a Hadoop cluster. There are other good resources for that; see Hadoop: The Definitive Guide by Tom White (O’Reilly).

First, you will need to have a few platforms and tools installed:

Java

Version 1.6.x was used to create these examples.
Get the JDK, not the JRE.
Install according to vendor instructions.

Apache Hadoop

Version 1.0.x is needed for Cascading 2.x used in these examples.
Be sure to install for “Standalone Operation.”

Gradle

Version 1.3 or later is required for some examples in this book.
Install according to vendor instructions.

Git

There are other ways to get code, but these examples show use of Git.
Install according to vendor instructions.

Because the Gradle and Git commands download JARs, check out code repos, and so on, you will need an Internet connection for most of the examples in this book.
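Before going further, it can help to confirm that the tools listed above are actually on your PATH. The following is a minimal sketch, not part of the book's own code: `check_tools` is a hypothetical helper name, and the command names (`java`, `javac`, `hadoop`, `gradle`, `git`) are assumed to be how each tool is installed on your system. Checking for `javac` is one way to tell that you have the JDK rather than only the JRE.

```shell
#!/bin/sh
# check_tools (hypothetical helper): report which of the given
# command-line tools can be found on the PATH.
# The return status is the number of missing tools (0 means all found).
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Assumed command names; adjust them if your installation differs.
check_tools java javac hadoop gradle git \
    || echo "Install the missing tools before continuing."
```

Once everything is found, `java -version`, `hadoop version`, `gradle --version`, and `git --version` each print the installed version, which you can compare against the versions listed above.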

Next, set up your command-line environment. You will need to have the following environment variables set properly, ...


Publisher Resources

ISBN: 9781449359584
