Programming Pig, 2nd Edition

by Alan Gates, Daniel Dai

November 2016

Intermediate to advanced

368 pages

9h 59m

English

O'Reilly Media, Inc.

Preface
Who Should Read This Book; What’s New in This Edition; Conventions Used in This Book; Code Examples in This Book; Using Code Examples; Safari® Books Online; How to Contact Us; Acknowledgments from the First Edition (Alan Gates); Second Edition Acknowledgments (Alan Gates and Daniel Dai)
1. What Is Pig?
Pig Latin, a Parallel Data Flow Language; Comparing Query and Data Flow Languages; Pig on Hadoop; MapReduce’s “Hello World”; How Pig Differs from MapReduce; What Is Pig Useful For?; The Pig Philosophy; Pig’s History
2. Installing and Running Pig
Downloading and Installing Pig; Downloading the Pig Package from Apache; Installation and Setup; Downloading Pig Artifacts from Maven; Downloading the Source; Downloading Pig from Distributions; Running Pig; Running Pig Locally on Your Machine; Running Pig on Your Hadoop Cluster; Running Pig in the Cloud; Command-Line and Configuration Options; Return Codes; Grunt; Entering Pig Latin Scripts in Grunt; HDFS Commands in Grunt; Controlling Pig from Grunt; Running External Commands; Others
3. Pig’s Data Model
Types; Scalar Types; Complex Types; Nulls; Schemas; Casts
4. Introduction to Pig Latin
Preliminary Matters; Case Sensitivity; Comments; Input and Output; load; store; dump; Relational Operations; foreach; filter; group; order by; distinct; join; limit; sample; parallel; User-Defined Functions; Registering Java UDFs; Registering UDFs in Scripting Languages; define and UDFs; Calling Static Java Functions; Calling Hive UDFs
5. Advanced Pig Latin
Advanced Relational Operations; Advanced Features of foreach; Casting a Relation to a Scalar; Using Different Join Implementations; cogroup; union; cross; More on Nested foreach; rank; cube; assert; Integrating Pig with Executables and Native Jobs; stream; native; split and Nonlinear Data Flows; Controlling Execution; set; Setting the Partitioner; Pig Latin Preprocessor; Parameter Substitution; Macros; Including Other Pig Latin Scripts
6. Developing and Testing Pig Latin Scripts
Development Tools; Syntax Highlighting and Checking; describe; explain; illustrate; Pig Statistics; Job Status; Debugging Tips; Testing Your Scripts with PigUnit
7. Making Pig Fly
Writing Your Scripts to Perform Well; Filter Early and Often; Project Early and Often; Set Up Your Joins Properly; Use Multiquery When Possible; Choose the Right Data Type; Select the Right Level of Parallelism; Writing Your UDFs to Perform; Tuning Pig and Hadoop for Your Job; Using Compression in Intermediate Results; Data Layout Optimization; Map-Side Aggregation; The JAR Cache; Processing Small Jobs Locally; Bloom Filters; Schema Tuple Optimization; Dealing with Failures
8. Embedding Pig
Embedding Pig Latin in Scripting Languages; Compiling; Binding; Running; Utility Methods; Using the Pig Java APIs; PigServer; PigRunner
9. Writing Evaluation and Filter Functions
Writing an Evaluation Function in Java; Where Your UDF Will Run; Evaluation Function Basics; Input and Output Schemas; Error Handling and Progress Reporting; Constructors and Passing Data from Frontend to Backend; Overloading UDFs; Variable-Length Input Schema; Memory Issues in Eval Funcs; Compile-Time Evaluation; Shipping JARs Automatically; The Algebraic Interface; The Accumulator Interface; Writing Filter Functions; Writing Evaluation Functions in Scripting Languages; Jython UDFs; JavaScript UDFs; JRuby UDFs; Groovy UDFs; Streaming Python UDFs; Comparing Scripting Language UDF Features

10. Writing Load and Store Functions
Load Functions; Frontend Planning Functions; Passing Information from the Frontend to the Backend; Backend Data Reading; Additional Load Function Interfaces; Store Functions; Store Function Frontend Planning; Store Functions and UDFContext; Writing Data; Failure Cleanup; Storing Metadata; Shipping JARs Automatically; Handling Bad Records
11. Pig on Tez
What Is Tez?; Running Pig on Tez; Potential Differences When Running on Tez; UDFs; Using PigRunner; Testing and Debugging; Pig on Tez Internals; Multiple Backends in Pig; The Tez Optimizer; Operators and Implementation; Automatic Parallelism
12. Pig and Other Members of the Hadoop Community
Pig and Hive; HCatalog; WebHCat; Cascading; Spark; NoSQL Databases; HBase; Accumulo; Cassandra; DataFu; Oozie
13. Use Cases and Programming Examples
Sparse Tuples; k-Means; intersect and except; Pig at Yahoo!; Apache Pig Use Cases at Yahoo!; Large-Scale ETL with Apache Pig; Features That Make Pig Attractive; Pig on Tez; Moving Forward; Pig at Particle News; Compute Arrival Rate and Conversion Rate; Compute Sessions Triggered by a Push
A. Built-in User Defined Functions and PiggyBank
Built-in UDFs; Built-in Load and Store Functions; Built-in Evaluation and Filter Functions; PiggyBank
Index

Content preview from Programming Pig, 2nd Edition

Chapter 4. Introduction to Pig Latin

It is time to dig into Pig Latin. This chapter provides you with the basics of Pig Latin, enough to write your first useful scripts. More advanced features of Pig Latin are covered in Chapter 5.

Preliminary Matters

Pig Latin is a data flow language. Each processing step results in a new dataset, or relation. In input = load 'data', input is the name of the relation that results from loading the dataset data. A relation name is referred to as an alias. Relation names look like variables, but they are not. Once made, an assignment is permanent. It is possible to reuse relation names; for example, this is legitimate:

A = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
A = filter A by dividends > 0;
A = foreach A generate UPPER(symbol);

However, it is not recommended. It looks here as if you are reassigning A, but really you are creating new relations called A, and losing track of the old relations called A. Pig is smart enough to keep up, but it still is not a good practice. It leads to confusion when trying to read your programs (which A am I referring to?) and when reading error messages.
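For contrast, here is a minimal sketch of the same pipeline with one alias per step. The alias names divs, positive, and symbols are illustrative choices, not names from the book. The load statement assumes the same NYSE_dividends input and untyped schema as the snippet above.

```pig
-- Same three steps as above, but each relation gets its own alias.
-- The alias names (divs, positive, symbols) are hypothetical.
divs     = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
positive = filter divs by dividends > 0;
symbols  = foreach positive generate UPPER(symbol);
```

With distinct aliases, a reference such as positive can only mean one relation, which should make both the script and any error message that names an alias easier to follow.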

In addition to relation names, Pig Latin also has field names. They name a field (or column) in a relation. In the previous snippet of Pig Latin, dividends and symbol are examples of field names. These are somewhat like variables in that they will contain a different value for each record as it passes through the pipeline, but you cannot assign values to ...

Publisher Resources

ISBN: 9781491937082
