Hadoop with Python

by Zach Radtka, Donald Miner

October 2015

Intermediate to advanced

50 pages

1h 15m

English

O'Reilly Media, Inc.

1. Hadoop Distributed File System (HDFS)
Overview of HDFS, Interacting with HDFS, Common File Operations, HDFS Command Reference, Snakebite, Installation, Client Library, CLI Client, Chapter Summary
2. MapReduce with Python
Data Flow, Map, Shuffle and Sort, Reduce, Hadoop Streaming, How It Works, A Python Example, mrjob, Installation, WordCount in mrjob, What Is Happening, Executing mrjob, Top Salaries, Chapter Summary
3. Pig and Python
WordCount in Pig, WordCount in Detail, Running Pig, Execution Modes, Interactive Mode, Batch Mode, Pig Latin, Statements, Loading Data, Transforming Data, Storing Data, Extending Pig with Python, Registering a UDF, A Simple Python UDF, String Manipulation, Most Recent Movies, Chapter Summary
4. Spark with Python
WordCount in PySpark, WordCount Described, PySpark, Interactive Shell, Self-Contained Applications, Resilient Distributed Datasets (RDDs), Creating RDDs from Collections, Creating RDDs from External Sources, RDD Operations, Text Search with PySpark, Chapter Summary
5. Workflow Management with Python
Installation, Workflows, Tasks, Target, Parameters, An Example Workflow, Task.requires, Task.output, Task.run, Parameters, Execution, Hadoop Workflows, Configuration File, MapReduce in Luigi, Pig in Luigi, Chapter Summary

Overview

Hadoop is mostly written in Java, but that doesn't exclude the use of other programming languages with this distributed storage and processing framework, particularly Python. With this concise book, you’ll learn how to use Python with the Hadoop Distributed File System (HDFS), MapReduce, the Apache Pig platform and Pig Latin script, and the Apache Spark cluster-computing framework.

Authors Zachary Radtka and Donald Miner from the data science firm Miner & Kasch take you through the basic concepts behind Hadoop, MapReduce, Pig, and Spark. Then, through multiple examples and use cases, you'll learn how to work with these technologies by applying various Python tools.

Use the Python library Snakebite to access HDFS programmatically from within Python applications
Write MapReduce jobs in Python with mrjob, the Python MapReduce library
Extend Pig Latin with user-defined functions (UDFs) in Python
Use the Spark Python API (PySpark) to write Spark programs with Python
Learn how to use the Luigi Python workflow scheduler to manage MapReduce jobs and Pig scripts
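
To give a feel for the MapReduce data flow named in the chapter outline (map, shuffle and sort, reduce), here is a minimal sketch of WordCount in plain Python. It is not taken from the book and does not use Hadoop, mrjob, or any other library mentioned above; the function names `mapper`, `reducer`, and `word_count` are hypothetical, and all three phases run in a single local process purely for illustration.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def reducer(word, counts):
    """Reduce phase: sum all counts that were grouped under one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle-and-sort, and reduce in a single local process."""
    # Map: apply the mapper to every line and collect the emitted pairs.
    pairs = [pair for line in lines for pair in mapper(line)]
    # Shuffle and sort: order pairs by key so equal words sit next to each other.
    pairs.sort(key=itemgetter(0))
    # Reduce: call the reducer once per distinct word.
    return dict(
        reducer(word, (count for _, count in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    )


if __name__ == "__main__":
    # Hypothetical two-line input, used only to exercise the sketch.
    print(word_count(["jack be nimble", "jack be quick"]))
```

On a real cluster the framework, not the program, would handle the sort and the grouping between the two phases; the sketch only imitates that step with `sort` and `groupby`.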

Zachary Radtka, a platform engineer at Miner & Kasch, has extensive experience creating custom analytics that run on petabyte-scale data sets.

Publisher Resources

ISBN: 9781492048435
