Live online training with Donald Miner

Tuesday, October 27, 2015 | 8:00am – 4:00pm PDT

Introduction to using Hadoop with Python

In this live, hands-on online training class, students will learn how to use Python with Apache Hadoop to store, process, and analyze large amounts of data with HDFS, MapReduce, HBase, Pig, Spark, and other Hadoop-based systems. Hadoop has quickly become the standard for distributed data processing, but most aspects of using it require Java.

Today, there are a number of community-driven open source projects that support different aspects of the Hadoop ecosystem in Python. This tutorial goes through each of the Python Hadoop libraries and shows students how to use them by example.

Donald Miner

Program

Who this is for:

This course is for Python programmers who want to learn Hadoop in a practical, hands-on way. We assume students know Python at least at a beginner level.

Prerequisite:

None


Introduction (30 minutes)

  • Introduction of the speaker
  • Talk about my Python experience in distributed computing
  • Talk about my experience with Hadoop
  • Talk about how my life is so much better with Hadoop + Python working together
  • Motivation for using Python with Hadoop over Java

Hadoop Distributed File System (75 minutes)

  • Introduction to HDFS: how it works and an outline of its components
  • An overview of how to use HDFS through the bash shell (with quick exercise)
  • An overview of the **snakebite** library and what it can do (with a quick exercise)

MapReduce (75 minutes)

  • An introduction to MapReduce: how it works and an outline of its components
  • An example of a Java MapReduce job to motivate why doing it with Python is a good idea
  • How to use **hadoop-streaming** to write a MapReduce job with Python (with exercise)
  • How to use **mrjob** to write a MapReduce job with Python, along with an overview of the mrjob package (with exercise)
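
As a flavor of what the streaming exercises cover, here is a minimal word-count sketch in the hadoop-streaming style: the mapper and reducer are plain Python generators over line streams, wired to Hadoop only through tab-separated text on stdin/stdout. The function names are illustrative, not part of any API.

```python
# Word count in the hadoop-streaming style (illustrative sketch).
# In a real job each function becomes its own script reading sys.stdin,
# submitted with something like:
#   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py \
#       -input in/ -output out/

def mapper(lines):
    # Emit a (word, 1) pair for every word on every input line.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Sum counts per word; assumes pairs arrive sorted by key,
    # which Hadoop guarantees between the map and reduce phases.
    current, count = None, 0
    for word, n in pairs:
        if word != current:
            if current is not None:
                yield current, count
            current, count = word, 0
        count += n
    if current is not None:
        yield current, count
```

Locally, the shuffle phase can be emulated with `sorted(mapper(lines))` before feeding `reducer`; this is also roughly the plumbing that mrjob hides behind its `mapper` and `reducer` methods.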

Pig (45 minutes)

  • Introduction to Pig and the language
  • An example of how to write a Pig job that uses Python user-defined functions (with exercise)
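
For a taste of the Pig section, here is a sketch of a Python (Jython) UDF. When the script runs under Pig, the `outputSchema` decorator is supplied by the runtime; the try/except fallback only exists so the file can be exercised outside Pig.

```python
# Sketch of a Pig user-defined function written in Python.
try:
    from pig_util import outputSchema  # supplied when running under Pig
except ImportError:
    def outputSchema(schema):
        # Stand-in so the UDF can be imported and tested outside Pig.
        def wrap(func):
            return func
        return wrap

@outputSchema("reversed:chararray")
def reverse(s):
    # Reverse a chararray field, passing nulls through.
    return s[::-1] if s is not None else None
```

On the Pig Latin side this would be registered and called with something like `REGISTER 'udfs.py' USING jython AS myfuncs;` and `FOREACH data GENERATE myfuncs.reverse(name);` (the script and alias names here are hypothetical).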

Lunch break

Hive (45 minutes)

  • Introduction to Hive and the SQL-like language
  • An example of how to write a Hive job that uses Python user-defined functions (with exercise)
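
Hive's Python hook is similar in spirit: the `TRANSFORM` clause streams tab-separated rows through an external script's stdin and stdout. A sketch, with a made-up pass/fail grading example:

```python
import sys

def transform_row(line):
    # Input row: "name<TAB>score"; output row: "name<TAB>grade".
    name, score = line.rstrip("\n").split("\t")
    grade = "pass" if int(score) >= 60 else "fail"
    return "%s\t%s" % (name, grade)

def main(stdin=sys.stdin, stdout=sys.stdout):
    # Hive feeds rows on stdin and reads transformed rows from stdout.
    for line in stdin:
        stdout.write(transform_row(line) + "\n")
```

The Hive side would look roughly like `SELECT TRANSFORM(name, score) USING 'python grade.py' AS (name, grade) FROM scores;` (table, column, and file names here are hypothetical).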

Luigi (45 minutes)

  • Tying together MapReduce jobs with **luigi**, along with an overview of the **luigi** package (with exercise)

PySpark (75 minutes)

  • An introduction to Spark: how it works and an outline of its components
  • An overview of the Java and Scala APIs
  • An overview of the Python API
  • How to use the PySpark APIs (with exercise)
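
To give a flavor of the PySpark API, here is a word-count sketch. The helper functions are plain Python and mirror what each RDD transformation applies per record; `spark_word_count` assumes an existing `SparkContext`, so the full pipeline only runs against a real Spark installation.

```python
def split_words(line):
    # flatMap step: one lowercase word per output record.
    return line.lower().split()

def to_pair(word):
    # map step: pair each word with an initial count of 1.
    return (word, 1)

def add(a, b):
    # reduceByKey step: combine counts for the same word.
    return a + b

def spark_word_count(sc, path):
    # sc: an existing SparkContext; path: an HDFS or local file path.
    return (sc.textFile(path)
              .flatMap(split_words)
              .map(to_pair)
              .reduceByKey(add)
              .collect())
```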

HBase (15 minutes)

  • Introduction to HBase: how it works and an outline of its components
  • An overview of the Python options for HBase (no exercise here, since the available packages are not yet mature)

Overview of other technologies and their Python interfaces (30 minutes)

  • Python and Apache Storm for stream processing
  • Python and Apache Spark for general purpose distributed computing
  • Python and Apache Accumulo (an alternative system that is similar to HBase)
  • Python and Hive, Impala, and other SQL-on-Hadoop solutions

Summary (15 minutes)

  • What stinks? Using Python with Hadoop has some downsides in certain cases, specifically in terms of support and performance; we’ll take some time to outline the most important ones.
  • What’s next? What kinds of projects are the community working on?
  • What needs to be done? What are some Hadoop and Python integration projects that are missing?

About the instructor

Donald Miner

Donald Miner is founder of the data science firm Miner & Kasch and specializes in Hadoop enterprise architecture and applying machine learning to real-world business problems. Donald is author of the O’Reilly book MapReduce Design Patterns and the upcoming O’Reilly book Enterprise Hadoop. He has architected and implemented dozens of mission-critical and large-scale Hadoop systems within the U.S. Government and Fortune 500 companies. He has applied machine learning techniques to analyze data across several verticals, including financial, retail, telecommunications, health care, government intelligence, and entertainment. His PhD is from the University of Maryland, Baltimore County, where he focused on machine learning and multi-agent systems. He lives in Maryland with his wife and two young sons.


Register now; October 27 is just around the corner.

Participants receive live online training + video + report

  • Access to the live workshop
  • Interaction with the instructor and fellow attendees
  • Real-time Q&A sessions
  • Post-workshop video
  • An O’Reilly Certificate of Completion

Individual ticket: $599

Participate in this workshop from the convenience of your home, your office…whatever environment you find most comfortable and conducive to an intensive educational experience.

Group ticket: $1499

Project the workshop on a screen in a meeting room and invite your professional colleagues to participate. Learning alongside each other is a great team-building experience.

Once you have registered, further details about joining the workshop will be available in your members.oreilly.com account, along with related ebooks and files. After the event concludes, a video of the event will be added to your account.
