Live online training with Donald Miner

Tuesday, October 27, 2015 | 8:00am – 4:00pm PDT

Introduction to using Hadoop with Python

In this live, hands-on online training class, students will learn how to use Python with Apache Hadoop to store, process, and analyze large amounts of data with HDFS, MapReduce, HBase, Pig, Spark, and other Hadoop-based systems. Hadoop has quickly become the standard for distributed data processing, but most aspects of using it require Java.

Today, there are a number of community-driven open source projects that support different aspects of the Hadoop ecosystem in Python. This tutorial goes through each of the Python Hadoop libraries and shows students how to use them by example.

Donald Miner

Program

Who this is for:

This course is for Python programmers who want to learn Hadoop in a practical, hands-on way. We assume students know Python at least at a beginner level.

Prerequisite:

None


Introduction (30 minutes)

  • Introduction of the speaker
  • Talk about my Python experience in distributed computing
  • Talk about my experience with Hadoop
  • Talk about how my life is so much better with Hadoop + Python working together
  • Motivation for using Python with Hadoop over Java

Hadoop Distributed File System (75 minutes)

  • Introduction to HDFS: how it works and an outline of its components
  • An overview of how to use HDFS through the bash shell (with quick exercise)
  • An overview of the **snakebite** library and what it can do (with a quick exercise)

MapReduce (75 minutes)

  • An introduction to MapReduce: how it works and an outline of its components
  • An example of a Java MapReduce job to motivate why doing it with Python is a good idea
  • How to use **hadoop-streaming** to write a MapReduce job with Python (with exercise)
  • How to use **mrjob** to write a MapReduce job with Python, along with an overview of the mrjob package (with exercise)
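
As a flavor of what the streaming exercises cover, here is a minimal word-count sketch in the hadoop-streaming style: the mapper and reducer are plain Python generators over line streams, wired to Hadoop only through tab-separated text on stdin/stdout. The function names are illustrative, not part of any API.

```python
# Word count in the hadoop-streaming style (illustrative sketch).
# In a real job each function becomes its own script reading sys.stdin,
# submitted with something like:
#   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py \
#       -input in/ -output out/

def mapper(lines):
    # Emit a (word, 1) pair for every word on every input line.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Sum counts per word; assumes pairs arrive sorted by key,
    # which Hadoop guarantees between the map and reduce phases.
    current, count = None, 0
    for word, n in pairs:
        if word != current:
            if current is not None:
                yield current, count
            current, count = word, 0
        count += n
    if current is not None:
        yield current, count
```

Locally, the shuffle phase can be emulated with `sorted(mapper(lines))` before feeding `reducer`; this is also roughly the plumbing that mrjob hides behind its `mapper` and `reducer` methods.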

Pig (45 minutes)

  • Introduction to Pig and the language
  • An example of how to write a Pig job that uses Python user-defined functions (with exercise)
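
For a taste of the Pig section, here is a sketch of a Python (Jython) UDF. When the script runs under Pig, the `outputSchema` decorator is supplied by the runtime; the try/except fallback only exists so the file can be exercised outside Pig.

```python
# Sketch of a Pig user-defined function written in Python.
try:
    from pig_util import outputSchema  # supplied when running under Pig
except ImportError:
    def outputSchema(schema):
        # Stand-in so the UDF can be imported and tested outside Pig.
        def wrap(func):
            return func
        return wrap

@outputSchema("reversed:chararray")
def reverse(s):
    # Reverse a chararray field, passing nulls through.
    return s[::-1] if s is not None else None
```

On the Pig Latin side this would be registered and called with something like `REGISTER 'udfs.py' USING jython AS myfuncs;` and `FOREACH data GENERATE myfuncs.reverse(name);` (the script and alias names here are hypothetical).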

Lunch break

Hive (45 minutes)

  • Introduction to Hive and the SQL-like language
  • An example of how to write a Hive job that uses Python user-defined functions (with exercise)
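
Hive's Python hook is similar in spirit: the `TRANSFORM` clause streams tab-separated rows through an external script's stdin and stdout. A sketch, with a made-up pass/fail grading example:

```python
import sys

def transform_row(line):
    # Input row: "name<TAB>score"; output row: "name<TAB>grade".
    name, score = line.rstrip("\n").split("\t")
    grade = "pass" if int(score) >= 60 else "fail"
    return "%s\t%s" % (name, grade)

def main(stdin=sys.stdin, stdout=sys.stdout):
    # Hive feeds rows on stdin and reads transformed rows from stdout.
    for line in stdin:
        stdout.write(transform_row(line) + "\n")
```

The Hive side would look roughly like `SELECT TRANSFORM(name, score) USING 'python grade.py' AS (name, grade) FROM scores;` (table, column, and file names here are hypothetical).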

Luigi (45 minutes)

  • Tying together MapReduce jobs with **luigi**, along with an overview of the **luigi** package (with exercise)

PySpark (75 minutes)

  • An introduction to Spark: how it works and an outline of its components
  • An overview of the Java and Scala APIs
  • An overview of the Python API
  • How to use the PySpark APIs (with exercise)
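
To give a flavor of the PySpark API, here is a word-count sketch. The helper functions are plain Python and mirror what each RDD transformation applies per record; `spark_word_count` assumes an existing `SparkContext`, so the full pipeline only runs against a real Spark installation.

```python
def split_words(line):
    # flatMap step: one lowercase word per output record.
    return line.lower().split()

def to_pair(word):
    # map step: pair each word with an initial count of 1.
    return (word, 1)

def add(a, b):
    # reduceByKey step: combine counts for the same word.
    return a + b

def spark_word_count(sc, path):
    # sc: an existing SparkContext; path: an HDFS or local file path.
    return (sc.textFile(path)
              .flatMap(split_words)
              .map(to_pair)
              .reduceByKey(add)
              .collect())
```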

HBase (15 minutes)

  • Introduction to HBase: how it works and an outline of its components
  • An overview of the Python options for HBase (no exercise here, since the available packages are not yet mature)

Overview of other technologies and their Python interfaces (30 minutes)

  • Python and Apache Storm for stream processing
  • Python and Apache Spark for general purpose distributed computing
  • Python and Apache Accumulo (an alternative system that is similar to HBase)
  • Python and Hive, Impala, and other SQL-on-Hadoop solutions

Summary (15 minutes)

  • What stinks? Using Python with Hadoop has some downsides in certain cases, specifically in terms of support and performance; we’ll take some time to outline the most important ones.
  • What’s next? What kinds of projects are the community working on?
  • What needs to be done? What are some Hadoop and Python integration projects that are missing?

About the instructor

Donald Miner

Donald Miner is founder of the data science firm Miner & Kasch and specializes in Hadoop enterprise architecture and applying machine learning to real-world business problems. Donald is author of the O’Reilly book MapReduce Design Patterns and the upcoming O’Reilly book Enterprise Hadoop. He has architected and implemented dozens of mission-critical and large-scale Hadoop systems within the U.S. Government and Fortune 500 companies. He has applied machine learning techniques to analyze data across several verticals, including financial, retail, telecommunications, health care, government intelligence, and entertainment. His PhD is from the University of Maryland, Baltimore County, where he focused on machine learning and multi-agent systems. He lives in Maryland with his wife and two young sons.


Register now; October 27 is just around the corner.

Participants receive live online training + video + report

  • Access to the live workshop
  • Interaction with the instructor and fellow attendees
  • Real-time Q&A sessions
  • Post-workshop video
  • An O’Reilly Certificate of Completion

Individual ticket: $599

Participate in this workshop from the convenience of your home, your office…whatever environment you find most comfortable and conducive to an intensive educational experience.

Group ticket: $1499

Project the workshop on a screen in a meeting room and invite your professional colleagues to participate. Learning alongside each other is a great team-building experience.

Once you have registered, further details about joining the workshop will be available in your members.oreilly.com account, along with related ebooks and files. After the event concludes, a video of the event will be added to your account.
