Live Online Training

Data Pipelining with Luigi and Spark

How to build a functional distributed data processing pipeline from scratch

Brian Femiano

This training gives students exposure to several open-source data processing systems that scale to distributed workloads, specifically Luigi and Spark. We will work from the ground up, covering the minimum basics needed to start being productive with these technologies. Learning these frameworks on their own can be difficult for newcomers and even experienced data engineers, so this course focuses on how to tie them together to create useful automated batch data processing workflows.

Other courses deep-dive into, or give a general overview of, one or two frameworks. The problem is that students can walk away not understanding how to apply much of that knowledge, or how it fits together to deliver functioning components that bring value to companies and teams. This workshop is different: it walks students step by step through how several of the more popular technologies fit together, and how major tech companies leverage them.

What you'll learn, and how you can apply it

  • How to build a very simple Spark data processing application.
  • How to build a basic two-step Luigi workflow that incorporates the Spark job and its input data.
  • How to compile and run the different parts of the pipeline in a way that can be easily automated.
  • How Java, Python, and Scala, with their respective strengths, all play cooperative roles in the data engineering space.

This training course is for you because...

  • You're a beginner data engineer or data scientist who wants exposure to these tools in a way that demonstrates, rather than just talks about, their value.
  • You're a mid-level data engineer or data scientist with some experience with Spark or other frameworks, but no clear vision of how to bring them into an environment where automated workflows are a necessity.
  • You're a beginner or mid-level data engineer curious to see how easy it is to build Scala Spark jobs and run them locally for testing.

Prerequisites

  • Some exposure to Apache Spark and data frames, even if it's just the getting-started guide: https://spark.apache.org/docs/latest/quick-start.html (a short warm-up sketch follows this list)
  • Comfortable coding in Python 2.7
  • Comfortable getting around a Linux environment with very basic commands (ssh, ls, ps, etc.)
  • Some exposure to Scala is useful but not necessary. The workshop involves writing Scala, but each step will be explained in just enough detail for students to get the gist of what's happening. Many developers who write Scala Spark applications are novices with Scala.
  • Some exposure to Luigi is a bonus; failing that, experience with other batch workflow tools like Oozie or Azkaban will help. It's not at all required, though: the workshop will cover exactly what's happening step by step.
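
If the Spark prerequisite feels rusty, a warm-up in the spirit of the quick-start guide might look like the sketch below. The file path is hypothetical; any small text file works.

    # A minimal DataFrame warm-up, in the spirit of the Spark quick-start
    # guide. The file path below is hypothetical; any small text file works.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("quickstart-warmup").getOrCreate()

    # spark.read.text yields a DataFrame with a single "value" column.
    lines = spark.read.text("README.md")

    # Filter to lines mentioning Spark and count them.
    print(lines.filter(lines.value.contains("Spark")).count())

    spark.stop()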

Course Set-up

  • The following directions will help students install the Virtual Machine: https://github.com/bfemiano/song_plays_workshop_tutorial/blob/master/VM_Setup.md
  • For students who are courageous enough to forgo the VM, the GitHub page also contains step-by-step instructions for installing all of the VM dependencies on a local workstation. Some people running macOS or Linux might be interested in walking away from this workshop with a functional workstation for repeating the steps. Most will probably want the VM, though.

About your instructor

  • I grew up in a suburb in MD called Columbia, which is just outside Baltimore in Howard County. I have a younger sister who is very close in age and always kept me in check. I fell in love with computers and technology at a very young age. I loved playing games on my dad's PC and messing around with it, trying to break things for the fun of it. My main group of friends and I met when we were about 7 years old and mostly just stayed indoors playing video games. We became so close we still think of each other as brothers to this day. My other interests, like soccer, were more physically active and definitely helped me become a more well-rounded person. When I went to college in Virginia I took my interest in computers to another level of seriousness. I decided I wanted to work professionally in the field. The rest is history. I now live in New York with my wife Rachel and two young boys, Everett and Nathan. They mean the world to me.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Segment 1: Environment Setup (1 hour)

  • Instructor will help students get set up with the VM and offer manual directions for local workstation configuration.
  • Students will get up and running with VirtualBox/Vagrant and the supplied VM. Explore some of the commands. Students who opt for local workstation configuration will test a series of install commands to verify they have the correct versions.

Break: 10 min

Segment 2: Luigi setup (1 hour)

  • Instructor will explain what Luigi is and why it’s useful. Explain the core design patterns found in the framework.
  • Students will write a two-step Luigi pipeline in Python from scratch (a minimal sketch follows this list).
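
A two-step pipeline of the kind students will write might look like the minimal sketch below. The task names, file names, and logic are hypothetical, not the workshop's actual assignment.

    # A minimal two-step Luigi pipeline: one task produces a file, a second
    # task consumes it. Task and file names here are hypothetical.
    import luigi

    class ExtractPlays(luigi.Task):
        """Step 1: write raw play events to a local target."""

        def output(self):
            return luigi.LocalTarget("plays_raw.txt")

        def run(self):
            with self.output().open("w") as out:
                out.write("song_a\nsong_b\nsong_a\n")

    class CountPlays(luigi.Task):
        """Step 2: read step 1's output and write a line count."""

        def requires(self):
            return ExtractPlays()

        def output(self):
            return luigi.LocalTarget("plays_count.txt")

        def run(self):
            with self.input().open("r") as raw, self.output().open("w") as out:
                out.write("%d\n" % len(raw.readlines()))

    if __name__ == "__main__":
        # local_scheduler avoids needing a central luigid for local testing.
        luigi.run(main_task_cls=CountPlays, local_scheduler=True)

Running the module executes ExtractPlays first, because CountPlays declares it in requires(); on re-runs, Luigi skips any step whose output target already exists.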

Break: 10 min

Segment 3: Write Spark job (1 hour)

  • Instructor will walk through writing a Scala Spark job and compiling a fat jar using Gradle.
  • Students will implement the Spark job from scratch and learn how to compile it to a jar for spark-submit (an illustrative sketch follows this list).
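
The workshop job itself is written in Scala and packaged as a fat jar with Gradle; purely to illustrate the general shape of a minimal Spark job, here is the same idea sketched in PySpark. The input path, column use, and output location are all hypothetical.

    # Illustrative only: the workshop writes this job in Scala and builds a
    # fat jar with Gradle. This PySpark sketch shows the same general shape:
    # read input, aggregate, write output. All paths here are hypothetical.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("song-plays").getOrCreate()

    # Read newline-delimited plays, one song id per line.
    plays = spark.read.text("plays_raw.txt")

    # Count plays per song and write the result out as JSON.
    counts = plays.groupBy("value").agg(F.count("*").alias("plays"))
    counts.write.mode("overwrite").json("plays_by_song")

    spark.stop()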

Break: 10 min

Segment 4: Putting it all together (30 min)

  • Instructor will help students get their assignment running end to end.
  • Students will see how their Luigi pipeline runs any dependencies and launches the Spark job for a given set of arguments (a sketch of this glue follows).
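
One common way to wire this glue is Luigi's built-in SparkSubmitTask. This is a sketch only: the jar path, entry class, and arguments are hypothetical, and the workshop's actual wiring may differ.

    # A sketch of gluing Luigi to Spark with luigi.contrib.spark. The jar
    # path, entry class, and arguments below are hypothetical.
    import luigi
    from luigi.contrib.spark import SparkSubmitTask

    class SongPlaysSparkJob(SparkSubmitTask):
        """Launches the compiled fat jar via spark-submit."""

        date = luigi.Parameter()

        # Where spark-submit finds the job and its entry point.
        app = "build/libs/song-plays-all.jar"
        entry_class = "com.example.SongPlays"
        master = "local[*]"

        def app_options(self):
            # Arguments passed through to the Spark job's main().
            return ["--date", self.date]

        def output(self):
            return luigi.LocalTarget("plays_by_song/%s" % self.date)

    if __name__ == "__main__":
        # Run the whole dependency graph locally for a given date.
        luigi.build([SongPlaysSparkJob(date="2019-01-01")], local_scheduler=True)

Because a SparkSubmitTask is an ordinary Luigi task, students' versions can declare requires() on upstream data-prep tasks, so running the final task pulls the whole chain and then hands off to spark-submit.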