Building Distributed Pipelines for Data Science Using Kafka, Spark, and Cassandra

Learn how to introduce a distributed data science pipeline in your organization

This event has ended.

What you’ll learn and how you can apply it

By the end of this course, you'll have a solid understanding of:

  • The most important technologies for a distributed pipeline, when they should be used—and how
  • How to integrate scalable technologies into your company’s existing data architecture
  • How to build a successful, scalable, elastic, distributed pipeline using a lean approach

This live event is for you because…

  • You’re a data scientist with experience in data modeling, business intelligence, or traditional data pipelines, and you need to handle bigger or faster data

  • You’re a software or data engineer with experience architecting solutions in Scala, Java, or Python, and you need to integrate scalable technologies into your company’s architecture


Prerequisites

  • Intermediate knowledge of an object-oriented language and basic knowledge of a functional programming language, as well as basic experience with a JVM

  • Understanding of classic web architecture and service-oriented architecture

  • Basic understanding of ETL, streaming data, and distributed data architectures

  • Intermediate understanding of Docker and UNIX, as well as some basic knowledge about networks (IP, DNS, SSH, etc.)


For the online training class, we'll be using a Docker-based environment as the simplest way to run most of the pipeline. This environment will be available as a single Docker image. Please click the link below and follow the setup instructions.

Recommended Preparation

Scala and the JVM as a big data platform: Lessons from Apache Spark

Architecture Patterns Part 1

Introduction to Big Data

Learning Docker

Learning DNS


Schedule

The timeframes below are estimates and may vary according to how the class is progressing.

Day 1

  • Introduction, Spark, Spark Notebook, and Kafka
  • Assignment #1

Day 2

  • Streaming: Spark, Kafka, and Cassandra
  • Data analysis and external libraries
  • Assignment #2

Day 3

  • Microservices, cluster management, job orchestration, and live demo of end-to-end distributed pipeline
  • Final discussion & wrap up

Your Instructors

  • Andy Petrella

    Andy is an entrepreneur with a background in mathematics and distributed data, focused on unlocking untapped business potential by leveraging new technologies in machine learning, artificial intelligence, and cognitive systems.

    In the data community, Andy is known as an early evangelist of Apache Spark (2011-), the Spark Notebook creator (2013-), a public speaker at various events (Spark Summit, Strata, Big Data Spain), and an O'Reilly author (Distributed Data Science, Data Lineage Essentials, Data Governance, and Machine Learning Model Monitoring).

    Andy is the CEO of Kensu, which provides the Data Intelligence Management (DIM) platform, combining AI observability with a data usage catalog to help data-driven companies leverage AI sustainably.

  • Xavier Tordoir

    Xavier Tordoir started his career as a researcher in experimental physics, focused on data processing. He has taken part in projects in finance, genomics, and software development for academic research, working on time series, prediction of biological molecular structures and interactions, and applied machine learning. He has developed solutions to manage and process data distributed across data centers.

    Xavier founded and works at Data Fellas, a company dedicated to distributed computing and advanced analytics, leveraging Scala, Spark, and other distributed technologies.
