Live Online Training

Techniques for Data Science with Big Datasets


Effective data science at scale with a new generation of open tools

Topic: Data
Adam Breindel

As Apache Spark matures and moves toward legacy status, data scientists face new opportunities and challenges in working with large-scale data without a single all-encompassing tool.

Join expert Adam Breindel to get hands-on practice with a suite of easier, more data-science-friendly tools that work for your team and let you build more future-proof systems. You’ll learn new tools for working with big datasets, like Dask and Ray; understand how to continue to make the best use of existing tools, like SparkSQL, Presto, and Kafka; and discover how to choose and combine the right tools for feature extraction, feature engineering, modeling, tuning, and other data science tasks.

What you'll learn and how you can apply it

By the end of this live online course, you’ll understand:

  • The world of Python-friendly large-scale data science tools including Dask and Ray
  • Areas where existing tools like SparkSQL, Presto, and Kafka are still critical
  • How to combine these tools and their data formats into sensible systems

And you’ll be able to:

  • Take on large-scale data science work with Python tools, even if your data is in Hive or Spark tables
  • Choose the right tools for feature extraction, feature engineering, modeling, and tuning
  • Cut through the hype and explain the strengths and weaknesses of key open source projects

This training course is for you because...

  • You’re a data scientist, team lead, manager, or architect in charge of data science projects at scale.
  • You’ve relied on Apache Spark or home-built solutions in the past and would like to learn or migrate to newer best-of-breed tools.
  • You’re already using a Python big data tool like Dask or Ray, and you’d like to fill in the gaps to cover more use cases.


  • A basic understanding of the data science lifecycle in enterprise settings
  • Familiarity with the challenges of big data systems (useful but not required)

Recommended preparation:

Recommended follow-up:

About your instructor

  • Adam Breindel consults and teaches courses on Apache Spark, data engineering, machine learning, AI, and deep learning. He supports instructional initiatives as a senior instructor at Databricks, has taught classes on Apache Spark and deep learning for O'Reilly, and runs a business helping large firms and startups implement data and ML architectures. Adam’s first full-time job in tech was neural net–based fraud detection, deployed at some of North America's largest banks; since then, he's worked with numerous startups, where he’s enjoyed building things like mobile check-in for two of America's five biggest airlines years before the iPhone came out. He’s also worked in entertainment, insurance, and retail banking; on web, embedded, and server apps; and on clustering architectures, APIs, and streaming analytics.


Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Welcome to large-scale data science in 2020! What’s changed? (15 minutes)

  • Presentation and group discussion: Data science in 2016—What to do with data? R, Python, and Spark; 2017—the broad rise of deep learning; 2018–2019—the decline of Hadoop/Spark for data science; 2020—new open tools and hybrid architectures

Interactive survey: Current architectures (10 minutes)

  • Presentation and discussion: Dataset size; data storage; SQL and extraction; feature engineering and modeling

Changing definition of large-scale data (10 minutes)

  • Presentation and discussion: The largest dataset for ML versus the largest dataset tractable on a single node; new definitions for small, medium, and “big” data

Obtaining data (feature extraction) (35 minutes)

  • Presentation and discussion: Legacy tools (e.g., Hive); current tools (e.g., SparkSQL, Presto); future tools (e.g., Kartothek, BlazingSQL); easiest compromises for working in the real world; working without SQL?
  • Hands-on exercise: Try SparkSQL
  • Q&A
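Whatever the engine, the feature-extraction pattern is the same: aggregate raw event rows into per-entity features with SQL. A minimal sketch of that pattern, using Python's built-in sqlite3 purely so it runs anywhere (the `events` table and its columns are invented for illustration); SparkSQL or Presto would run an equivalent query against warehouse-scale tables:

```python
import sqlite3

# Hypothetical events table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 10.0), (1, 30.0), (2, 5.0)],
)

# Aggregate raw events into per-user features (count and total spend).
features = conn.execute(
    "SELECT user_id, COUNT(*) AS n_events, SUM(amount) AS total "
    "FROM events GROUP BY user_id ORDER BY user_id"
).fetchall()

print(features)  # [(1, 2, 40.0), (2, 1, 5.0)]
```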

Feature engineering (40 minutes)

  • Presentation and demos: The DataFrame model; from pandas to Dask; the array model; Dask Array
  • Hands-on exercise: DataFrame and Array at scale
  • Q&A
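The core idea behind Dask's DataFrame and Array is split/apply/combine over partitions: each chunk produces a partial result, and the partials are merged. A stdlib-only sketch of that pattern (plain Python lists stand in for pandas partitions):

```python
def partition_stats(part):
    # Per-partition partial result: (sum, count).
    return sum(part), len(part)

def partitioned_mean(partitions):
    partials = [partition_stats(p) for p in partitions]  # "map" over chunks
    total = sum(s for s, _ in partials)                  # combine partials
    count = sum(n for _, n in partials)
    return total / count

partitions = [[1, 2, 3], [4, 5], [6]]   # stand-in for chunked, out-of-core data
print(partitioned_mean(partitions))     # 3.5
```

Dask applies this same decomposition to pandas DataFrames and NumPy arrays that don't fit in one machine's memory, scheduling the per-partition work across cores or a cluster.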

Break (10 minutes)

Modeling (55 minutes)

  • Presentation and demos: Unsupervised learning and dimensionality reduction with Dask, Ray, and Horovod; classic ML with Dask, XGBoost, and GPU acceleration; deep learning with RaySGD, Ray RLlib, or Horovod; simulations and agent-based models with Ray
  • Hands-on exercise: Modeling with Dask and Ray
  • Q&A
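Much of scale-out modeling reduces to running many independent training tasks concurrently. A stdlib sketch of that pattern using concurrent.futures; the `fit` function and toy datasets are invented stand-ins for what Ray remote tasks or Dask delayed calls would execute at cluster scale:

```python
from concurrent.futures import ThreadPoolExecutor

def fit(data):
    # Toy "training": least-squares slope for y = w * x (stand-in
    # for a real model-fitting task).
    return sum(x * y for x, y in data) / sum(x * x for x, _ in data)

datasets = [
    [(1, 2.0), (2, 4.0)],   # true slope 2
    [(1, 3.0), (2, 6.0)],   # true slope 3
]

# Run the independent fits concurrently, as a distributed scheduler would.
with ThreadPoolExecutor() as pool:
    weights = list(pool.map(fit, datasets))

print(weights)  # [2.0, 3.0]
```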

Break (10 minutes)

Model tuning (25 minutes)

  • Presentation and demos: RayTune—easy access to state-of-the-art scale-out tuning; tuning options and lite AutoML with Dask
  • Hands-on exercise: RayTune
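Tuning frameworks like Ray Tune distribute exactly this loop: sample hyperparameter configurations, evaluate an objective per trial, keep the best. A stdlib sketch of random search; the `objective` function is a toy stand-in for a real train-and-validate run:

```python
import random

def objective(lr, depth):
    # Toy score with a known peak at lr=0.1, depth=5.
    return -((lr - 0.1) ** 2) - ((depth - 5) ** 2)

random.seed(0)
trials = [
    {"lr": random.uniform(0.001, 1.0), "depth": random.randint(1, 10)}
    for _ in range(20)
]

# Keep the configuration with the best objective value.
best = max(trials, key=lambda t: objective(**t))
print(best)
```

What Ray Tune adds on top of this loop is parallel trial execution, early stopping, and smarter search algorithms than pure random sampling.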

Model scoring, online learning, and orchestration (30 minutes)

  • Presentation and demos: Batch, request/response, and streaming scoring; cached scoring; real versus “near real” online learning; Airflow and Prefect for orchestration
  • Q&A
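"Cached scoring" exploits the fact that many scoring requests repeat the same feature vector, so memoizing the model call avoids recomputation. A minimal sketch using the stdlib's lru_cache; `score` is a toy stand-in for a real model's predict function:

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def score(features):
    # Features must be hashable (e.g., a tuple) to be cacheable.
    return sum(features) / len(features)

print(score((1.0, 2.0, 3.0)))   # 2.0, computed
print(score((1.0, 2.0, 3.0)))   # 2.0, served from the cache
print(score.cache_info().hits)  # 1
```

The same idea applies in request/response serving with an external cache (keyed on the feature vector) in front of the model.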