Data Science with Big Datasets Using Dask, Ray, and Other Best-of-Breed Tools

Published by O'Reilly Media, Inc.

Content level: Intermediate

Effective data science at scale with a new generation of open tools

This live event utilizes Jupyter Notebook technology

As Apache Spark matures and moves towards legacy status, data scientists have new opportunities and challenges in dealing with large-scale data, without a single all-encompassing tool.

Join expert Adam Breindel to get hands-on practice with a suite of easier, more data-science-friendly tools that work for your team and let you build more future-proof systems. You’ll learn new tools for working with big datasets, like Dask and Ray; understand how to continue to make the best use of existing tools, like SparkSQL, Presto, and Kafka; and discover how to choose and combine the right tools for feature extraction, feature engineering, modeling, tuning, and other data science tasks.

APAC-friendly time.

What you’ll learn and how you can apply it

By the end of this live online course, you’ll understand:

  • The world of Python-friendly large-scale data science tools including Dask and Ray
  • Areas where existing tools like SparkSQL, Presto, and Kafka are still critical
  • How to combine these tools and their data formats into sensible systems

And you’ll be able to:

  • Take on large-scale data science work with Python tools, even if your data is in Hive or Spark tables
  • Choose the right tools for feature extraction, feature engineering, modeling, and tuning
  • Cut through the hype and explain the strengths and weaknesses of key open source projects

This live event is for you because...

  • You’re a data scientist, team lead, manager, or architect in charge of data science projects at scale.
  • You’ve relied on Apache Spark or home-built solutions in the past and would like to learn or migrate to newer best-of-breed tools.
  • You’re already using a Python big data tool like Dask or Ray, and you’d like to fill in the gaps to cover more use cases.

Prerequisites

  • A basic understanding of the data science lifecycle in enterprise settings
  • Familiarity with the challenges of big data systems (useful but not required)

Recommended preparation:

Recommended follow-up:

Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Welcome to large-scale data science in 2020! What’s changed? (15 minutes)

  • Presentation and group discussion: Data science in 2016—What to do with data? R, Python, and Spark; 2017—the broad rise of deep learning; 2018–2019—the decline of Hadoop/Spark for data science; 2020—new open tools and hybrid architectures

Interactive survey: Current architectures (10 minutes)

  • Presentation and discussion: Dataset size; data storage; SQL and extraction; feature engineering and modeling

Changing definition of large-scale data (10 minutes)

  • Presentation and discussion: The largest dataset for ML versus the largest dataset tractable on a single node; new definitions for small, medium, and “big” data

Obtaining data (feature extraction) (35 minutes)

  • Presentation and discussion: Legacy tools (e.g., Hive); current tools (e.g., SparkSQL, Presto); future tools (e.g., Kartothek, BlazingSQL); easiest compromises for working in the real world; working without SQL?
  • Hands-on exercise: Try SparkSQL
  • Q&A

Feature engineering (40 minutes)

  • Presentation and demos: The DataFrame model; from pandas to Dask; the array model; Dask Array
  • Hands-on exercise: DataFrame and Array at scale
  • Q&A

Break (10 minutes)

Modeling (55 minutes)

  • Presentation and demos: Unsupervised learning and dimensionality reduction with Dask, Ray, and Horovod; classic ML with Dask, XGBoost, and GPU acceleration; deep learning with RaySGD, Ray RLlib, or Horovod; simulations and agent-based models with Ray
  • Hands-on exercise: Modeling with Dask and Ray
  • Q&A

Break (10 minutes)

Model tuning (25 minutes)

  • Presentation and demos: RayTune—easy access to state-of-the-art scale-out tuning; tuning options and lite AutoML with Dask
  • Hands-on exercise: RayTune

Model scoring, online learning, and orchestration (30 minutes)

  • Presentation and demos: Batch, request/response, and streaming scoring; cached scoring; real versus “near real” online learning; Airflow and Prefect for orchestration
  • Q&A

Your Instructor

  • Adam Breindel

    Adam Breindel consults and teaches courses on Apache Spark, data engineering, machine learning, AI, and deep learning. He supports instructional initiatives as a senior instructor at Databricks, has taught classes on Apache Spark and deep learning for O’Reilly, and runs a business helping large firms and startups implement data and ML architectures. Adam’s first full-time job in tech was neural net–based fraud detection, deployed at North America’s largest banks; since then, he’s worked with numerous startups, where he’s enjoyed getting to build things like mobile check-in for two of America’s five biggest airlines years before the iPhone came out. He’s also worked in entertainment, insurance, and retail banking; on web, embedded, and server apps; and on clustering architectures, APIs, and streaming analytics.

Skills covered

  • Data Science
  • Dask