Data Science with Big Datasets Using Dask, Ray, and Other Best-of-Breed Tools
Published by O'Reilly Media, Inc.
Effective data science at scale with a new generation of open tools
As Apache Spark matures and moves towards legacy status, data scientists have new opportunities and challenges in dealing with large-scale data, without a single all-encompassing tool.
Join expert Adam Breindel to get hands-on practice with a suite of easier, more data-science-friendly tools that work for your team and let you build more future-proof systems. You’ll learn new tools for working with big datasets, like Dask and Ray; understand how to continue to make the best use of existing tools, like SparkSQL, Presto, and Kafka; and discover how to choose and combine the right tools for feature extraction, feature engineering, modeling, tuning, and other data science tasks.
APAC-friendly time.
What you’ll learn and how you can apply it
By the end of this live online course, you’ll understand:
- The world of Python-friendly large-scale data science tools including Dask and Ray
- Areas where existing tools like SparkSQL, Presto, and Kafka are still critical
- How to combine these tools and their data formats into sensible systems
And you’ll be able to:
- Take on large-scale data science work with Python tools, even if your data is in Hive or Spark tables
- Choose the right tools for feature extraction, feature engineering, modeling, and tuning
- Cut through the hype and explain the strengths and weaknesses of key open source projects
This live event is for you because...
- You’re a data scientist, team lead, manager, or architect in charge of data science projects at scale.
- You’ve relied on Apache Spark or home-built solutions in the past and would like to learn or migrate to newer best-of-breed tools.
- You’re already using a Python big data tool like Dask or Ray, and you’d like to fill in the gaps to cover more use cases.
Prerequisites
- A basic understanding of the data science lifecycle in enterprise settings
- Familiarity with the challenges of big data systems (useful but not required)
Recommended preparation:
- Read relevant sections of Kafka: The Definitive Guide, second edition and Presto: The Definitive Guide to close any gaps in your knowledge
- Think about the challenges you’ve faced with data science at scale and be ready to discuss any questions you have in class
Recommended follow-up:
- Take Kafka Fundamentals (live online training course with Petter Graff)
- Watch Meet the Expert: Dean Wampler on Scaling ML/AI Applications with Ray (video, 57m)
Schedule
The time frames are only estimates and may vary according to how the class is progressing.
Welcome to large-scale data science in 2020! What’s changed? (15 minutes)
- Presentation and group discussion: Data science in 2016—What to do with data? R, Python, and Spark; 2017—the broad rise of deep learning; 2018–2019—the decline of Hadoop/Spark for data science; 2020—new open tools and hybrid architectures
Interactive survey: Current architectures (10 minutes)
- Presentation and discussion: Dataset size; data storage; SQL and extraction; feature engineering and modeling
Changing definition of large-scale data (10 minutes)
- Presentation and discussion: The largest dataset for ML versus the largest dataset tractable on a single node; new definitions for small, medium, and “big” data
Obtaining data (feature extraction) (35 minutes)
- Presentation and discussion: Legacy tools (e.g., Hive); current tools (e.g., SparkSQL, Presto); future tools (e.g., Kartothek, BlazingSQL); easiest compromises for working in the real world; working without SQL?
- Hands-on exercise: Try SparkSQL
- Q&A
Feature engineering (40 minutes)
- Presentation and demos: The DataFrame model; from pandas to Dask; the array model; Dask Array
- Hands-on exercise: DataFrame and Array at scale
- Q&A
Break (10 minutes)
Modeling (55 minutes)
- Presentation and demos: Unsupervised learning and dimensionality reduction with Dask, Ray, and Horovod; classic ML with Dask, XGBoost, and GPU acceleration; deep learning with RaySGD, Ray RLlib, or Horovod; simulations and agent-based models with Ray
- Hands-on exercise: Modeling with Dask and Ray
- Q&A
Break (10 minutes)
Model tuning (25 minutes)
- Presentation and demos: RayTune—easy access to state-of-the-art scale-out tuning; tuning options and lite AutoML with Dask
- Hands-on exercise: RayTune
Model scoring, online learning, and orchestration (30 minutes)
- Presentation and demos: Batch, request/response, and streaming scoring; cached scoring; real versus “near real” online learning; Airflow and Prefect for orchestration
- Q&A
Your Instructor
Adam Breindel
Adam Breindel consults and teaches courses on Apache Spark, data engineering, machine learning, AI, and deep learning. He supports instructional initiatives as a senior instructor at Databricks, has taught classes on Apache Spark and deep learning for O'Reilly, and runs a business helping large firms and startups implement data and ML architectures. Adam’s first full-time job in tech was neural net–based fraud detection, deployed at North America's largest banks; since then, he's worked with numerous startups, where he’s enjoyed getting to build things like mobile check-in for two of America's five biggest airlines years before the iPhone came out. He’s also worked in entertainment, insurance, and retail banking; on web, embedded, and server apps; and on clustering architectures, APIs, and streaming analytics.
Skills covered
- Data Science
- Dask