Live online training

Scale your Python processing with Dask

Crunch big data easily in Python from a few cores to a few thousand machines

Adam Breindel

Python is arguably the preeminent language for data science, and the SciPy ecosystem enables hundreds of use cases, from astronomy to financial time series analysis to natural language processing. But most Python tools assume your data fits in memory, and many don’t support parallel execution, even though today we have far more data and far more compute power. It’s time to scale open source Python tools to huge datasets and huge compute clusters.

Expert Adam Breindel takes a deep dive into the open source Dask project, which supports scaling the Python data ecosystem in a straightforward and understandable way and works well on anything from single laptops to thousand-machine clusters. You can use Dask to scale pandas DataFrames, scikit-learn ML, NumPy tensor operations, and more, as well as implement lower-level, custom task scheduling for more unusual algorithms. Dask plays nice with all the toys you want—Kubernetes for scaling, GPUs for acceleration, Parquet for data ingestion, and Datashader for visualization.

What you'll learn and how you can apply it

By the end of this live, hands-on, online course, you’ll understand:

  • What Dask is and why it exists
  • How Dask fits into the Python and big data landscape
  • How Dask can help you process more data faster

And you’ll be able to:

  • Begin building systems with Dask
  • Add Dask and start incrementally migrating existing components
  • Analyze data and train ML models with Dask

This training course is for you because...

  • You’re a data engineer, data scientist, or natural or social scientist.
  • You work with Python and data.
  • You want to become a practitioner or leader who focuses on pragmatic, effective solutions.

Prerequisites

  • A basic understanding of Python and the Python data science stack (pandas, NumPy, and scikit-learn)

Recommended preparation:

Recommended follow-up:

About your instructor

  • Adam Breindel consults and teaches courses on Apache Spark, data engineering, machine learning, AI, and deep learning. He supports instructional initiatives as a senior instructor at Databricks, has taught classes on Apache Spark and deep learning for O'Reilly, and runs a business helping large firms and startups implement data and ML architectures. Adam’s first full-time job in tech was neural net–based fraud detection, deployed at North America's largest banks; since then, he's worked with numerous startups, where he’s enjoyed getting to build things like mobile check-in for two of America's five biggest airlines years before the iPhone came out. He’s also worked in entertainment, insurance, and retail banking; on web, embedded, and server apps; and on clustering architectures, APIs, and streaming analytics.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Introduction (55 minutes)

  • Lecture: What Dask is, where it’s from, and what problems it solves; pandas-style analytics with pandas and Dask DataFrames
  • Group discussion: Setting up and deploying Dask
  • Hands-on exercise: Complete an analytics exercise
  • Q&A

Break (5 minutes)

Dask graphical user interfaces (30 minutes)

  • Lecture: Monitoring workers, tasks, and memory; using Dask’s built-in profiling to understand performance
  • Group discussion: The biggest performance and troubleshooting challenges with big data
  • Hands-on exercise: Analyze the performance of data transformation
  • Q&A

Machine learning (25 minutes)

  • Lecture: Modeling task; scikit-learn-style featurization with Dask
  • Group discussion: Current algorithm support and integration
  • Hands-on exercise: Try an alternate model
  • Q&A

Break (5 minutes)

Additional data structure overview (25 minutes)

  • Lecture: Dask array; Dask Bag
  • Group discussion: What can we do with a Dask array?
  • Hands-on exercise: Look at lower-level task graph opportunities in the docs
  • Q&A
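
For a feel of the two data structures named in this segment, here is a minimal sketch, not taken from the course materials. It builds a chunked Dask array and a small Dask Bag; the shapes, chunk sizes, and values are arbitrary choices for illustration.

```python
import dask.array as da
import dask.bag as db

# A 1000 x 1000 array of ones, stored as a 4 x 4 grid of 250 x 250 NumPy chunks.
x = da.ones((1000, 1000), chunks=(250, 250))

# NumPy-style expressions are lazy until compute() is called.
mean_value = (x + x.T).mean().compute()

# A Bag holds generic Python objects and supports map/filter/reduce-style operations.
numbers = db.from_sequence(range(10), npartitions=2)
even_squares = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n)

# The synchronous scheduler keeps this tiny example in a single process.
total = even_squares.sum().compute(scheduler="synchronous")

print(x.numblocks, mean_value, total)
```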

Best practices (20 minutes)

  • Lecture: Managing partitions and tasks; caching
  • Group discussion: File formats and data structures

Wrap-up and Q&A (15 minutes)