Scaling Python with Dask

Book description

Modern systems contain multicore CPUs and GPUs that have the potential for parallel computing. But many scientific Python tools were not designed to leverage this parallelism. With this short but thorough resource, data scientists and Python programmers will learn how the Dask open source library for parallel computing provides APIs that make it easy to parallelize PyData libraries including NumPy, pandas, and scikit-learn.

Authors Holden Karau and Mika Kimmins show you how to use Dask computations in local systems and then scale to the cloud for heavier workloads. This practical book explains why Dask is popular among industry experts and academics and is used by organizations that include Walmart, Capital One, Harvard Medical School, and NASA.

With this book, you'll learn:

  • What Dask is, where you can use it, and how it compares with other tools
  • How to use Dask for batch data parallel processing
  • Key distributed system concepts for working with Dask
  • Methods for using Dask with higher-level APIs and building blocks
  • How to work with integrated libraries such as scikit-learn, pandas, and PyTorch
  • How to use Dask with GPUs

Table of contents

  1. What Is Dask?
    1. Why Do You Need Dask?
    2. Where Does Dask Fit in the Ecosystem?
      1. Big Data
      2. Data Science
      3. Parallel to Distributed Python
      4. Dask Community Libraries
    3. What Dask Is Not
    4. Conclusion
  2. Getting Started with Dask
    1. Installing Dask Locally
    2. Hello Worlds
      1. Task Hello World
      2. Distributed Collections
      3. Dask DataFrame (Pandas / What People Wish Big Data Was)
    3. Conclusion
  3. Understanding enough of how Dask works
    1. Execution Backends
      1. Local Backends
      2. Distributed (Dask Client and Scheduler)
    2. Dask’s Diagnostics User Interface
    3. Serialization and Pickling
    4. Partitioning / Chunking Collections
      1. Shuffles
      2. Partitions During Load
    5. Tasks, Graphs, and Lazy Evaluation
      1. Lazy Evaluation
      2. Visualize
      3. Intermediate Task Results
      4. Task Sizing
      5. Too large Task Graphs
      6. Combining Computation
      7. Persist, Caching & Memoization
    6. Fault Tolerance
    7. Conclusion
  4. Dask DataFrame
    1. How Dask DataFrames Are Built
    2. Loading and Writing
      1. Formats
      2. File Systems
    3. Indexing
    4. Shuffles
      1. Rolling windows and map_overlap
      2. Aggregations
      3. Full shuffles and partitioning
    5. Embarrassingly Parallel Operations
    6. Working with Multiple DataFrames
      1. Multi-DataFrame Internals
      2. Missing functionality
    7. What Does Not Work
    8. What’s Slower
    9. Handling Recursive Algorithms
    10. What other functions are different
    11. Data Science with Dask dataframe: putting it together
      1. Deciding to use Dask
      2. Exploratory Data Analysis with Dask
      3. Loading Data
      4. Plotting data
      5. Inspecting data
    12. Conclusion
  5. Dask’s Collections
    1. Dask Arrays
      1. Common Use Cases
      2. Times to not use Dask Arrays
      3. Loading / Saving
      4. What’s Missing
      5. Special Dask Functions
    2. Dask Bags
      1. Common Use Cases
      2. Loading and Saving Dask Bags
      3. Loading Messy Data With a Dask Bag
      4. Limitations
    3. Conclusion
  6. Advanced Task Scheduling: Futures and Friends
    1. Lazy and Eager Evaluation Revisited
    2. Use cases for futures
    3. Launching Futures
    4. Future Lifecycle
    5. Fire and Forget
    6. Retrieving Results
    7. Nested Futures
    8. Distributed Data Structures for Scheduling
    9. Conclusion
  7. Adding Changeable/Mutable State with Dask Actors
    1. What is the Actor Model?
    2. Dask Actors
      1. Your First Actor (it’s a bank account)
      2. Scaling Dask actors
      3. Limitations
    3. When to use Dask Actors
    4. Conclusion
  8. About the Authors

Product information

  • Title: Scaling Python with Dask
  • Author(s): Holden Karau, Mika Kimmins
  • Release date: July 2023
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098119874