Scaling Python with Dask

Book description

Modern systems contain multi-core CPUs and GPUs that have the potential for parallel computing. But many scientific Python tools were not designed to leverage this parallelism. With this short but thorough resource, data scientists and Python programmers will learn how the Dask open source library for parallel computing provides APIs that make it easy to parallelize PyData libraries including NumPy, pandas, and scikit-learn.

Authors Holden Karau and Mika Kimmins show you how to use Dask computations in local systems and then scale to the cloud for heavier workloads. This practical book explains why Dask is popular among industry experts and academics and is used by organizations that include Walmart, Capital One, Harvard Medical School, and NASA.

With this book, you'll learn:

  • What Dask is, where you can use it, and how it compares with other tools
  • How to use Dask for batch data parallel processing
  • Key distributed system concepts for working with Dask
  • Methods for using Dask with higher-level APIs and building blocks
  • How to work with integrated libraries such as scikit-learn, pandas, and PyTorch
  • How to use Dask with GPUs

Publisher resources

View/Submit Errata

Table of contents

  1. Preface
    1. A Note on Responsibility
    2. Conventions Used in This Book
    3. Online Figures
    4. License
    5. Using Code Examples
    6. O’Reilly Online Learning
    7. How to Contact Us
    8. Acknowledgments
  2. 1. What Is Dask?
    1. Why Do You Need Dask?
    2. Where Does Dask Fit in the Ecosystem?
      1. Big Data
      2. Data Science
      3. Parallel to Distributed Python
      4. Dask Community Libraries
    3. What Dask Is Not
    4. Conclusion
  3. 2. Getting Started with Dask
    1. Installing Dask Locally
    2. Hello Worlds
      1. Task Hello World
      2. Distributed Collections
      3. Dask DataFrame (Pandas/What People Wish Big Data Was)
    3. Conclusion
  4. 3. How Dask Works: The Basics
    1. Execution Backends
      1. Local Backends
      2. Distributed (Dask Client and Scheduler)
    2. Dask’s Diagnostics User Interface
    3. Serialization and Pickling
    4. Partitioning/Chunking Collections
      1. Dask Arrays
      2. Dask Bags
      3. Dask DataFrames
      4. Shuffles
      5. Partitions During Load
    5. Tasks, Graphs, and Lazy Evaluation
      1. Lazy Evaluation
      2. Task Dependencies
      3. visualize
      4. Intermediate Task Results
      5. Task Sizing
      6. When Task Graphs Get Too Large
      7. Combining Computation
      8. Persist, Caching, and Memoization
    6. Fault Tolerance
    7. Conclusion
  5. 4. Dask DataFrame
    1. How Dask DataFrames Are Built
    2. Loading and Writing
      1. Formats
      2. Filesystems
    3. Indexing
    4. Shuffles
      1. Rolling Windows and map_overlap
      2. Aggregations
      3. Full Shuffles and Partitioning
    5. Embarrassingly Parallel Operations
    6. Working with Multiple DataFrames
      1. Multi-DataFrame Internals
      2. Missing Functionality
    7. What Does Not Work
    8. What’s Slower
    9. Handling Recursive Algorithms
    10. Re-computed Data
    11. How Other Functions Are Different
    12. Data Science with Dask DataFrame: Putting It Together
      1. Deciding to Use Dask
      2. Exploratory Data Analysis with Dask
      3. Loading Data
      4. Plotting Data
      5. Inspecting Data
    13. Conclusion
  6. 5. Dask’s Collections
    1. Dask Arrays
      1. Common Use Cases
      2. When Not to Use Dask Arrays
      3. Loading/Saving
      4. What’s Missing
      5. Special Dask Functions
    2. Dask Bags
      1. Common Use Cases
      2. Loading and Saving Dask Bags
      3. Loading Messy Data with a Dask Bag
      4. Limitations
    3. Conclusion
  7. 6. Advanced Task Scheduling: Futures and Friends
    1. Lazy and Eager Evaluation Revisited
    2. Use Cases for Futures
    3. Launching Futures
    4. Future Life Cycle
    5. Fire-and-Forget
    6. Retrieving Results
    7. Nested Futures
    8. Conclusion
  8. 7. Adding Changeable/Mutable State with Dask Actors
    1. What Is the Actor Model?
    2. Dask Actors
      1. Your First Actor (It’s a Bank Account)
      2. Scaling Dask Actors
      3. Limitations
    3. When to Use Dask Actors
    4. Conclusion
  9. 8. How to Evaluate Dask’s Components and Libraries
    1. Qualitative Considerations for Project Evaluation
      1. Project Priorities
      2. Community
      3. Dask-Specific Best Practices
      4. Up-to-Date Dependencies
      5. Documentation
      6. Openness to Contributions
      7. Extensibility
    2. Quantitative Metrics for Open Source Project Evaluation
      1. Release History
      2. Commit Frequency (and Volume)
      3. Library Usage
      4. Code and Best Practices
    3. Conclusion
  10. 9. Migrating Existing Analytic Engineering
    1. Why Dask?
    2. Limitations of Dask
    3. Migration Road Map
      1. Types of Clusters
      2. Development: Considerations
      3. Deployment Monitoring
    4. Conclusion
  11. 10. Dask with GPUs and Other Special Resources
    1. Transparent Versus Non-transparent Accelerators
    2. Understanding Whether GPUs or TPUs Can Help
    3. Making Dask Resource-Aware
    4. Installing the Libraries
    5. Using Custom Resources Inside Your Dask Tasks
      1. Decorators (Including Numba)
      2. GPUs
    6. GPU Acceleration Built on Top of Dask
      1. cuDF
      2. BlazingSQL
      3. cuStreamz
    7. Freeing Accelerator Resources
    8. Design Patterns: CPU Fallback
    9. Conclusion
  12. 11. Machine Learning with Dask
    1. Parallelizing ML
    2. When to Use Dask-ML
    3. Getting Started with Dask-ML and XGBoost
      1. Feature Engineering
      2. Model Selection and Training
      3. When There Is No Dask-ML Equivalent
      4. Use with Dask’s joblib
      5. XGBoost with Dask
    4. ML Models with Dask-SQL
    5. Inference and Deployment
      1. Distributing Data and Models Manually
      2. Large-Scale Inferences with Dask
    6. Conclusion
  13. 12. Productionizing Dask: Notebooks, Deployment, Tuning, and Monitoring
    1. Factors to Consider in a Deployment Option
    2. Building Dask on a Kubernetes Deployment
    3. Dask on Ray
    4. Dask on YARN
    5. Dask on High-Performance Computing
      1. Setting Up Dask in a Remote Cluster
      2. Connecting a Local Machine to an HPC Cluster
    6. Dask JupyterLab Extension and Magics
      1. Installing JupyterLab Extensions
      2. Launching Clusters
      3. UI
      4. Watching Progress
    7. Understanding Dask Performance
      1. Metrics in Distributed Computing
      2. The Dask Dashboard
      3. Saving and Sharing Dask Metrics/Performance Logs
      4. Advanced Diagnostics
    8. Scaling and Debugging Best Practices
      1. Manual Scaling
      2. Adaptive/Auto-scaling
      3. Persist and Delete Costly Data
      4. Dask Nanny
      5. Worker Memory Management
      6. Cluster Sizing
      7. Chunking, Revisited
      8. Avoid Rechunking
    9. Scheduled Jobs
    10. Deployment Monitoring
    11. Conclusion
  14. A. Key System Concepts for Dask Users
    1. Testing
      1. Manual Testing
      2. Unit Testing
      3. Integration Testing
      4. Test-Driven Development
      5. Property Testing
      6. Working with Notebooks
      7. Out-of-Notebook Testing
      8. In-Notebook Testing: In-Line Assertions
    2. Data and Output Validation
    3. Peer-to-Peer Versus Centralized Distributed
    4. Methods of Parallelism
      1. Task Parallelism
      2. Data Parallelism
      3. Load Balancing
    5. Network Fault Tolerance and CAP Theorem
    6. Recursion (Tail and Otherwise)
    7. Versioning and Branching: Code and Data
    8. Isolation and Noisy Neighbors
    9. Machine Fault Tolerance
    10. Scalability (Up and Down)
    11. Cache, Memory, Disk, and Networking: How the Performance Changes
    12. Hashing
    13. Data Locality
    14. Exactly Once Versus At Least Once
    15. Conclusion
  15. B. Scalable DataFrames: A Comparison and Some History
    1. Tools
      1. One Machine Only
      2. Distributed
    2. Conclusion
  16. C. Debugging Dask
    1. Using Debuggers
    2. General Debugging Tips with Dask
    3. Native Errors
    4. Some Notes on Official Advice for Handling Bad Records
    5. Dask Diagnostics
    6. Conclusion
  17. D. Streaming with Streamz and Dask
    1. Getting Started with Streamz on Dask
    2. Streaming Data Sources and Sinks
    3. Word Count
    4. GPU Pipelines on Dask Streaming
    5. Limitations, Challenges, and Workarounds
    6. Conclusion
  18. Index
  19. About the Authors

Product information

  • Title: Scaling Python with Dask
  • Author(s): Holden Karau, Mika Kimmins
  • Release date: July 2023
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098119874