Chapter 10. Dask with GPUs and Other Special Resources

Sometimes the answer to our scaling problem isn’t throwing more computers at it; it’s throwing different types of resources at it. One example of this might be ten thousand monkeys trying to reproduce the works of Shakespeare, versus one Shakespeare.1 While performance varies, some benchmarks have shown up to an 85% improvement in model training times when using GPUs instead of CPUs. Continuing Dask’s modular tradition, its GPU logic is found in the libraries and ecosystem surrounding it. These libraries can either run on a collection of GPU workers or parallelize work across different GPUs on a single host.
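For instance, the dask-cuda project provides a LocalCUDACluster that starts one Dask worker per GPU on the local machine. The following is a minimal sketch, assuming the dask-cuda package is installed and at least one NVIDIA GPU is visible on this host:

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Start one Dask worker per visible GPU on this host, parallelizing
# work across local GPUs rather than across a cluster of machines.
cluster = LocalCUDACluster()
client = Client(cluster)
```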

Most work we do on the computer is done on the CPU. GPUs were created for rendering graphics, which requires performing large numbers of vectorized floating-point (i.e., non-integer) operations. With vectorized operations, the same operation is applied in parallel across large sets of data, much like a map. Tensor Processing Units (TPUs) are similar to GPUs, except that they are not also used for graphics.
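As a sketch of what a vectorized operation looks like, the snippet below applies the same arithmetic to every element of an array at once. It assumes the CuPy library is installed and a GPU is available; the same code runs on the CPU if you swap in NumPy:

```python
import cupy as cp

# Build a large array of 32-bit floats on the GPU.
x = cp.arange(1_000_000, dtype=cp.float32)

# A single multiply and add are applied to every element in parallel,
# much like mapping a function over the whole array.
y = 2.0 * x + 1.0
```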

For our purposes, in Dask, we can think of GPUs and TPUs as specializing in offloading large vectorized computations, but there are many other kinds of accelerators. While much of this chapter focuses on GPUs, the same general techniques, albeit with different libraries, apply to other accelerators as well. Other kinds of specialized resources include NVMe drives, faster (or larger) RAM, TCP/IP offload, Just-a-Bunch-of-Disks expansion ports, and ...
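One way Dask can target specialized resources like these is through worker resource annotations, where a worker advertises a named resource and tasks request it. The following is a minimal sketch, assuming a worker was started with `dask-worker scheduler-address:8786 --resources "GPU=1"`; the scheduler address and the function are hypothetical placeholders:

```python
from dask.distributed import Client

# Hypothetical scheduler address; adjust for your deployment.
client = Client("scheduler-address:8786")

def train_on_gpu(data):
    # Placeholder for GPU-accelerated work.
    return sum(data)

# Ask the scheduler to run this task only on workers advertising
# at least one unit of the "GPU" resource.
future = client.submit(train_on_gpu, list(range(10)), resources={"GPU": 1})
result = future.result()
```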
