Scaling Python with Dask

by Holden Karau, Mika Kimmins

July 2023

Intermediate to advanced

223 pages

5h 24m

English

O'Reilly Media, Inc.

Preface
A Note on Responsibility; Conventions Used in This Book; Online Figures; License; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
1. What Is Dask?
Why Do You Need Dask?; Where Does Dask Fit in the Ecosystem?; Big Data; Data Science; Parallel to Distributed Python; Dask Community Libraries; What Dask Is Not; Conclusion
2. Getting Started with Dask
Installing Dask Locally; Hello Worlds; Task Hello World; Distributed Collections; Dask DataFrame (Pandas/What People Wish Big Data Was); Conclusion
3. How Dask Works: The Basics
Execution Backends; Local Backends; Distributed (Dask Client and Scheduler); Dask’s Diagnostics User Interface; Serialization and Pickling; Partitioning/Chunking Collections; Dask Arrays; Dask Bags; Dask DataFrames; Shuffles; Partitions During Load; Tasks, Graphs, and Lazy Evaluation; Lazy Evaluation; Task Dependencies; visualize; Intermediate Task Results; Task Sizing; When Task Graphs Get Too Large; Combining Computation; Persist, Caching, and Memoization; Fault Tolerance; Conclusion
4. Dask DataFrame
How Dask DataFrames Are Built; Loading and Writing; Formats; Filesystems; Indexing; Shuffles; Rolling Windows and map_overlap; Aggregations; Full Shuffles and Partitioning; Embarrassingly Parallel Operations; Working with Multiple DataFrames; Multi-DataFrame Internals; Missing Functionality; What Does Not Work; What’s Slower; Handling Recursive Algorithms; Re-computed Data; How Other Functions Are Different; Data Science with Dask DataFrame: Putting It Together; Deciding to Use Dask; Exploratory Data Analysis with Dask; Loading Data; Plotting Data; Inspecting Data; Conclusion
5. Dask’s Collections
Dask Arrays; Common Use Cases; When Not to Use Dask Arrays; Loading/Saving; What’s Missing; Special Dask Functions; Dask Bags; Common Use Cases; Loading and Saving Dask Bags; Loading Messy Data with a Dask Bag; Limitations; Conclusion
6. Advanced Task Scheduling: Futures and Friends
Lazy and Eager Evaluation Revisited; Use Cases for Futures; Launching Futures; Future Life Cycle; Fire-and-Forget; Retrieving Results; Nested Futures; Conclusion
7. Adding Changeable/Mutable State with Dask Actors
What Is the Actor Model?; Dask Actors; Your First Actor (It’s a Bank Account); Scaling Dask Actors; Limitations; When to Use Dask Actors; Conclusion
8. How to Evaluate Dask’s Components and Libraries
Qualitative Considerations for Project Evaluation; Project Priorities; Community; Dask-Specific Best Practices; Up-to-Date Dependencies; Documentation; Openness to Contributions; Extensibility; Quantitative Metrics for Open Source Project Evaluation; Release History; Commit Frequency (and Volume); Library Usage; Code and Best Practices; Conclusion
9. Migrating Existing Analytic Engineering
Why Dask?; Limitations of Dask; Migration Road Map; Types of Clusters; Development: Considerations; Deployment Monitoring; Conclusion
10. Dask with GPUs and Other Special Resources
Transparent Versus Non-transparent Accelerators; Understanding Whether GPUs or TPUs Can Help; Making Dask Resource-Aware; Installing the Libraries; Using Custom Resources Inside Your Dask Tasks; Decorators (Including Numba); GPUs; GPU Acceleration Built on Top of Dask; cuDF; BlazingSQL; cuStreamz; Freeing Accelerator Resources; Design Patterns: CPU Fallback; Conclusion
11. Machine Learning with Dask
Parallelizing ML; When to Use Dask-ML; Getting Started with Dask-ML and XGBoost; Feature Engineering; Model Selection and Training; When There Is No Dask-ML Equivalent; Use with Dask’s joblib; XGBoost with Dask; ML Models with Dask-SQL; Inference and Deployment; Distributing Data and Models Manually; Large-Scale Inferences with Dask; Conclusion
12. Productionizing Dask: Notebooks, Deployment, Tuning, and Monitoring
Factors to Consider in a Deployment Option; Building Dask on a Kubernetes Deployment; Dask on Ray; Dask on YARN; Dask on High-Performance Computing; Setting Up Dask in a Remote Cluster; Connecting a Local Machine to an HPC Cluster; Dask JupyterLab Extension and Magics; Installing JupyterLab Extensions; Launching Clusters; UI; Watching Progress; Understanding Dask Performance; Metrics in Distributed Computing; The Dask Dashboard; Saving and Sharing Dask Metrics/Performance Logs; Advanced Diagnostics; Scaling and Debugging Best Practices; Manual Scaling; Adaptive/Auto-scaling; Persist and Delete Costly Data; Dask Nanny; Worker Memory Management; Cluster Sizing; Chunking, Revisited; Avoid Rechunking; Scheduled Jobs; Deployment Monitoring; Conclusion
A. Key System Concepts for Dask Users
Testing; Manual Testing; Unit Testing; Integration Testing; Test-Driven Development; Property Testing; Working with Notebooks; Out-of-Notebook Testing; In-Notebook Testing: In-Line Assertions; Data and Output Validation; Peer-to-Peer Versus Centralized Distributed; Methods of Parallelism; Task Parallelism; Data Parallelism; Load Balancing; Network Fault Tolerance and CAP Theorem; Recursion (Tail and Otherwise); Versioning and Branching: Code and Data; Isolation and Noisy Neighbors; Machine Fault Tolerance; Scalability (Up and Down); Cache, Memory, Disk, and Networking: How the Performance Changes; Hashing; Data Locality; Exactly Once Versus At Least Once; Conclusion
B. Scalable DataFrames: A Comparison and Some History
Tools; One Machine Only; Distributed; Conclusion
C. Debugging Dask
Using Debuggers; General Debugging Tips with Dask; Native Errors; Some Notes on Official Advice for Handling Bad Records; Dask Diagnostics; Conclusion
D. Streaming with Streamz and Dask
Getting Started with Streamz on Dask; Streaming Data Sources and Sinks; Word Count; GPU Pipelines on Dask Streaming; Limitations, Challenges, and Workarounds; Conclusion
Index
About the Authors

Content preview from Scaling Python with Dask

Chapter 5. Dask’s Collections

So far you’ve seen the basics of how Dask is built as well as how Dask uses these building blocks to support data science with DataFrames. This chapter explores where Dask’s bag and array interfaces, which are often overlooked relative to DataFrames, are more appropriate. As mentioned in “Hello Worlds”, Dask bags implement common functional APIs, and Dask arrays implement a subset of the NumPy ndarray interface.
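As a rough illustration of the functional style that bags offer, the sketch below counts words with `map` and `frequencies`. The word list and variable names are made up for this example, and the synchronous scheduler is chosen here only so the snippet runs in a single process; it is not a recommendation from the book.

```python
import dask.bag as db

# Hypothetical input data, used only for this illustration.
words = db.from_sequence(
    ["dask", "bag", "array", "dask", "bag", "dask"],
    npartitions=2,
)

# Bags expose functional operations such as map, filter, and frequencies.
# Nothing is computed until compute() is called.
upper_counts = words.map(str.upper).frequencies()

# The synchronous scheduler keeps everything in one process, which is an
# assumption made to keep the example easy to run anywhere.
counts = dict(upper_counts.compute(scheduler="synchronous"))
print(counts)
```

Each element of the result is an `(item, count)` pair, which is why it converts directly to a dictionary.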

Tip

Understanding partitioning is important for understanding collections. If you skipped “Partitioning/Chunking Collections”, now is a good time to head back and take a look.

Dask Arrays

Dask arrays implement a subset of the NumPy ndarray interface, making them ideal for porting code that uses NumPy to run on Dask. Much of your understanding from the previous chapter with DataFrames carries over to Dask arrays, as well as much of your understanding of ndarrays.
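To make the porting claim concrete, here is a minimal sketch that runs the same expression on a NumPy array and on a Dask array built from it. The array shape, chunk sizes, and variable names are assumptions chosen for illustration, not values from the book.

```python
import numpy as np
import dask.array as da

# NumPy version: evaluated eagerly, entirely in memory.
np_data = np.arange(1_000_000, dtype="float64").reshape(1000, 1000)
np_result = (np_data - np_data.mean(axis=0)).std()

# Dask version: the same expression, but the array is split into
# 250x250 chunks (an arbitrary choice here) and evaluated lazily.
da_data = da.from_array(np_data, chunks=(250, 250))
lazy_result = (da_data - da_data.mean(axis=0)).std()

# compute() triggers execution of the task graph.
da_result = lazy_result.compute()

print(da_data.numblocks)
print(np.isclose(np_result, da_result))
```

The expression itself is unchanged between the two versions; the differences are how the array is created and the explicit `compute()` call at the end.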

Common Use Cases

Some common use cases for Dask arrays include:

Large-scale imaging and astronomy data
Weather data
Multi-dimensional data

As with Dask DataFrames and pandas, if you wouldn’t use a NumPy ndarray for the problem at a smaller scale, a Dask array may not be the right solution.

When Not to Use Dask Arrays

If your data fits in memory on a single computer, using Dask arrays is unlikely to give you many benefits over NumPy ndarrays, especially compared to local accelerators like Numba. Numba is well suited to vectorizing and parallelizing local tasks with and without Graphics Processing Units ...

ISBN: 9781098119867