Chapter 9. Migrating Existing Analytic Engineering
Many users already have analytic work deployed that they want to migrate to Dask. This chapter discusses the considerations, challenges, and experiences of users making the switch. The main migration pathway explored in this chapter is moving an existing big data engineering job from another distributed framework, such as Spark, to Dask.
Why Dask?
Here are some reasons to consider migrating an existing job implemented in pandas, or in a distributed library like PySpark, to Dask:
- Python and PyData stack: Many data scientists and developers prefer a Python-native stack, where they don't have to switch between languages or styles.
- Richer ML integrations with Dask APIs: Futures, delayed, and the ML integrations built on them require less glue code for the developer to maintain, and Dask's more flexible task graph management can bring performance improvements (see the first sketch after this list).
- Fine-grained task management: Dask's task graph is generated and maintained at runtime, and users can access the task dictionary synchronously (the second sketch after this list shows this).
- Debugging overhead: Some developer teams prefer the debugging experience in Python, as opposed to wading through mixed Python and Java/Scala stacktraces.
- Development overhead: Development with Dask can be done easily on the developer's laptop, with no need to connect to a powerful cloud machine to experiment (the third sketch after this list shows a local setup).
- Management UX: Dask visualization ...
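To make the glue-code point concrete, here is a minimal sketch of the delayed and futures APIs working together. The function and file names (load_partition, clean, summarize, a.csv, b.csv) are hypothetical placeholders for illustration, not part of any real pipeline:

```python
import dask
from dask.distributed import Client

client = Client()  # starts a local cluster by default

@dask.delayed
def load_partition(path):
    # Stand-in for any existing pandas-based loading logic.
    import pandas as pd
    return pd.read_csv(path)

@dask.delayed
def clean(df):
    return df.dropna()

@dask.delayed
def summarize(dfs):
    import pandas as pd
    return pd.concat(dfs).describe()

# Build the task graph lazily; nothing runs yet.
parts = [clean(load_partition(p)) for p in ["a.csv", "b.csv"]]
summary = summarize(parts)

# Futures: submit arbitrary Python functions directly to the cluster.
future = client.submit(sum, [1, 2, 3])
print(future.result())  # 6

result = summary.compute()  # executes the whole delayed graph
```

Notice that the delayed functions are plain Python; there is no serialization boilerplate or separate driver/executor code to maintain.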
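The fine-grained task management claim can be seen directly: every Dask collection exposes its task graph as an ordinary Python mapping that you can inspect synchronously. A small sketch:

```python
import dask.array as da

# Build a lazy computation; no work happens yet.
x = da.ones((1000, 1000), chunks=(250, 250))
y = (x + x.T).sum()

# The task graph behind y is a plain mapping of keys to tasks,
# available at any point during development.
graph = y.__dask_graph__()
print(len(graph))       # total number of tasks
print(list(graph)[:3])  # a few of the task keys
```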
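Finally, for the development-overhead point: a bare Client() starts a local cluster on the laptop, and the same script can later target a remote cluster just by passing a scheduler address. The file and column names below (events.csv, user, value) are made up for illustration:

```python
from dask.distributed import Client
import dask.dataframe as dd

# With no arguments, Client() spins up an in-process LocalCluster
# using the laptop's cores; swap in Client("tcp://scheduler:8786")
# to run the identical code against a remote cluster.
client = Client()
print(client.dashboard_link)  # diagnostics dashboard in the browser

# Hypothetical CSV with "user" and "value" columns.
df = dd.read_csv("events.csv")
print(df.groupby("user").value.mean().compute())
```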