
Scaling Python with Dask

by Holden Karau, Mika Kimmins

July 2023

Intermediate to advanced

223 pages

5h 24m

English

O'Reilly Media, Inc.


Preface
A Note on Responsibility; Conventions Used in This Book; Online Figures; License; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
1. What Is Dask?
Why Do You Need Dask?; Where Does Dask Fit in the Ecosystem?; Big Data; Data Science; Parallel to Distributed Python; Dask Community Libraries; What Dask Is Not; Conclusion
2. Getting Started with Dask
Installing Dask Locally; Hello Worlds; Task Hello World; Distributed Collections; Dask DataFrame (Pandas/What People Wish Big Data Was); Conclusion
3. How Dask Works: The Basics
Execution Backends; Local Backends; Distributed (Dask Client and Scheduler); Dask’s Diagnostics User Interface; Serialization and Pickling; Partitioning/Chunking Collections; Dask Arrays; Dask Bags; Dask DataFrames; Shuffles; Partitions During Load; Tasks, Graphs, and Lazy Evaluation; Lazy Evaluation; Task Dependencies; visualize; Intermediate Task Results; Task Sizing; When Task Graphs Get Too Large; Combining Computation; Persist, Caching, and Memoization; Fault Tolerance; Conclusion
4. Dask DataFrame
How Dask DataFrames Are Built; Loading and Writing; Formats; Filesystems; Indexing; Shuffles; Rolling Windows and map_overlap; Aggregations; Full Shuffles and Partitioning; Embarrassingly Parallel Operations; Working with Multiple DataFrames; Multi-DataFrame Internals; Missing Functionality; What Does Not Work; What’s Slower; Handling Recursive Algorithms; Re-computed Data; How Other Functions Are Different; Data Science with Dask DataFrame: Putting It Together; Deciding to Use Dask; Exploratory Data Analysis with Dask; Loading Data; Plotting Data; Inspecting Data; Conclusion
5. Dask’s Collections
Dask Arrays; Common Use Cases; When Not to Use Dask Arrays; Loading/Saving; What’s Missing; Special Dask Functions; Dask Bags; Common Use Cases; Loading and Saving Dask Bags; Loading Messy Data with a Dask Bag; Limitations; Conclusion
6. Advanced Task Scheduling: Futures and Friends
Lazy and Eager Evaluation Revisited; Use Cases for Futures; Launching Futures; Future Life Cycle; Fire-and-Forget; Retrieving Results; Nested Futures; Conclusion
7. Adding Changeable/Mutable State with Dask Actors
What Is the Actor Model?; Dask Actors; Your First Actor (It’s a Bank Account); Scaling Dask Actors; Limitations; When to Use Dask Actors; Conclusion
8. How to Evaluate Dask’s Components and Libraries
Qualitative Considerations for Project Evaluation; Project Priorities; Community; Dask-Specific Best Practices; Up-to-Date Dependencies; Documentation; Openness to Contributions; Extensibility; Quantitative Metrics for Open Source Project Evaluation; Release History; Commit Frequency (and Volume); Library Usage; Code and Best Practices; Conclusion
9. Migrating Existing Analytic Engineering
Why Dask?; Limitations of Dask; Migration Road Map; Types of Clusters; Development: Considerations; Deployment Monitoring; Conclusion

10. Dask with GPUs and Other Special Resources
Transparent Versus Non-transparent Accelerators; Understanding Whether GPUs or TPUs Can Help; Making Dask Resource-Aware; Installing the Libraries; Using Custom Resources Inside Your Dask Tasks; Decorators (Including Numba); GPUs; GPU Acceleration Built on Top of Dask; cuDF; BlazingSQL; cuStreamz; Freeing Accelerator Resources; Design Patterns: CPU Fallback; Conclusion
11. Machine Learning with Dask
Parallelizing ML; When to Use Dask-ML; Getting Started with Dask-ML and XGBoost; Feature Engineering; Model Selection and Training; When There Is No Dask-ML Equivalent; Use with Dask’s joblib; XGBoost with Dask; ML Models with Dask-SQL; Inference and Deployment; Distributing Data and Models Manually; Large-Scale Inferences with Dask; Conclusion
12. Productionizing Dask: Notebooks, Deployment, Tuning, and Monitoring
Factors to Consider in a Deployment Option; Building Dask on a Kubernetes Deployment; Dask on Ray; Dask on YARN; Dask on High-Performance Computing; Setting Up Dask in a Remote Cluster; Connecting a Local Machine to an HPC Cluster; Dask JupyterLab Extension and Magics; Installing JupyterLab Extensions; Launching Clusters; UI; Watching Progress; Understanding Dask Performance; Metrics in Distributed Computing; The Dask Dashboard; Saving and Sharing Dask Metrics/Performance Logs; Advanced Diagnostics; Scaling and Debugging Best Practices; Manual Scaling; Adaptive/Auto-scaling; Persist and Delete Costly Data; Dask Nanny; Worker Memory Management; Cluster Sizing; Chunking, Revisited; Avoid Rechunking; Scheduled Jobs; Deployment Monitoring; Conclusion
A. Key System Concepts for Dask Users
Testing; Manual Testing; Unit Testing; Integration Testing; Test-Driven Development; Property Testing; Working with Notebooks; Out-of-Notebook Testing; In-Notebook Testing: In-Line Assertions; Data and Output Validation; Peer-to-Peer Versus Centralized Distributed; Methods of Parallelism; Task Parallelism; Data Parallelism; Load Balancing; Network Fault Tolerance and CAP Theorem; Recursion (Tail and Otherwise); Versioning and Branching: Code and Data; Isolation and Noisy Neighbors; Machine Fault Tolerance; Scalability (Up and Down); Cache, Memory, Disk, and Networking: How the Performance Changes; Hashing; Data Locality; Exactly Once Versus At Least Once; Conclusion
B. Scalable DataFrames: A Comparison and Some History
Tools; One Machine Only; Distributed; Conclusion
C. Debugging Dask
Using Debuggers; General Debugging Tips with Dask; Native Errors; Some Notes on Official Advice for Handling Bad Records; Dask Diagnostics; Conclusion
D. Streaming with Streamz and Dask
Getting Started with Streamz on Dask; Streaming Data Sources and Sinks; Word Count; GPU Pipelines on Dask Streaming; Limitations, Challenges, and Workarounds; Conclusion
Index
About the Authors

Content preview from Scaling Python with Dask

Chapter 1. What Is Dask?

Dask is a framework for parallelized computing with Python that scales from multiple cores on one machine to data centers with thousands of machines. It has both low-level task APIs and higher-level data-focused APIs. The low-level task APIs power Dask’s integration with a wide variety of Python libraries. Having public APIs has allowed an ecosystem of tools to grow around Dask for various use cases.
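To make the distinction between the two API levels concrete, here is a minimal sketch, not taken from the book, assuming Dask is installed locally. It uses `dask.delayed` as one example of the low-level task API and `dask.bag` as one example of a higher-level collection. The helper functions `square` and `add` are hypothetical names chosen for illustration.

```python
import dask
import dask.bag as db


# Hypothetical helper functions, wrapped so that calls build tasks lazily.
@dask.delayed
def square(x):
    return x * x


@dask.delayed
def add(a, b):
    return a + b


# Low-level task API: calling the wrapped functions only records a task
# graph; nothing runs until compute() is called.
lazy_total = add(square(3), square(4))
task_result = lazy_total.compute()

# Higher-level, data-focused API: a Bag is a partitioned collection whose
# operations are turned into tasks for you.
numbers = db.from_sequence(range(10), npartitions=2)
# scheduler="synchronous" keeps this demo in a single process.
bag_result = numbers.map(lambda n: n * n).sum().compute(scheduler="synchronous")

print(task_result, bag_result)
```

In both cases the work is described first and executed only when `compute()` is called. The collection API simply generates the task graph on your behalf.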

Continuum Analytics, now known as Anaconda, Inc., started the open source, DARPA-funded Blaze project, which evolved into Dask. Continuum has participated in developing many essential libraries and even conferences in the Python data analytics space. Dask remains an open source project, with much of its development now being supported by Coiled.

Dask is unique in the distributed computing ecosystem, because it integrates popular data science, parallel, and scientific computing libraries. Dask’s integration of different libraries allows developers to reuse much of their existing knowledge at scale. They can also frequently reuse some of their code with minimal changes.

Why Do You Need Dask?

Dask simplifies scaling analytics, ML, and other code written in Python,¹ allowing you to handle larger and more complex data and problems. Dask aims to fill the space where your existing tools, like pandas DataFrames, or your scikit-learn machine learning pipelines start to become too slow (or do not succeed). While the term “big data” is perhaps less in vogue now than ...


Publisher Resources

ISBN: 9781098119867
