
Scaling Python with Dask

by Holden Karau, Mika Kimmins

July 2023

Intermediate to advanced

223 pages

5h 24m

English

O'Reilly Media, Inc.


Preface
A Note on Responsibility; Conventions Used in This Book; Online Figures; License; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
1. What Is Dask?
Why Do You Need Dask?; Where Does Dask Fit in the Ecosystem?; Big Data; Data Science; Parallel to Distributed Python; Dask Community Libraries; What Dask Is Not; Conclusion
2. Getting Started with Dask
Installing Dask Locally; Hello Worlds; Task Hello World; Distributed Collections; Dask DataFrame (Pandas/What People Wish Big Data Was); Conclusion
3. How Dask Works: The Basics
Execution Backends; Local Backends; Distributed (Dask Client and Scheduler); Dask’s Diagnostics User Interface; Serialization and Pickling; Partitioning/Chunking Collections; Dask Arrays; Dask Bags; Dask DataFrames; Shuffles; Partitions During Load; Tasks, Graphs, and Lazy Evaluation; Lazy Evaluation; Task Dependencies; visualize; Intermediate Task Results; Task Sizing; When Task Graphs Get Too Large; Combining Computation; Persist, Caching, and Memoization; Fault Tolerance; Conclusion
4. Dask DataFrame
How Dask DataFrames Are Built; Loading and Writing; Formats; Filesystems; Indexing; Shuffles; Rolling Windows and map_overlap; Aggregations; Full Shuffles and Partitioning; Embarrassingly Parallel Operations; Working with Multiple DataFrames; Multi-DataFrame Internals; Missing Functionality; What Does Not Work; What’s Slower; Handling Recursive Algorithms; Re-computed Data; How Other Functions Are Different; Data Science with Dask DataFrame: Putting It Together; Deciding to Use Dask; Exploratory Data Analysis with Dask; Loading Data; Plotting Data; Inspecting Data; Conclusion
5. Dask’s Collections
Dask Arrays; Common Use Cases; When Not to Use Dask Arrays; Loading/Saving; What’s Missing; Special Dask Functions; Dask Bags; Common Use Cases; Loading and Saving Dask Bags; Loading Messy Data with a Dask Bag; Limitations; Conclusion
6. Advanced Task Scheduling: Futures and Friends
Lazy and Eager Evaluation Revisited; Use Cases for Futures; Launching Futures; Future Life Cycle; Fire-and-Forget; Retrieving Results; Nested Futures; Conclusion
7. Adding Changeable/Mutable State with Dask Actors
What Is the Actor Model?; Dask Actors; Your First Actor (It’s a Bank Account); Scaling Dask Actors; Limitations; When to Use Dask Actors; Conclusion
8. How to Evaluate Dask’s Components and Libraries
Qualitative Considerations for Project Evaluation; Project Priorities; Community; Dask-Specific Best Practices; Up-to-Date Dependencies; Documentation; Openness to Contributions; Extensibility; Quantitative Metrics for Open Source Project Evaluation; Release History; Commit Frequency (and Volume); Library Usage; Code and Best Practices; Conclusion
9. Migrating Existing Analytic Engineering
Why Dask?; Limitations of Dask; Migration Road Map; Types of Clusters; Development: Considerations; Deployment Monitoring; Conclusion

10. Dask with GPUs and Other Special Resources
Transparent Versus Non-transparent Accelerators; Understanding Whether GPUs or TPUs Can Help; Making Dask Resource-Aware; Installing the Libraries; Using Custom Resources Inside Your Dask Tasks; Decorators (Including Numba); GPUs; GPU Acceleration Built on Top of Dask; cuDF; BlazingSQL; cuStreamz; Freeing Accelerator Resources; Design Patterns: CPU Fallback; Conclusion
11. Machine Learning with Dask
Parallelizing ML; When to Use Dask-ML; Getting Started with Dask-ML and XGBoost; Feature Engineering; Model Selection and Training; When There Is No Dask-ML Equivalent; Use with Dask’s joblib; XGBoost with Dask; ML Models with Dask-SQL; Inference and Deployment; Distributing Data and Models Manually; Large-Scale Inferences with Dask; Conclusion
12. Productionizing Dask: Notebooks, Deployment, Tuning, and Monitoring
Factors to Consider in a Deployment Option; Building Dask on a Kubernetes Deployment; Dask on Ray; Dask on YARN; Dask on High-Performance Computing; Setting Up Dask in a Remote Cluster; Connecting a Local Machine to an HPC Cluster; Dask JupyterLab Extension and Magics; Installing JupyterLab Extensions; Launching Clusters; UI; Watching Progress; Understanding Dask Performance; Metrics in Distributed Computing; The Dask Dashboard; Saving and Sharing Dask Metrics/Performance Logs; Advanced Diagnostics; Scaling and Debugging Best Practices; Manual Scaling; Adaptive/Auto-scaling; Persist and Delete Costly Data; Dask Nanny; Worker Memory Management; Cluster Sizing; Chunking, Revisited; Avoid Rechunking; Scheduled Jobs; Deployment Monitoring; Conclusion
A. Key System Concepts for Dask Users
Testing; Manual Testing; Unit Testing; Integration Testing; Test-Driven Development; Property Testing; Working with Notebooks; Out-of-Notebook Testing; In-Notebook Testing: In-Line Assertions; Data and Output Validation; Peer-to-Peer Versus Centralized Distributed; Methods of Parallelism; Task Parallelism; Data Parallelism; Load Balancing; Network Fault Tolerance and CAP Theorem; Recursion (Tail and Otherwise); Versioning and Branching: Code and Data; Isolation and Noisy Neighbors; Machine Fault Tolerance; Scalability (Up and Down); Cache, Memory, Disk, and Networking: How the Performance Changes; Hashing; Data Locality; Exactly Once Versus At Least Once; Conclusion
B. Scalable DataFrames: A Comparison and Some History
Tools; One Machine Only; Distributed; Conclusion
C. Debugging Dask
Using Debuggers; General Debugging Tips with Dask; Native Errors; Some Notes on Official Advice for Handling Bad Records; Dask Diagnostics; Conclusion
D. Streaming with Streamz and Dask
Getting Started with Streamz on Dask; Streaming Data Sources and Sinks; Word Count; GPU Pipelines on Dask Streaming; Limitations, Challenges, and Workarounds; Conclusion
Index
About the Authors

Content preview from Scaling Python with Dask

Chapter 4. Dask DataFrame

Pandas DataFrames, while popular, quickly run into memory constraints as data sizes grow, since they store the entirety of the data in memory. Pandas DataFrames have a robust API for all kinds of data manipulation and are frequently the starting point for analytics and machine learning projects. While pandas itself does not have machine learning built in, data scientists often use it for data and feature preparation during the exploratory phase of new projects. As such, scaling pandas DataFrames to handle large datasets is vitally important to many data scientists. Most data scientists are already familiar with the pandas library, and Dask’s DataFrame implements much of the pandas API while adding the ability to scale.

Dask was one of the first projects to implement a usable subset of the pandas APIs, but other projects, such as Spark, have since added their own approaches. This chapter assumes you have a good understanding of the pandas DataFrame APIs; if not, you should check out the book Python for Data Analysis.

You can often use Dask DataFrames as a replacement for pandas DataFrames with minor changes, thanks to duck-typing. However, this approach can have performance drawbacks, and some functions are not present. These drawbacks come from the distributed parallel nature of Dask, which adds communication costs for certain types of operations. In this chapter, you will learn how to minimize these performance drawbacks and work around any missing functionality. ...



Publisher Resources

ISBN: 9781098119867
