Chapter 1. What Is Dask?
Dask is a framework for parallelized computing with Python that scales from multiple cores on one machine to data centers with thousands of machines. It has both low-level task APIs and higher-level data-focused APIs. The low-level task APIs power Dask’s integration with a wide variety of Python libraries. Having public APIs has allowed an ecosystem of tools to grow around Dask for various use cases.
Continuum Analytics, now known as Anaconda Inc., started the open source, DARPA-funded Blaze project, which has evolved into Dask. Continuum has helped develop many essential libraries, and even conferences, in the Python data analytics space. Dask remains an open source project, with much of its development now supported by Coiled.
Dask is unique in the distributed computing ecosystem, because it integrates popular data science, parallel, and scientific computing libraries. Dask’s integration of different libraries allows developers to reuse much of their existing knowledge at scale. They can also frequently reuse some of their code with minimal changes.
Why Do You Need Dask?
Dask simplifies scaling analytics, ML, and other code written in Python,1 allowing you to handle larger and more complex data and problems. Dask aims to fill the space where your existing tools, like pandas DataFrames or scikit-learn machine learning pipelines, start to become too slow (or fail outright). While the term “big data” is perhaps less in vogue now than a few years ago, the data sizes have not gotten smaller, and the computations and models have not gotten simpler. Dask lets you keep using mostly the interfaces you are already used to (such as pandas and multi-processing) while going beyond the scale of a single core or even a single machine.
Note
On the other hand, if all your data fits in memory on a laptop, and you can finish your analysis before you’ve had a chance to brew a cup of your favorite warm beverage, you probably don’t need Dask yet.
Where Does Dask Fit in the Ecosystem?
Dask provides scalability to multiple, traditionally distinct tools. It is most often used to scale Python data libraries like pandas and NumPy. Dask extends existing tools for scaling, such as multi-processing, allowing them to exceed their current limits of single machines to multi-core and multi-machine. The following provides a quick look at the ecosystem evolution:
- Early “big data” query: Apache Hadoop and Apache Hive
- Later “big data” query: Apache Flink and Apache Spark
- DataFrame-focused distributed tools: Koalas, Ray, and Dask
From an abstraction point of view, Dask sits above the machines and cluster management tools, allowing you to focus on Python code instead of the intricacies of machine-to-machine communication:
- Scalable data and ML tools: Hadoop, Hive, Flink, Spark, TensorFlow, Koalas, Ray, Dask, etc.
- Compute resources: Apache Hadoop YARN, Kubernetes, Amazon Web Services, Slurm Workload Manager, etc.
We say a problem is compute-bound if the limiting factor is not the amount of data but rather the work we are doing on the data. Memory-bound problems are problems in which the computation is not the limiting factor; rather, the ability to store all the data in memory is the limiting factor. Some problems can be both compute-bound and memory-bound, as is often the case for large deep-learning problems.
Multi-core (think multi-threading) processing can help with compute-bound problems (up to the limit of the number of cores in a machine). Generally, multi-core processing is unable to help with memory-bound problems, as all Central Processing Units (CPUs) have similar access to the memory.2
Accelerated processing, including the use of specialized instruction sets or specialized hardware like Tensor Processing Units or Graphics Processing Units, is generally useful only for compute-bound problems. Sometimes using accelerated processing introduces memory-bound problems, as the amount of memory available to the accelerated computation can be smaller than the “main” system memory.
Multi-machine processing is important for both classes of problems. Since the number of cores you can get in a machine (affordably) is limited, even if a problem is “only” compute bound at certain scales, you will need to consider multi-machine processing. More commonly, memory-bound problems are a good fit for multi-machine scaling, as Dask can often split the data between the different machines.
Dask has both multi-core and multi-machine scaling, allowing you to scale your Python code as you see fit.
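To make this concrete, here is a minimal sketch of switching between Dask’s local schedulers and a distributed cluster. The file glob and scheduler address are hypothetical; only the Dask calls themselves come from the library.

```python
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# Build a lazy computation; nothing runs until compute() is called.
ddf = dd.read_csv("transactions-*.csv")  # hypothetical input files
total = ddf["amount"].sum()

# Multi-core on one machine: run with the local threaded scheduler.
print(total.compute(scheduler="threads"))

# Still on one box, but cluster-style: a local "cluster" of worker processes.
client = Client(LocalCluster(n_workers=4, threads_per_worker=2))
print(total.compute())  # now executed by the local cluster's workers

# On a real cluster you would instead connect to an existing scheduler,
# e.g. Client("tcp://scheduler-address:8786"), and the code stays the same.
```

The same lazy graph runs unchanged in each case; only the scheduler behind `compute()` differs.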
Much of Dask’s power comes from the tools and libraries built on top of it, which fit into their respective parts of the data processing ecosystem (such as BlazingSQL). Your background and interests will naturally shape how you first view Dask, so in the following subsections we’ll briefly discuss how you can use Dask for different types of problems, as well as how it compares to some existing tools.
Big Data
Dask has better Python library integrations and lower overhead for tasks than many alternatives. Apache Spark (and its Python companion, PySpark) is one of the most popular tools for big data. Existing big data tools, such as PySpark, have more data sources and optimizers (like predicate push-down) but higher overhead per task. Dask’s lower overhead stems mainly from the fact that much of the rest of the big data ecosystem is built on top of the JVM. Those tools offer advanced features such as query optimizers, but at the cost of copying data between the JVM and Python.
Unlike many other traditional big data tools, such as Spark and Hadoop, Dask considers local mode a first-class citizen. The traditional big data ecosystem treats local mode mostly as a testing convenience, whereas Dask also aims for good performance when running on a single node.
Another significant cultural difference comes from packaging, with many projects in big data putting everything together (for example, Spark SQL, Spark Kubernetes, and so on are released together). Dask takes a more modular approach, with its components following their own development and release cadence. Dask’s approach can iterate faster, at the cost of occasional incompatibilities between libraries.
Data Science
One of the most popular Python libraries in the data science ecosystem is pandas. Apache Spark (and its Python companion, PySpark) is also one of the most popular tools for distributed data science. It has support for both Python and JVM languages. Spark’s first attempt at DataFrames more closely resembled SQL than what you may think of as DataFrames. While Spark has started to integrate pandas support with the Koalas project, Dask’s support of data science library APIs is best in class, in our opinion.3 In addition to the pandas APIs, Dask supports scaling NumPy, scikit-learn, and other data science tools.
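As a quick illustration, the following sketch mirrors a pandas-style aggregation with Dask DataFrames; the file names and columns are made up for the example.

```python
import dask.dataframe as dd

# Reads lazily, in partitions, instead of loading everything into memory.
ddf = dd.read_csv("events-*.csv")  # hypothetical input files

# The same groupby/aggregation style you would write in pandas.
daily_counts = ddf.groupby("event_date")["event_id"].count()

# Work happens only when you ask for the result.
print(daily_counts.compute())
```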
Note
Dask can be extended to support data types besides NumPy and pandas, and this is how GPU support is implemented with cuDF.
Parallel to Distributed Python
Parallel computing refers to running multiple operations at the same time, and distributed computing extends this to running those operations on multiple machines. Parallel Python encompasses a wide variety of tools, ranging from multi-processing to Celery.4 Dask gives you the ability to specify an arbitrary graph of dependent tasks and execute them in parallel. Under the hood, this execution can either be backed by a single machine (with threads or processes) or be distributed across multiple workers.
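A minimal sketch of such a graph with `dask.delayed` follows; the functions are toy stand-ins for real loading and processing steps.

```python
from dask import delayed

@delayed
def load(part):
    # Stand-in for reading one chunk of data.
    return list(range(part * 10, part * 10 + 10))

@delayed
def clean(chunk):
    # Stand-in for per-chunk processing.
    return [x for x in chunk if x % 2 == 0]

@delayed
def combine(chunks):
    # Depends on every cleaned chunk; Dask tracks those edges for us.
    return sum(len(c) for c in chunks)

cleaned = [clean(load(i)) for i in range(4)]
result = combine(cleaned)

# Runs with local threads here; the same graph can run on a distributed cluster.
print(result.compute(scheduler="threads"))
```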
Note
Many big data tools have similar low-level task APIs, but they are internal and are not exposed for our use or protected against failures.
Dask Community Libraries
Dask’s true power comes from the ecosystem built around it. Different libraries are built on top of Dask, giving you the ability to use multiple tools in the same framework. These community libraries are so powerful in part because of the combination of low-level and high-level APIs that are available for more than just first-party development.
Accelerated Python
You can accelerate Python in a few different ways, ranging from code generation (such as with Numba) to libraries for specialized hardware, such as NVIDIA’s CUDA (and wrappers like cuDF), AMD’s ROCm, and Intel’s MKL.
Dask itself is not a library for accelerated Python, but you can use it in conjunction with accelerated Python tools. For ease of use, some community projects integrate acceleration tools, such as cuDF and dask-cuda, with Dask. When using accelerated Python tools with Dask, you’ll need to be careful to structure your code to avoid serialization errors (see “Serialization and Pickling”).
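As a sketch (assuming a machine with NVIDIA GPUs and the dask-cuda and dask-cudf packages installed), pairing Dask with cuDF might look like the following; the input files are hypothetical.

```python
from dask_cuda import LocalCUDACluster  # assumes dask-cuda is installed
from dask.distributed import Client
import dask_cudf  # assumes dask-cudf (RAPIDS) is installed

# One worker per visible GPU, so each partition stays on its own device.
client = Client(LocalCUDACluster())

# A cuDF-backed Dask DataFrame; the pandas-like API runs on the GPUs.
gdf = dask_cudf.read_csv("events-*.csv")  # hypothetical input files
print(gdf.groupby("event_type").size().compute())
```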
Note
Accelerated Python libraries tend to use more “native” memory structures, which are not as easily handled by pickle.
SQL engines
Dask itself does not have a SQL engine; however, FugueSQL, Dask-SQL, and BlazingSQL use Dask to provide a distributed SQL engine.5 Dask-SQL uses the popular Apache Calcite project, which powers many other SQL engines. BlazingSQL extends Dask DataFrames to support GPU operations; the underlying cuDF DataFrames have a slightly different representation, but Apache Arrow makes it straightforward to convert a Dask DataFrame to cuDF and vice versa.
Dask allows these different SQL engines to scale both memory- and compute-wise, handling larger data sizes than fit in memory on a single computer and processing rows on multiple computers. Dask also powers the important aggregation step of combining the results from the different machines into a cohesive view of the data.
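A small sketch with Dask-SQL’s `Context` is shown below (assuming the dask-sql package is installed); the table and column names are invented for the example.

```python
import pandas as pd
import dask.dataframe as dd
from dask_sql import Context  # assumes the dask-sql package is installed

# Register a partitioned Dask DataFrame as a SQL table.
ddf = dd.from_pandas(
    pd.DataFrame({"region": ["eu", "us", "eu"], "amount": [10, 20, 30]}),
    npartitions=2,
)
c = Context()
c.create_table("sales", ddf)

# The query is planned by Dask-SQL and executed as ordinary Dask operations.
result = c.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
print(result.compute())
```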
Tip
Dask-SQL can read data from parts of the Hadoop ecosystem that Dask cannot read from (e.g., Hive).
Workflow scheduling
Most organizations have the need for some kind of scheduled work, from programs that run at specific times (such as those that calculate end-of-day or end-of-month financials) to programs that run in response to events. These events can be things like data becoming available (such as after the daily financials are run) or a new email coming in, or they can be user triggered. In the simplest case the scheduled work can be a single program, but it is often more complex than that.
As mentioned previously, you can specify arbitrary graphs in Dask, and if you chose to, you could write your workflows using Dask itself. You can call system commands and parse their results, but just because you can do something doesn’t mean it will be fun or simple.
The household name6 for workflow scheduling in the big data ecosystem is Apache Airflow. While Airflow has a wonderful collection of operators, making it easy to express complex task types, it is notoriously difficult to scale.7 Dask can be used to run Airflow tasks. Alternatively, it can be used as a backend for other task scheduling systems like Prefect. Prefect aims to bring Airflow-like functionality to Dask with a large predefined task library. Since Prefect used Dask as an execution backend from the start, it has a tighter integration and lower overhead than Airflow on Dask.
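As a sketch of that tighter integration (assuming Prefect 2.x and the prefect-dask package), a flow can hand its tasks to Dask via the DaskTaskRunner; the tasks here are toy examples.

```python
from prefect import flow, task
from prefect_dask import DaskTaskRunner  # assumes prefect-dask is installed

@task
def extract():
    return list(range(10))

@task
def transform(x):
    return x * 2

@flow(task_runner=DaskTaskRunner())  # submitted tasks run on a Dask cluster
def pipeline():
    data = extract.submit().result()
    futures = [transform.submit(x) for x in data]
    return [f.result() for f in futures]

if __name__ == "__main__":
    print(pipeline())
```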
Note
Few tools cover all of the same areas, with the most similar tool being Ray. Dask and Ray both expose Python APIs, with underlying extensions when needed. There is a GitHub issue where the creators of both systems compare their similarities and differences. From a systems perspective, the biggest differences between Ray and Dask are handling state, fault tolerance, and centralized versus decentralized scheduling. Ray implements more of its logic in C++, which can have performance benefits but is also more difficult to read. From a user point of view, Dask has more of a data science focus, and Ray emphasizes distributed state and actor support. Dask can use Ray as a backend for scheduling.8
What Dask Is Not
While Dask is many things, it is not a magic wand you wave over your code to make it faster. There are places where Dask has largely compatible drop-in APIs, but misusing them can result in slower execution. Dask is not a code rewriting or just-in-time (JIT) tool; instead, Dask allows you to scale these tools to run on clusters. Dask focuses on Python and may not be the right tool for scaling languages not tightly integrated with Python (such as Go). Dask does not have built-in catalog support (e.g., Hive or Iceberg), so reading and writing data from tables stored with the catalogs can pose a challenge.
Conclusion
Dask is one of the possible options for scaling your analytical Python code. It covers various deployment options, from multiple cores on a single computer to data centers. Dask takes a modular approach compared to many other tools in similar spaces, which means that taking the time to understand the ecosystem and libraries around it is essential. The right choice for scaling your software depends on your code as well as on the ecosystem, data consumers, and data sources for your project. We hope we’ve convinced you that it’s worth the time to play with Dask a bit, which you will do in the next chapter.
1 Not all Python code, however; for example, Dask would be a bad choice for scaling a web server, which is highly stateful because of its web socket needs.
2 With the exception of non-uniform memory access (NUMA) systems.
3 Of course, opinions vary. See, for example, “Single Node Processing — Spark, Dask, Pandas, Modin, Koalas Vol. 1”, “Benchmark: Koalas (PySpark) and Dask”, and “Spark vs. Dask vs. Ray”.
4 Celery, often used for background job management, is an asynchronous task queue that can also split up and distribute work. But it is at a lower level than Dask and does not have the same high-level conveniences as Dask.
5 BlazingSQL is no longer maintained, though its concepts are interesting and may find life in another project.
6 Assuming a fairly nerdy household.
7 With one thousand tasks per hour taking substantial tuning and manual consideration; see “Scaling Airflow to 1000 Tasks/Hour”.
8 Or, flipping the perspective, Ray is capable of using Dask to provide data science functionality.