Chapter 9. Migrating Existing Analytic Engineering
Many users already have analytic work deployed that they want to migrate to Dask. This chapter discusses the considerations, challenges, and experiences of users making the switch. The main migration pathway explored in this chapter is moving an existing big data engineering job from another distributed framework, such as Spark, to Dask.
Why Dask?
Here are some reasons to consider migrating an existing job implemented in pandas, or in a distributed library like PySpark, to Dask:
- Python and PyData stack: Many data scientists and developers prefer a Python-native stack, where they don't have to switch between languages or styles.
- Richer ML integrations with Dask APIs: Futures, delayed, and the ML integrations built on them require less glue code for the developer to maintain, and Dask's more flexible task graph management can bring performance improvements (see the first sketch after this list).
- Fine-grained task management: Dask's task graph is generated and maintained at runtime, and users can access the task dictionary synchronously (the second sketch after this list shows this).
- Debugging overhead: Some developer teams prefer the debugging experience in Python, as opposed to wading through mixed Python and Java/Scala stacktraces.
- Development overhead: Development with Dask can be done easily on the developer's laptop, with no need to connect to a powerful cloud machine to experiment (the third sketch after this list shows a local setup).
- Management UX: Dask visualization ...
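To make the glue-code point concrete, here is a minimal sketch of the delayed and futures APIs working together. The function and file names (load_partition, clean, summarize, a.csv, b.csv) are hypothetical placeholders for illustration, not part of any real pipeline:

```python
import dask
from dask.distributed import Client

client = Client()  # starts a local cluster by default

@dask.delayed
def load_partition(path):
    # Stand-in for any existing pandas-based loading logic.
    import pandas as pd
    return pd.read_csv(path)

@dask.delayed
def clean(df):
    return df.dropna()

@dask.delayed
def summarize(dfs):
    import pandas as pd
    return pd.concat(dfs).describe()

# Build the task graph lazily; nothing runs yet.
parts = [clean(load_partition(p)) for p in ["a.csv", "b.csv"]]
summary = summarize(parts)

# Futures: submit arbitrary Python functions directly to the cluster.
future = client.submit(sum, [1, 2, 3])
print(future.result())  # 6

result = summary.compute()  # executes the whole delayed graph
```

Notice that the delayed functions are plain Python; there is no serialization boilerplate or separate driver/executor code to maintain.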
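The fine-grained task management claim can be seen directly: every Dask collection exposes its task graph as an ordinary Python mapping that you can inspect synchronously. A small sketch:

```python
import dask.array as da

# Build a lazy computation; no work happens yet.
x = da.ones((1000, 1000), chunks=(250, 250))
y = (x + x.T).sum()

# The task graph behind y is a plain mapping of keys to tasks,
# available at any point during development.
graph = y.__dask_graph__()
print(len(graph))       # total number of tasks
print(list(graph)[:3])  # a few of the task keys
```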
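Finally, for the development-overhead point: a bare Client() starts a local cluster on the laptop, and the same script can later target a remote cluster just by passing a scheduler address. The file and column names below (events.csv, user, value) are made up for illustration:

```python
from dask.distributed import Client
import dask.dataframe as dd

# With no arguments, Client() spins up an in-process LocalCluster
# using the laptop's cores; swap in Client("tcp://scheduler:8786")
# to run the identical code against a remote cluster.
client = Client()
print(client.dashboard_link)  # diagnostics dashboard in the browser

# Hypothetical CSV with "user" and "value" columns.
df = dd.read_csv("events.csv")
print(df.groupby("user").value.mean().compute())
```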