Preface
We wrote this book for data scientists and data engineers who are familiar with Python and pandas and want to handle larger-scale problems than their current tooling allows. Current PySpark users will find that some of this material overlaps with what they already know, but we hope they still find it helpful, and not just for getting away from the Java Virtual Machine (JVM).
If you are not familiar with Python, some excellent O’Reilly titles include Learning Python and Python for Data Analysis. If you and your team work more often in JVM languages (such as Java or Scala), we’d encourage you, with a bit of bias, to check out Apache Spark along with Learning Spark (O’Reilly) and High Performance Spark (O’Reilly).
This book focuses primarily on data science and related tasks because, in our opinion, that is where Dask excels. If you have a more general problem for which Dask does not seem quite the right fit, we would (with a bit of bias again) encourage you to check out Scaling Python with Ray (O’Reilly), which has less of a data science focus.
A Note on Responsibility
As the saying goes, with great power comes great responsibility. Dask and tools like it enable you to process more data and build more complex models. It’s essential not to get carried away with collecting data simply for the sake of it, and to stop to ask yourself if including a new field in your model might have some unintended real-world implications. You don’t ...