Chapter 9. Advanced Data with Ray

Despite, or perhaps because of, data ecosystems’ rapid advances, you will likely end up needing to use multiple tools as part of your data pipeline. Ray Datasets allows data sharing among tools in the data and ML ecosystems. This allows you to switch tools without having to copy or move data. Ray Datasets supports Spark, Modin, Dask, and Mars and can also be used with ML tools like TensorFlow. You can also use Arrow with Ray to allow more tools to work on top of Datasets, such as R or even MATLAB. Ray Datasets act as a common format for all steps of your ML pipeline, simplifying legacy pipelines.

It all boils down to this: you can use the same dataset in multiple tools without worrying about the details. Internally, many of these tools have their own formats, but Ray and Arrow manage the translations transparently.

In addition to simplifying your use of different tools, Ray also has a growing collection of built-in operations for Datasets. These built-in operations are being actively developed and are not intended to be as full-featured as those of the data tools built on top of Ray.


As covered in “Ray Objects”, Ray Datasets’ default behavior may be different than you expect. You can enable object recovery by setting enable_object_reconstruction=True in ray.init to make Ray Datasets more resilient.

Ray Datasets continues to be an area of active development, including large feature additions between minor releases, and more functionality likely ...

Get Scaling Python with Ray now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.