Appendix B. Scalable DataFrames: A Comparison and Some History

Dask’s distributed pandas-like DataFrame is, in our opinion, one of its key features. Various approaches exist to provide scalable DataFrame-like functionality. One of the big things that made Dask’s DataFrames stand out is the high level of support of the pandas APIs, which other projects are rapidly trying to catch up on. This appendix compares some of the different current and historical DataFrame libraries.

To understand the differences, we will look at a few key factors, some of which are similar to techniques we suggest in Chapter 8. The first one is what the API looks like, and how much of your existing skills and code using pandas can be transferred. Then we’ll look at how much work is forced to happen on a single thread, on the driver/head node, and then on a single worker node.

Scalable DataFrames does not have to mean distributed, although distributed scaling often allows for affordable handling of larger datasets than the single-machine options—and at truly massive scales, it’s the only practical option.

Tools

One of the common dependencies you’ll see in many of the tools is that they are built on top of ASF Arrow. While Arrow is a fantastic project, and we hope to see its continued adoption, it has some type differences, especially with respect to nullability.1 These differences mean that most of the systems built using Arrow share some common restrictions.

Open Multi-Processing (OpenMP) and Open Message ...

Get Scaling Python with Dask now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.