Dask Bag and DataFrame

Dask provides other data structures for automatic generation of computation graphs. In this subsection, we'll take a look at dask.bag.Bag, a generic collection of elements that can be used to code MapReduce-style algorithms, and dask.dataframe.DataFrame, a distributed version of pandas.DataFrame.

A Bag can be easily created from a Python collection. For example, you can create a Bag from a list using the from_sequence factory function. The level of parallelism can be specified using the npartitions argument (this will distribute the Bag content into a number of partitions). In the following example, we create a Bag containing numbers from 0 to 99, partitioned into four chunks:

    import dask.bag as dab dab.from_sequence(range(100), ...

Get Python High Performance - Second Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.