8 Storing big data

This chapter covers

  • Getting to know fsspec, an abstraction library over filesystems
  • Storing heterogeneous columnar data efficiently with Parquet
  • Processing data files with in-memory libraries like pandas or Arrow
  • Processing homogeneous multi-dimensional array data with Zarr

When dealing with big data, persistence is of paramount importance. We want to read and write data as fast as possible, preferably from many parallel processes. We also want persistent representations that are compact, because storing large amounts of data can be expensive.

In this chapter, we will consider several approaches to make persistent storage of data more efficient. We will start with a short discussion of fsspec, a library ...
