Chapter 10. Storing Data: Files and HDF5

HDF5 stands for Hierarchical Data Format 5, a free and open source binary file type specification. HDF5 is built and supported by the HDF Group, which is an organization that split off from the University of Illinois Champagne-Urbana. What makes HDF5 great is the numerous libraries written to interact with files of this type and its extremely rich feature set.

HDF5 has become the default binary database for scientific computing. Unlike other software developers, scientists tend not to be primarily concerned with variable-length strings, and our data is highly structured. What sets our data apart is the sheer quantity of it.

The Big Data regime often deals with tables that have millions to billions of rows. The cutting edge of computational science is trying to figure how out to deal with data on the order of 1016 to 1018. HDF5 is at the forefront of tackling this quantity of data. At this volume, data earns the term exascale because the size is roughly 1 exabyte. An exabyte is almost unimaginably large. And at this scale, any improvements that can be made to the storage size per element are worth implementing.

The beauty of HDF5 is that it works equally well on gargantuan data as it does on tiny datasets. This allows users to play around with subsets of their data on their laptops and then seamlessly deploy to the largest computers ever built, and everything in between.

A contributing factor to the popularity of HDF5 is that it is accessible ...

Get Effective Computation in Physics now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.