Chapter 1. Introduction

When I was a graduate student, I had a serious problem: a brand-new dataset, made up of millions of data points collected painstakingly over a full week on a nationally recognized plasma research device, that contained values that were much too small.

About 40 orders of magnitude too small.

My advisor and I huddled in his office, in front of the shiny new G5 Power Mac that ran our visualization suite, and tried to figure out what was wrong. The data had been acquired correctly from the machine. It looked like the original raw file from the experiment’s digitizer was fine. I had written a (very large) script in the IDL programming language on my Thinkpad laptop to turn the raw data into files the visualization tool could use. This in-house format was simplicity itself: just a short fixed-width header and then a binary dump of the floating-point data. Even so, I spent another hour or so writing a program to verify and plot the files on my laptop. They were fine. And yet, when loaded into the visualizer, all the data that looked so beautiful in IDL turned into a featureless, unstructured mush of values all around 10⁻⁴¹.

Finally it came to us: both the digitizer machines and my Thinkpad used the “little-endian” format to represent floating-point numbers, in contrast to the “big-endian” format of the G5 Mac. Raw values written on one machine couldn’t be read on the other, and vice versa. I remember thinking that’s so stupid (among other less polite variations). Learning that this problem was so common that IDL supplied a special routine to deal with it (SWAP_ENDIAN) did not improve my mood.
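In hindsight, the failure is easy to reproduce. Here is a minimal NumPy sketch (not the original IDL workflow) of what happens when float32 values written in little-endian byte order are read back as if they were big-endian:

>>> import numpy as np
>>> samples = np.array([1.0, 2.0, 3.0], dtype='<f4')  # float32, little-endian
>>> raw = samples.tobytes()                           # the raw binary dump on disk
>>> misread = np.frombuffer(raw, dtype='>f4')         # read back as big-endian: denormalized mush
>>> fixed = np.frombuffer(raw, dtype='<f4')           # correct byte order recovers 1.0, 2.0, 3.0

The misread values come out as denormalized numbers roughly 40 orders of magnitude too small, which is essentially the “mush” we saw in the visualizer; rereading with the right byte order (roughly the job SWAP_ENDIAN does in IDL) recovers the data intact.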

At the time, I didn’t care that much about the details of how my data was stored. This incident and others like it changed my mind. As a scientist, I eventually came to recognize that the choices we make for organizing and storing our data are also choices about communication. Not only do standard, well-designed formats make life easier for individuals (and eliminate silly time-wasters like the “endian” problem), but they make it possible to share data with a global audience.

Python and HDF5

In the Python world, consensus is rapidly converging on Hierarchical Data Format version 5, or “HDF5,” as the standard mechanism for storing large quantities of numerical data. As data volumes get larger, organization of data becomes increasingly important; features in HDF5 like named datasets (Chapter 3), hierarchically organized groups (Chapter 5), and user-defined metadata “attributes” (Chapter 6) become essential to the analysis process.

Structured, “self-describing” formats like HDF5 are a natural complement to Python. Two production-ready, feature-rich interface packages exist for HDF5: h5py and PyTables, along with a number of smaller special-purpose wrappers.

Organizing Data and Metadata

Here’s a simple example of how HDF5’s structuring capability can help an application. Don’t worry too much about the details; later chapters explain both how the file is structured and how to use the HDF5 API from Python. Consider this a taste of what HDF5 can do for your application. If you want to follow along, you’ll need Python 2 with NumPy and h5py installed (see Chapter 2).

Suppose we have a NumPy array that represents some data from an experiment:

>>> import numpy as np
>>> temperature = np.random.random(1024)
>>> temperature
array([ 0.44149738,  0.7407523 ,  0.44243584, ...,  0.19018119,
        0.64844851,  0.55660748])

Let’s also imagine that these data points were recorded from a weather station that sampled the temperature, say, every 10 seconds. In order to make sense of the data, we have to record that sampling interval, or “delta-T,” somewhere. For now we’ll put it in a Python variable:

>>> dt = 10.0

The data acquisition started at a particular time, which we will also need to record. And of course, we have to know that the data came from Weather Station 15:

>>> start_time = 1375204299  # in Unix time
>>> station = 15

We could use the built-in NumPy function np.savez to store these values on disk. This simple function saves the values as NumPy arrays, packed together in a ZIP file with associated names:

>>> np.savez("weather.npz", data=temperature, start_time=start_time, station=
station)

We can get the values back from the file with np.load:

>>> out = np.load("weather.npz")
>>> out["data"]
array([ 0.44149738,  0.7407523 ,  0.44243584, ...,  0.19018119,
        0.64844851,  0.55660748])
>>> out["start_time"]
array(1375204299)
>>> out["station"]
array(15)

So far so good. But what if we have more than one quantity per station? Say there’s also wind speed data to record?

>>> wind = np.random.random(2048)
>>> dt_wind = 5.0   # Wind sampled every 5 seconds

And suppose we have multiple stations. We could introduce some kind of naming convention, I suppose: “wind_15” for the wind values from station 15, and things like “dt_wind_15” for the sampling interval. Or we could use multiple files…
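To make the awkwardness concrete, here’s a sketch of what that convention might look like with np.savez (the “_15” suffixes and names are purely illustrative):

>>> np.savez("weather.npz",
...          temperature_15=temperature, dt_temperature_15=10.0,
...          start_time_15=1375204299,
...          wind_15=wind, dt_wind_15=5.0)

It works, but every program that touches the file has to know the convention, and nothing in the file itself documents or enforces it.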

In contrast, here’s how this application might approach storage with HDF5:

>>> import h5py
>>> f = h5py.File("weather.hdf5")
>>> f["/15/temperature"] = temperature
>>> f["/15/temperature"].attrs["dt"] = 10.0
>>> f["/15/temperature"].attrs["start_time"] = 1375204299
>>> f["/15/wind"] = wind
>>> f["/15/wind"].attrs["dt"] = 5.0
>>> # ... more datasets and attributes for station 15 ...
>>> f["/20/temperature"] = temperature_from_station_20

(and so on for the other stations)

This example illustrates two of the “killer features” of HDF5: organization in hierarchical groups and attributes. Groups, like folders in a filesystem, let you store related datasets together. In this case, temperature and wind measurements from the same weather station are stored together under groups named “/15,” “/20,” etc. Attributes let you attach descriptive metadata directly to the data they describe. So if you give this file to a colleague, she can easily discover the information needed to make sense of the data:

>>> dataset = f["/15/temperature"]
>>> for key, value in dataset.attrs.iteritems():
...     print "%s: %s" % (key, value)
dt: 10.0
start_time: 1375204299

Coping with Large Data Volumes

As a high-level “glue” language, Python is increasingly being used for rapid visualization of big datasets and to coordinate large-scale computations that run in compiled languages like C and FORTRAN. It’s now relatively common to deal with datasets hundreds of gigabytes or even terabytes in size; HDF5 itself can scale up to exabytes.

On all but the biggest machines, it’s not feasible to load such datasets directly into memory. One of HDF5’s greatest strengths is its support for subsetting and partial I/O. For example, let’s take the 1024-element “temperature” dataset we created earlier:

>>> dataset = f["/15/temperature"]

Here, the object named dataset is a proxy object representing an HDF5 dataset. It supports array-like slicing operations, which will be familiar to frequent NumPy users:

>>> dataset[0:10]
array([ 0.44149738,  0.7407523 ,  0.44243584,  0.3100173 ,  0.04552416,
        0.43933469,  0.28550775,  0.76152561,  0.79451732,  0.32603454])
>>> dataset[0:10:2]
array([ 0.44149738,  0.44243584,  0.04552416,  0.28550775,  0.79451732])

Keep in mind that the actual data lives on disk; when slicing is applied to an HDF5 dataset, the appropriate data is found and loaded into memory. Slicing in this fashion leverages the underlying subsetting capabilities of HDF5 and is consequently very fast.
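This also means you can process a dataset far larger than memory one piece at a time. As a minimal sketch (the block size of 256 is arbitrary), here’s a mean computed by streaming over the dataset in slices:

>>> total = 0.0
>>> for start in range(0, dataset.shape[0], 256):
...     block = dataset[start:start+256]   # only these 256 values are read from disk
...     total += block.sum()
...
>>> mean = total / dataset.shape[0]        # computed without ever loading the whole dataset

For a 1024-element dataset this is overkill, of course, but the same loop works unchanged when the dataset holds billions of values.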

Another great thing about HDF5 is that you have control over how storage is allocated. For example, except for some metadata, a brand new dataset takes zero space, and by default bytes are only used on disk to hold the data you actually write.

Here’s a 2-terabyte dataset you can create on just about any computer:

>>> big_dataset = f.create_dataset("big", shape=(1024, 1024, 1024, 512), dtype='float32')

Although no storage is yet allocated, the entire “space” of the dataset is available to us. We can write anywhere in the dataset, and only the bytes on disk necessary to hold the data are used:

>>> big_dataset[344, 678, 23, 36] = 42.0
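One way to convince yourself that those 2 terabytes were never actually allocated is to flush the file and check its size on disk (just a sanity check; the exact number will vary):

>>> import os
>>> f.flush()                                 # push any buffered writes out to the file
>>> size = os.path.getsize("weather.hdf5")    # in bytes: kilobytes-scale, nowhere near 2 TB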

When storage is at a premium, you can even use transparent compression on a dataset-by-dataset basis (see Chapter 4):

>>> compressed_dataset = f.create_dataset("comp", shape=(1024,), dtype='int32', compression='gzip')
>>> compressed_dataset[:] = np.arange(1024)
>>> compressed_dataset[:]
array([   0,    1,    2, ..., 1021, 1022, 1023])

What Exactly Is HDF5?

HDF5 is a great mechanism for storing large numerical arrays of homogeneous type, for data models that can be organized hierarchically and benefit from tagging of datasets with arbitrary metadata.

It’s quite different from SQL-style relational databases. HDF5 has quite a few organizational tricks up its sleeve (see Chapter 8, for example), but if you find yourself needing to enforce relationships between values in various tables, or wanting to perform JOINs on your data, a relational database is probably more appropriate. Likewise, for tiny 1D datasets that you need to be able to read on machines without HDF5 installed, text formats like CSV (with all their warts) are a reasonable alternative.

HDF5 is just about perfect if you make minimal use of relational features and have a need for very high performance, partial I/O, hierarchical organization, and arbitrary metadata.

So what, specifically, is “HDF5”? I would argue it consists of three things:

  1. A file specification and associated data model.
  2. A standard library with API access available from C, C++, Java, Python, and others.
  3. A software ecosystem, consisting of both client programs using HDF5 and “analysis platforms” like MATLAB, IDL, and Python.

HDF5: The File

In the preceding brief examples, you saw the three main elements of the HDF5 data model: datasets, array-like objects that store your numerical data on disk; groups, hierarchical containers that store datasets and other groups; and attributes, user-defined bits of metadata that can be attached to datasets (and groups!).

Using these basic abstractions, users can build specific “application formats” that organize data in a way appropriate for the problem domain. For example, our “weather station” code used one group for each station, and separate datasets for each measured parameter, with attributes to hold additional information about what the datasets mean. It’s very common for laboratories or other organizations to agree on such a “format-within-a-format” that specifies what arrangement of groups, datasets, and attributes is to be used to store information.
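As a sketch of how such a convention might be captured in code (the helper name, its arguments, and the station-21 arrays in the last line are invented for illustration), a lab could keep the agreed-upon layout in one small function:

>>> def add_station(f, station_id, temperature, wind, dt_temp, dt_wind, start_time):
...     # Our hypothetical convention: one group per station, one dataset per quantity
...     grp = f.create_group(str(station_id))
...     tdset = grp.create_dataset("temperature", data=temperature)
...     tdset.attrs["dt"] = dt_temp
...     tdset.attrs["start_time"] = start_time
...     wdset = grp.create_dataset("wind", data=wind)
...     wdset.attrs["dt"] = dt_wind
...
>>> add_station(f, 21, temperature_21, wind_21, 10.0, 5.0, 1375204299)

Everyone in the lab then produces files with exactly the same structure, and the “format-within-a-format” lives in one place instead of in people’s heads.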

Since HDF5 takes care of all cross-platform issues like endianness, sharing data with other groups becomes a simple matter of manipulating groups, datasets, and attributes to get the desired result. And because the files are self-describing, even knowing about the application format isn’t usually necessary to get data out of the file. You can simply open the file and explore its contents:

>>> f.keys()
[u'15', u'big', u'comp']
>>> f["/15"].keys()
[u'temperature', u'wind']
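If you don’t know the layout at all, h5py will even walk the whole hierarchy for you. Here’s a short sketch using visititems (printname is just an illustrative helper):

>>> def printname(name, obj):
...     print name, obj
...
>>> f.visititems(printname)    # calls printname on every group and dataset in the file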

Anyone who has spent hours fiddling with byte-offsets while trying to read “simple” binary formats can appreciate this.

Finally, the low-level byte layout of an HDF5 file on disk is an open specification. There are no mysteries about how it works, in contrast to proprietary binary formats. And although people typically use the library provided by the HDF Group to access files, nothing prevents you from writing your own reader if you want.

HDF5: The Library

The HDF5 file specification and open source library is maintained by the HDF Group, a nonprofit organization headquartered in Champaign, Illinois. Formerly part of the University of Illinois Urbana-Champaign, the HDF Group’s primary product is the HDF5 software library.

Written in C, with additional bindings for C++ and Java, this library is what people usually mean when they say “HDF5.” Both of the most popular Python interfaces, PyTables and h5py, are designed to use the C library provided by the HDF Group.

One important point to make is that this library is actively maintained, and the developers place a strong emphasis on backwards compatibility. This applies both to the files the library produces and to programs that use the API. File compatibility is a must for an archival format like HDF5. Such careful attention to API compatibility is the main reason that packages like h5py and PyTables have been able to get traction with many different versions of HDF5 installed in the wild.

You should have confidence when using HDF5 for scientific data storage, including long-term storage. And since both the library and format are open source, your files will be readable even if a meteor takes out Illinois.

HDF5: The Ecosystem

Finally, one aspect that makes HDF5 particularly useful is that you can read and write files from just about every platform. The IDL language has supported HDF5 for years; MATLAB has similar support and now even uses HDF5 as the default format for its “.mat” save files. Bindings are also available for Python, C++, Java, .NET, and LabVIEW, among others. Institutional users include NASA’s Earth Observing System, whose “EOS5” format is an application format on top of the HDF5 container, as in the much simpler example earlier. Even the newest version of the competing NetCDF format, NetCDF4, is implemented using HDF5 groups, datasets, and attributes.

Hopefully I’ve been able to share with you some of the things that make HDF5 so exciting for scientific use. Next, we’ll review the basics of how HDF5 works and get started on using it from Python.
