Chapter 3. Working with Datasets

Datasets are the central feature of HDF5. You can think of them as NumPy arrays that live on disk. Every dataset in HDF5 has a name, a type, and a shape, and supports random access. Unlike the built-in and friends, there’s no need to read and write the entire array as a block; you can use the standard NumPy syntax for slicing to read and write just the parts you want.

Dataset Basics

First, let’s create a file so we have somewhere to store our datasets:

>>> f = h5py.File("testfile.hdf5")

Every dataset in an HDF5 file has a name. Let’s see what happens if we just assign a new NumPy array to a name in the file:

>>> arr = np.ones((5,2))
>>> f["my dataset"] = arr
>>> dset = f["my dataset"]
>>> dset
<HDF5 dataset "my dataset": shape (5, 2), type "<f8">

We put in a NumPy array but got back something else: an instance of the class h5py.Dataset. This is a “proxy” object that lets you read and write to the underlying HDF5 dataset on disk.

Type and Shape

Let’s explore the Dataset object. If you’re using IPython, type dset. and hit Tab to see the object’s attributes; otherwise, do dir(dset). There are a lot, but a few stand out:

>>> dset.dtype

Each dataset has a fixed type that is defined when it’s created and can never be changed. HDF5 has a vast, expressive type mechanism that can easily handle the built-in NumPy types, with few exceptions. For this reason, h5py always expresses the type of a dataset using standard NumPy dtype objects.

There’s ...

Get Python and HDF5 now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.