Chapter 4. Using Public Datasets with TensorFlow Datasets

In the first chapters of this book you trained models using a variety of data, from the Fashion MNIST dataset that is conveniently bundled with Keras to the image-based Horses or Humans and Dogs vs. Cats datasets, which were available as ZIP files that you had to download and preprocess. You’ve probably already realized that there are lots of different ways of getting the data with which to train a model.

However, many public datasets require you to learn lots of different domain-specific skills before you begin to consider your model architecture. The goal behind TensorFlow Datasets (TFDS) is to expose datasets in a way that’s easy to consume, where all the preprocessing steps of acquiring the data and getting it into TensorFlow-friendly APIs are done for you.

You’ve already seen a little of this idea with how Keras handled Fashion MNIST back in Chapters 1 and 2. As a recap, all you had to do to get the data was this:

data = tf.keras.datasets.fashion_mnist

(training_images, training_labels), (test_images, test_labels) = \
    data.load_data()

TFDS builds on this idea, but greatly expands not only the number of datasets available but the diversity of dataset types. The list of available datasets is growing all the time, in categories such as:

Audio
Speech and music data
Image
From simple learning datasets like Horses or Humans up to advanced research datasets for uses such as diabetic retinopathy detection
Object detection
COCO, Open Images, and more
Structured data
Titanic survivors, Amazon reviews, and more
Summarization
News from CNN and the Daily Mail, scientific papers, wikiHow, and more
Text
IMDb reviews, natural language questions, and more
Translate
Various translation training datasets
Video
Moving MNIST, Starcraft, and more
Note

TensorFlow Datasets is a separate install from TensorFlow, so be sure to install it before trying out any samples! If you are using Google Colab, it’s already preinstalled.

This chapter will introduce you to TFDS and how you can use it to greatly simplify the training process. We’ll explore the underlying TFRecord structure and how it can provide commonality regardless of the type of the underlying data. You’ll also learn about the Extract-Transform-Load (ETL) pattern using TFDS, which can be used to train models with huge amounts of data efficiently.

Getting Started with TFDS

Let’s go through some simple examples of how to use TFDS to illustrate how it gives us a standard interface to our data, regardless of data type.

If you need to install it, you can do so with a pip command:

pip install tensorflow-datasets

Once it’s installed, you can use it to get access to a dataset with tfds.load, passing it the name of the desired dataset. For example, if you want to use Fashion MNIST, you can use code like this:

import tensorflow as tf
import tensorflow_datasets as tfds
mnist_data = tfds.load("fashion_mnist")
for item in mnist_data:
  print(item)

Be sure to inspect the data type that you get back from the tfds.load command. Iterating over it and printing the items shows the splits that are natively available in the data; in this case it’s a dictionary with two keys, test and train, which are the available splits.
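A quick check makes this concrete (a minimal sketch; the exact printed representation may vary between TFDS versions):

print(type(mnist_data))   # e.g. <class 'dict'>
print(mnist_data.keys())  # e.g. dict_keys(['test', 'train'])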

If you want to load these splits into a dataset containing the actual data, you can simply specify the split you want in the tfds.load command, like this:

mnist_train = tfds.load(name="fashion_mnist", split="train")
assert isinstance(mnist_train, tf.data.Dataset)
print(type(mnist_train))

In this instance, you’ll see that the output is a DatasetAdapter, which you can iterate through to inspect the data. One nice feature of this adapter is that you can simply call take(1) to get the first record. Let’s do that to inspect what the data looks like:

for item in mnist_train.take(1):
  print(type(item))
  print(item.keys())

The output from the first print will show that each record in the dataset is a dictionary. When we print its keys, we’ll see that for this image set they are image and label. So, if we want to inspect a value in the dataset, we can do something like this:

for item in mnist_train.take(1):
  print(type(item))
  print(item.keys())
  print(item['image'])
  print(item['label'])

You’ll see that the output for the image is a 28 × 28 × 1 array of values (in a tf.Tensor) ranging from 0 to 255, representing the pixel intensities. The label will be output as tf.Tensor(2, shape=(), dtype=int64), indicating that this image is in class 2 of the dataset.
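If you only want the shape and the label value rather than the full tensor printout, a small variation works too (a sketch; the label value assumes the same first record as above):

for item in mnist_train.take(1):
  print(item['image'].shape)    # (28, 28, 1)
  print(item['label'].numpy())  # 2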

Data about the dataset is also available using the with_info parameter when loading the dataset, like this:

mnist_test, info = tfds.load(name="fashion_mnist", with_info=True)
print(info)

Printing the info will give you details about the contents of the dataset. For example, for Fashion MNIST, you’ll see output like this:

tfds.core.DatasetInfo(
    name='fashion_mnist',
    version=3.0.0,
    description='Fashion-MNIST is a dataset of Zalando's article images
      consisting of a training set of 60,000 examples and a test set of 10,000
      examples. Each example is a 28x28 grayscale image, associated with a
      label from 10 classes.',
    homepage='https://github.com/zalandoresearch/fashion-mnist',
    features=FeaturesDict({
        'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
    }),
    total_num_examples=70000,
    splits={
        'test': 10000,
        'train': 60000,
    },
    supervised_keys=('image', 'label'),
    citation="""@article{DBLP:journals/corr/abs-1708-07747,
      author    = {Han Xiao and
                   Kashif Rasul and
                   Roland Vollgraf},
      title     = {Fashion-MNIST: a Novel Image Dataset for Benchmarking 
                   Machine Learning
                   Algorithms},
      journal   = {CoRR},
      volume    = {abs/1708.07747},
      year      = {2017},
      url       = {http://arxiv.org/abs/1708.07747},
      archivePrefix = {arXiv},
      eprint    = {1708.07747},
      timestamp = {Mon, 13 Aug 2018 16:47:27 +0200},
      biburl    = {https://dblp.org/rec/bib/journals/corr/abs-1708-07747},
      bibsource = {dblp computer science bibliography, https://dblp.org}
    }""",
    redistribution_info=,
)

Within this you can see details such as the splits (as demonstrated earlier) and the features within the dataset, as well as extra information like the citation, description, and dataset version.
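The info object isn’t just for printing; you can also read its fields programmatically. A brief sketch, assuming the Fashion MNIST info loaded above:

print(info.splits['train'].num_examples)   # 60000
print(info.features['label'].num_classes)  # 10
print(info.supervised_keys)                # ('image', 'label')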

Using TFDS with Keras Models

In Chapter 2 you saw how to create a simple computer vision model using TensorFlow and Keras, with the built-in datasets from Keras (including Fashion MNIST), using simple code like this:

mnist = tf.keras.datasets.fashion_mnist

(training_images, training_labels), \
    (test_images, test_labels) = mnist.load_data()

When using TFDS the code is very similar, but with some minor changes. The Keras datasets gave us ndarray types that worked natively in model.fit, but with TFDS we’ll need to do a little conversion work:

(training_images, training_labels), \
(test_images, test_labels) = \
tfds.as_numpy(tfds.load('fashion_mnist',
                        split=['train', 'test'],
                        batch_size=-1,
                        as_supervised=True))

In this case we use tfds.load, passing it fashion_mnist as the desired dataset. We know that it has train and test splits, so passing these as a list returns a list of datasets containing the images and labels. Wrapping the tfds.load call in tfds.as_numpy causes them to be returned as NumPy arrays. Specifying batch_size=-1 gives us all of the data in a single batch, and as_supervised=True ensures we get tuples of (input, label) returned.

Once we’ve done that, we have pretty much the same format of data that was available in the Keras datasets, with one modification—the shape in TFDS is (28, 28, 1), whereas in the Keras datasets it was (28, 28).

This means the code needs to change a little to specify that the input data shape is (28, 28, 1) instead of (28, 28):

import tensorflow as tf
import tensorflow_datasets as tfds

(training_images, training_labels), (test_images, test_labels) = \
    tfds.as_numpy(tfds.load('fashion_mnist', split=['train', 'test'],
                            batch_size=-1, as_supervised=True))

training_images = training_images / 255.0
test_images = test_images / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28,28,1)),
    tf.keras.layers.Dense(128, activation=tf.nn.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(training_images, training_labels, epochs=5)

For a more complex example, you can take a look at the Horses or Humans dataset used in Chapter 3. This is also available in TFDS. Here’s the complete code to train a model with it:

import tensorflow as tf
import tensorflow_datasets as tfds

data = tfds.load('horses_or_humans', split='train', as_supervised=True)
 
train_batches = data.shuffle(100).batch(10)

model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(16, (3,3), activation='relu', 
                           input_shape=(300, 300, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(32, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='Adam', loss='binary_crossentropy',
              metrics=['accuracy'])

history = model.fit(train_batches, epochs=10)

As you can see, it’s pretty straightforward: simply call tfds.load, passing it the split that you want (in this case train), and use that in the model. The data is batched and shuffled to make training more effective.

The Horses or Humans dataset is split into training and test sets, so if you want to do validation of your model while training, you can do so by loading a separate validation set from TFDS like this:

val_data = tfds.load('horses_or_humans', split='test', as_supervised=True)

You’ll need to batch it, the same as you did for the training set. For example:

validation_batches = val_data.batch(32)

Then, when training, you specify the validation data as these batches. You have to explicitly set the number of validation steps to use per epoch too, or TensorFlow will throw an error. If you’re not sure, just set it to 1 like this:

history = model.fit(train_batches, epochs=10,
                    validation_data=validation_batches, validation_steps=1)

Loading Specific Versions

All datasets stored in TFDS use a MAJOR.MINOR.PATCH numbering system, which provides the following guarantees:

PATCH updated
The data returned by a call is identical, although the underlying organization may have changed. Any changes should be invisible to developers.

MINOR updated
The data is unchanged, except that there may be additional features in each record (nonbreaking changes). Also, for any particular slice (see “Using Custom Splits”) the data will be the same, so records aren’t reordered.

MAJOR updated
There may be changes in the format of the records and their placement, so particular slices may return different values.

When you inspect datasets, you will see when there are different versions available—for example, this is the case for the cnn_dailymail dataset. If you don’t want the default one, which at time of writing was 3.0.0, and instead want an earlier one, such as 1.0.0, you can simply load it like this:

data, info = tfds.load("cnn_dailymail:1.0.0", with_info=True)

Note that if you are using Colab, it’s always a good idea to check the version of TFDS that it uses. At time of writing, Colab was preconfigured for TFDS 2.0, but there are some bugs in loading datasets (including the cnn_dailymail one) that have been fixed in TFDS 2.1 and later, so be sure to use one of those versions, or at least install them into Colab, instead of relying on the built-in default.
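One way to check is to print the library version from within your notebook (a minimal sketch; the commented pip line is only needed if the preinstalled version is older than you want):

import tensorflow_datasets as tfds
print(tfds.__version__)
# In a Colab cell you could then upgrade, for example with:
# !pip install -U tensorflow-datasets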

Using Mapping Functions for Augmentation

In Chapter 3 you saw the useful augmentation tools that were available when using an ImageDataGenerator to provide the training data for your model. You may be wondering how you might achieve the same when using TFDS, as you aren’t flowing the images from a subdirectory like before. The best way to achieve this—or indeed any other form of transformation—is to use a mapping function on the data adapter. Let’s take a look at how to do that.

Earlier, with our Horses or Humans data, we simply loaded the data from TFDS and created batches for it like this:

data = tfds.load('horses_or_humans', split='train', as_supervised=True)
 
train_batches = data.shuffle(100).batch(10)

To do transforms and have them mapped to the dataset, you can create a mapping function. This is just standard Python code. For example, suppose you create a function called augmentimages and have it do some image augmentation, like this:

def augmentimages(image, label):
  image = tf.cast(image, tf.float32)
  image = (image/255)
  image = tf.image.random_flip_left_right(image)
  return image, label

You can then map this to the data to create a new dataset called train:

train = data.map(augmentimages)

Then when you create the batches, do this from train instead of from data, like this:

train_batches = train.shuffle(100).batch(32)

You can see in the augmentimages function that there is a random flip left or right of the image, done using tf.image.random_flip_left_right(image). There are lots of functions in the tf.image library that you can use for augmentation; see the documentation for details.
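For example, an extended version of augmentimages might chain a few more tf.image operations (a sketch; the parameter values here are arbitrary illustrations, not recommendations):

def augmentimages(image, label):
  image = tf.cast(image, tf.float32)
  image = (image/255)
  image = tf.image.random_flip_left_right(image)
  image = tf.image.random_brightness(image, max_delta=0.2)
  image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
  # keep pixel values in the [0, 1] range after the random adjustments
  image = tf.clip_by_value(image, 0, 1)
  return image, label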

Using TensorFlow Addons

The TensorFlow Addons library contains even more functions that you can use. Some of the augmentations offered by ImageDataGenerator (such as rotation) can only be found there, so it’s a good idea to check it out.

Using TensorFlow Addons is pretty easy—you simply install the library with:

pip install tensorflow-addons

Once that’s done, you can mix the addons into your mapping function. Here’s an example where the rotate addon is used in the mapping function from earlier:

import tensorflow_addons as tfa

def augmentimages(image, label):
  image = tf.cast(image, tf.float32)
  image = (image/255)
  image = tf.image.random_flip_left_right(image)
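  # note: tfa.image.rotate interprets the angle argument in radians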
  image = tfa.image.rotate(image, 40, interpolation='NEAREST')
  return image, label

Using Custom Splits

Up to this point, all of the data you’ve been using to build models has been presplit into training and test sets for you. For example, with Fashion MNIST you had 60,000 and 10,000 records, respectively. But what if you don’t want to use those splits? What if you want to split the data yourself according to your own needs? That’s one of the aspects of TFDS that’s really powerful—it comes complete with an API that gives you fine, granular control over how you split your data.

You’ve actually seen it already when loading data like this:

data = tfds.load('cats_vs_dogs', split='train', as_supervised=True)

Note that the split parameter is a string, and in this case you’re asking for the train split, which happens to be the entire dataset. If you’re familiar with Python slice notation, you can use that as well. This notation can be summarized as defining your desired slices within square brackets like this: [<start>: <stop>: <step>]. It’s quite a sophisticated syntax giving you great flexibility.

For example, if you want the first 10,000 records of train to be your training data, you can omit <start> and just call for train[:10000] (a useful mnemonic is to read the leading colon as “the first,” so this would read “train the first 10,000 records”):

data = tfds.load('cats_vs_dogs', split='train[:10000]', as_supervised=True)

You can also use % to specify the split. For example, if you want the first 20% of the records to be used for training, you could use :20% like this:

data = tfds.load('cats_vs_dogs', split='train[:20%]', as_supervised=True)

You could even get a little crazy and combine splits. That is, if you want your training data to be a combination of the first and last thousand records, you could do the following (where -1000: means “the last 1,000 records” and :1000 means “the first 1,000 records”):

data = tfds.load('cats_vs_dogs', split='train[-1000:]+train[:1000]', 
                 as_supervised=True)

The Dogs vs. Cats dataset doesn’t have fixed training, test, and validation splits, but, with TFDS, creating your own is simple. Suppose you want the split to be 80%, 10%, 10%. You could create the three sets like this:

train_data = tfds.load('cats_vs_dogs', split='train[:80%]', 
                       as_supervised=True)


validation_data = tfds.load('cats_vs_dogs', split='train[80%:90%]', 
                            as_supervised=True)


test_data = tfds.load('cats_vs_dogs', split='train[-10%:]',
                       as_supervised=True)

Once you have them, you can use them as you would any named split.

One caveat is that because the datasets that are returned can’t be interrogated for length, it’s often difficult to check that you have split the original set correctly. To see how many records you have in a split, you have to iterate through the whole set and count them one by one. Here’s the code to do that for the training set you just created:

train_length = [i for i,_ in enumerate(train_data)][-1] + 1
print(train_length)

This can be a slow process, so be sure to use it only when you’re debugging!
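If you loaded the dataset with with_info=True, an alternative sketch is to read split sizes from the dataset metadata instead of iterating (this assumes a recent TFDS version, which accepts slice strings when indexing info.splits):

train_data, info = tfds.load('cats_vs_dogs', split='train[:80%]',
                             with_info=True, as_supervised=True)
print(info.splits['train'].num_examples)        # size of the full train split
print(info.splits['train[:80%]'].num_examples)  # size of the 80% slice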

Understanding TFRecord

When you’re using TFDS, your data is downloaded and cached to disk so that you don’t need to download it each time you use it. TFDS uses the TFRecord format for caching. If you watch closely as it’s downloading the data you’ll see this—for example, Figure 4-1 shows how the cnn_dailymail dataset is downloaded, shuffled, and written to a TFRecord file.

Figure 4-1. Downloading the cnn_dailymail dataset as a TFRecord file

This is the preferred format in TensorFlow for storing and retrieving large amounts of data. It’s a very simple file structure, read sequentially for better performance. On disk the file is pretty straightforward, with each record consisting of an integer indicating the length of the record, a cyclic redundancy check (CRC) of that, a byte array of the data, and a CRC of that byte array. The records are concatenated into the file and then sharded in the case of large datasets.

For example, Figure 4-2 shows how the training set from cnn_dailymail is sharded into 16 files after download.
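To make the record framing described above concrete, here’s a minimal sketch that walks a TFRecord file with nothing more than Python’s struct module (the path refers to the MNIST test shard used in the example that follows; adjust it to wherever TFDS cached the file on your machine):

import struct

# Each record is laid out as:
#   [uint64 length][uint32 CRC of length][data bytes][uint32 CRC of data]
filename = "/root/tensorflow_datasets/mnist/3.0.0/mnist-test.tfrecord-00000-of-00001"
with open(filename, "rb") as f:
  while True:
    header = f.read(8)
    if not header:
      break
    length, = struct.unpack("<Q", header)   # record length, little-endian
    f.read(4)                               # masked CRC of the length
    data = f.read(length)                   # the serialized record itself
    f.read(4)                               # masked CRC of the data
    print("record of", length, "bytes")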

To take a look at a simpler example, download the MNIST dataset and print its info:

data, info = tfds.load("mnist", with_info=True)
print(info)

Within the info you’ll see that its features are stored like this:

features=FeaturesDict({
    'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
}),

Similar to the CNN/DailyMail example, the files are downloaded to /root/tensorflow_datasets/mnist/<version>/.

You can load the raw records as a TFRecordDataset like this:

filename="/root/tensorflow_datasets/mnist/3.0.0/
                                   mnist-test.tfrecord-00000-of-00001"
raw_dataset = tf.data.TFRecordDataset(filename)
for raw_record in raw_dataset.take(1):

print(repr(raw_record))

Note that your filename location may be different depending on your operating system.

Figure 4-2. Inspecting the TFRecords for cnn_dailymail

This will print out the raw contents of the record, like this:

<tf.Tensor: shape=(), dtype=string,
numpy=b"\n\x85\x03\n\xf2\x02\n\x05image\x12\xe8\x02\n\xe5\x02\n\xe2\x02\x89PNG\r
\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00\x1c\x00\x00\x00\x1c\x08\x00\x00\x00\x00Wf
\x80H\x00\x00\x01)IDAT(\x91\xc5\xd2\xbdK\xc3P\x14\x05\xf0S(v\x13)\x04,.\x82\xc5A
q\xac\xedb\x1d\xdc\n.\x12\x87n\x0e\x82\x93\x7f@Q\xb2\x08\xba\tbQ0.\xe2\xe2\xd4\x
b1\xa2h\x9c\x82\xba\x8a(\nq\xf0\x83Fh\x95\n6\x88\xe7R\x87\x88\xf9\xa8Y\xf5\x0e\x
8f\xc7\xfd\xdd\x0b\x87\xc7\x03\xfe\xbeb\x9d\xadT\x927Q\xe3\xe9\x07:\xab\xbf\xf4\
xf3\xcf\xf6\x8a\xd9\x14\xd29\xea\xb0\x1eKH\xde\xab\xea%\xaba\x1b=\xa4P/\xf5\x02\
xd7\\\x07\x00\xc4=,L\xc0,>\x01@2\xf6\x12\xde\x9c\xde[t/\xb3\x0e\x87\xa2\xe2\
xc2\xe0A<\xca\xb26\xd5(\x1b\xa9\xd3\xe8\x0e\xf5\x86\x17\xceE\xdarV\xae\xb7_\xf3
I\xf7(\x06m\xaaE\xbb\xb6\xac\r*\x9b$e<\xb8\xd7\xa2\x0e\x00\xd0l\x92\xb2\xd5\x15\
xcc\xae'\x00\xf4m\x08O'+\xc2y\x9f\x8d\xc9\x15\x80\xfe\x99[q\x962@CN|i\xf7\xa9!=\
\xab\x19\x00\xc8\xd6\xb8\xeb\xa1\xf0\xd8l\xca\xfb]\xee\xfb]*\x9fV\xe1\x07\xb7\xc
9\x8b55\xe7M\xef\xb0\x04\xc0\xfd&\x89\x01<\xbe\xf9\x03*\x8a\xf5\x81\x7f\xaa/2y\x
87ks\xec\x1e\xc1\x00\x00\x00\x00IEND\xaeB`\x82\n\x0e\n\x05label\x12\x05\x1a\x03\
n\x01\x02">

It’s a long string containing the details of the record, along with checksums, etc. But if we already know the features, we can create a feature description and use this to parse the data. Here’s the code:

# Create a description of the features
feature_description = {
    'image': tf.io.FixedLenFeature([], dtype=tf.string),
    'label': tf.io.FixedLenFeature([], dtype=tf.int64),
}

def _parse_function(example_proto):
  # Parse the input `tf.Example` proto using the dictionary above
  return tf.io.parse_single_example(example_proto, feature_description)

parsed_dataset = raw_dataset.map(_parse_function)
for parsed_record in parsed_dataset.take(1):
  print((parsed_record))

The output of this is a little friendlier! First of all, you can see that the image is a Tensor, and that it contains a PNG. PNG is a compressed image format with a header defined by IHDR and the image data between IDAT and IEND. If you look closely, you can see them in the byte stream. There’s also the label, stored as an int and containing the value 2:

{'image': <tf.Tensor: shape=(), dtype=string,
numpy=b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00\x1c\x00\x00\x00\x1c\x08\
x00\x00\x00\x00Wf\x80H\x00\x00\x01)IDAT(\x91\xc5\xd2\xbdK\xc3P\x14\x05\xf0S(v\x1
3)\x04,.\x82\xc5Aq\xac\xedb\x1d\xdc\n.\x12\x87n\x0e\x82\x93\x7f@Q\xb2\x08\xba\tb
Q0.\xe2\xe2\xd4\xb1\xa2h\x9c\x82\xba\x8a(\nq\xf0\x83Fh\x95\n6\x88\xe7R\x87\x88\x
f9\xa8Y\xf5\x0e\x8f\xc7\xfd\xdd\x0b\x87\xc7\x03\xfe\xbeb\x9d\xadT\x927Q\xe3\xe9\
x07:\xab\xbf\xf4\xf3\xcf\xf6\x8a\xd9\x14\xd29\xea\xb0\x1eKH\xde\xab\xea%\xaba\x1
b=\xa4P/\xf5\x02\xd7\\\x07\x00\xc4=,L\xc0,>\x01@2\xf6\x12\xde\x9c\xde[t/\xb3\x0e
\x87\xa2\xe2\xc2\xe0A<\xca\xb26\xd5(\x1b\xa9\xd3\xe8\x0e\xf5\x86\x17\xceE\xdarV\
xae\xb7_\xf3AR\r!I\xf7(\x06m\xaaE\xbb\xb6\xac\r*\x9b$e<\xb8\xd7\xa2\x0e\x00\xd0l
\x92\xb2\xd5\x15\xcc\xae'\x00\xf4m\x08O'+\xc2y\x9f\x8d\xc9\x15\x80\xfe\x99[q\x96
2@CN|i\xf7\xa9!=\xd7
\xab\x19\x00\xc8\xd6\xb8\xeb\xa1\xf0\xd8l\xca\xfb]\xee\xfb]*\x9fV\xe1\x07\xb7\xc
9\x8b55\xe7M\xef\xb0\x04\xc0\xfd&\x89\x01<\xbe\xf9\x03*\x8a\xf5\x81\x7f\xaa/2y\x
87ks\xec\x1e\xc1\x00\x00\x00\x00IEND\xaeB`\x82">, 'label': <tf.Tensor: shape=(),
dtype=int64, numpy=2>}

At this point you can read the raw TFRecord and decode it as a PNG using a PNG decoder library like Pillow.
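For example, here’s a minimal sketch that stays within TensorFlow, using tf.io.decode_png rather than Pillow:

for parsed_record in parsed_dataset.take(1):
  image = tf.io.decode_png(parsed_record['image'], channels=1)
  print(image.shape)                     # (28, 28, 1)
  print(parsed_record['label'].numpy())  # 2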

The ETL Process for Managing Data in TensorFlow

ETL is the core pattern that TensorFlow uses for training, regardless of scale. We’ve been exploring small-scale, single-computer model building in this book, but the same technology can be used for large-scale training across multiple machines with massive datasets.

The Extract phase of the ETL process is when the raw data is loaded from wherever it is stored and prepared in a way that can be transformed. The Transform phase is when the data is manipulated in a way that makes it suitable or improved for training. For example, batching, image augmentation, mapping to feature columns, and other such logic applied to the data can be considered part of this phase. The Load phase is when the data is loaded into the neural network for training.

Consider the full code to train the Horses or Humans classifier, shown here. I’ve added comments to show where the Extract, Transform, and Load phases take place:

import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_addons as tfa

# MODEL DEFINITION START #
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(16, (3,3), activation='relu', 
                           input_shape=(300, 300, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(32, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='Adam', loss='binary_crossentropy', 
              metrics=['accuracy'])
# MODEL DEFINITION END #

# EXTRACT PHASE START #
data = tfds.load('horses_or_humans', split='train', as_supervised=True)
val_data = tfds.load('horses_or_humans', split='test', as_supervised=True)
# EXTRACT PHASE END

# TRANSFORM PHASE START #
def augmentimages(image, label):
  image = tf.cast(image, tf.float32)
  image = (image/255)
  image = tf.image.random_flip_left_right(image)
  image = tfa.image.rotate(image, 40, interpolation='NEAREST')
  return image, label

train = data.map(augmentimages)
train_batches = train.shuffle(100).batch(32)
validation_batches = val_data.batch(32)
# TRANSFORM PHASE END

# LOAD PHASE START #
history = model.fit(train_batches, epochs=10, 
                    validation_data=validation_batches, validation_steps=1)
# LOAD PHASE END #

Using this process can make your data pipelines less susceptible to changes in the data and the underlying schema. When you use TFDS to extract data, the same underlying structure is used regardless of whether the data is small enough to fit in memory, or so large that it cannot be contained even on a single machine. The tf.data APIs for transformation are also consistent, so you can use similar ones regardless of the underlying data source. And, of course, once it’s transformed, the process of loading the data is also consistent whether you are training on a single CPU, a GPU, a cluster of GPUs, or even pods of TPUs.

How you load the data, however, can have a huge impact on your training speed. Let’s take a look at that next.

Optimizing the Load Phase

Let’s take a closer look at the Extract-Transform-Load process when training a model. We can consider the extraction and transformation of the data to be possible on any processor, including a CPU. In fact, the code used in these phases to perform tasks like downloading data, unzipping it, and going through it record by record and processing them is not what GPUs or TPUs are built for, so this code will likely execute on the CPU anyway. When it comes to training, however, you can get great benefits from a GPU or TPU, so it makes sense to use one for this phase if possible. Thus, in the situation where a GPU or TPU is available to you, you should ideally split the workload between the CPU and the GPU/TPU, with Extract and Transform taking place on the CPU, and Load taking place on the GPU/TPU.

Suppose you’re working with a large dataset. Assuming it’s so large that you have to prepare the data (i.e., do the extraction and transformation) in batches, you’ll end up with a situation like that shown in Figure 4-3. While the first batch is being prepared, the GPU/TPU is idle. When that batch is ready it can be sent to the GPU/TPU for training, but now the CPU is idle until the training is done, when it can start preparing the second batch. There’s a lot of idle time here, so we can see that there’s room for optimization.

Figure 4-3. Training on a CPU/GPU

The logical solution is to do the work in parallel, preparing and training side by side. This process is called pipelining and is illustrated in Figure 4-4.

Figure 4-4. Pipelining

In this case, while the CPU prepares the first batch the GPU/TPU again has nothing to work on, so it’s idle. When the first batch is done, the GPU/TPU can start training—but in parallel with this, the CPU will prepare the second batch. Of course, the time it takes to train batch n – 1 and prepare batch n won’t always be the same. If the training time is faster, you’ll have periods of idle time on the GPU/TPU. If it’s slower, you’ll have periods of idle time on the CPU. Choosing the correct batch size can help you optimize here—and as GPU/TPU time is likely more expensive, you’ll probably want to reduce its idle time as much as possible.

You probably noticed when we moved from simple datasets like Fashion MNIST in Keras to using the TFDS versions that you had to batch them before you could train. This is why: the pipelining model is in place so that regardless of how large your dataset is, you’ll continue to use a consistent pattern for ETL on it.

Parallelizing ETL to Improve Training Performance

TensorFlow gives you all the APIs you need to parallelize the Extract and Transform process. Let’s explore what they look like using Dogs vs. Cats and the underlying TFRecord structures.

First, you use tfds.load to get the dataset:

train_data, info = tfds.load('cats_vs_dogs', split='train', with_info=True)

If you want to use the underlying TFRecords, you’ll need to access the raw files that were downloaded. As the dataset is large, it’s sharded across a number of files (eight, in version 4.0.0).

You can create a list of these files and use tf.data.Dataset.list_files to load them:

file_pattern = \
    f'/root/tensorflow_datasets/cats_vs_dogs/4.0.0/cats_vs_dogs-train.tfrecord*'
files = tf.data.Dataset.list_files(file_pattern)

Once you have the files, they can be loaded into a dataset using files.interleave like this:

train_dataset = files.interleave(
                     tf.data.TFRecordDataset, 
                     cycle_length=4,
                     num_parallel_calls=tf.data.experimental.AUTOTUNE
                )

There are a few new concepts here, so let’s take a moment to explore them.

The cycle_length parameter specifies the number of input elements that are processed concurrently. So, in a moment you’ll see the mapping function that decodes the records as they’re loaded from disk. Because cycle_length is set to 4, this process will be handling four records at a time. If you don’t specify this value, then it will be derived from the number of available CPU cores.

The num_parallel_calls parameter, when set, will specify the number of parallel calls to execute. Using tf.data.experimental.AUTOTUNE, as is done here, will make your code more portable because the value is set dynamically, based on the available CPUs. When combined with cycle_length, you’re setting the maximum degree of parallelism. So, for example, if num_parallel_calls is set to 6 after autotuning and cycle_length is 4, you’ll have six separate threads, each loading four records at a time.

Now that the Extract process is parallelized, let’s explore parallelizing the transformation of the data. First, create the mapping function that loads the raw TFRecord and converts it to usable content—for example, decoding a JPEG image into an image buffer:

def read_tfrecord(serialized_example):
  feature_description={
      "image": tf.io.FixedLenFeature((), tf.string, ""),
      "label": tf.io.FixedLenFeature((), tf.int64, -1),
  }
  example = tf.io.parse_single_example(
       serialized_example, feature_description
  )
  image = tf.io.decode_jpeg(example['image'], channels=3)
  image = tf.cast(image, tf.float32)
  image = image / 255
  image = tf.image.resize(image, (300,300))
  return image, example['label']

As you can see, this is a typical mapping function without any specific work done to make it work in parallel. That will be done when we call the mapping function. Here’s how to do that:

import multiprocessing

cores = multiprocessing.cpu_count()
print(cores)
train_dataset = train_dataset.map(read_tfrecord, num_parallel_calls=cores)
train_dataset = train_dataset.cache()

First, if you don’t want to autotune, you can use the multiprocessing library to get a count of your CPUs. Then, when you call the mapping function, you just pass this as the number of parallel calls that you want to make. It’s really as simple as that.

The cache method will cache the dataset in memory. If you have a lot of RAM available, this is a really useful speedup. Trying this in Colab with Dogs vs. Cats will likely crash your VM because the dataset doesn’t fit in RAM; if that happens, the Colab infrastructure may offer you a new, higher-RAM machine.
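If the dataset won’t fit in memory, one alternative sketch is to pass a filename to cache so the materialized records are written to disk instead (the path below is purely illustrative; point it anywhere with enough free space):

train_dataset = train_dataset.cache('/tmp/cats_vs_dogs_cache')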

Loading and training can also be parallelized. As well as shuffling and batching the data, you can prefetch based on the number of CPU cores that are available. Here’s the code:

train_dataset = train_dataset.shuffle(1024).batch(32)
train_dataset = train_dataset.prefetch(tf.data.experimental.AUTOTUNE)

Once your training set is all parallelized, you can train the model as before:

model.fit(train_dataset, epochs=10, verbose=1)

When I tried this in Google Colab, I found that this extra code to parallelize the ETL process reduced the training time to about 40 seconds per epoch, as opposed to 75 seconds without it. These simple changes cut my training time almost in half!

Summary

This chapter introduced TensorFlow Datasets, a library that gives you access to a huge range of datasets, from small learning ones up to full-scale datasets used in research. You saw how they use a common API and common format to help reduce the amount of code you have to write to get access to data. You also saw how to use the ETL process, which is at the heart of the design of TFDS, and in particular we explored parallelizing the extraction, transformation, and loading of data to improve training performance. In the next chapter you’ll take what you’ve learned and start applying it to natural language processing problems.
