Chapter 4. Using Public Datasets with TensorFlow Datasets
In the first chapters of this book you trained models using a variety of data, from the Fashion MNIST dataset that is conveniently bundled with Keras to the image-based Horses or Humans and Dogs vs. Cats datasets, which were available as ZIP files that you had to download and preprocess. You’ve probably already realized that there are lots of different ways of getting the data with which to train a model.
However, many public datasets require you to learn lots of different domain-specific skills before you begin to consider your model architecture. The goal behind TensorFlow Datasets (TFDS) is to expose datasets in a way that’s easy to consume, where all the preprocessing steps of acquiring the data and getting it into TensorFlow-friendly APIs are done for you.
You’ve already seen a little of this idea with how Keras handled Fashion MNIST back in Chapters 1 and 2. As a recap, all you had to do to get the data was this:
import tensorflow as tf
data = tf.keras.datasets.fashion_mnist
(training_images, training_labels), (test_images, test_labels) = data.load_data()
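To make the shape of that result concrete, here is a minimal sketch that unpacks the two (images, labels) pairs returned by `load_data()` and counts the examples in each split. The helper names `split_sizes` and `load_and_report` are hypothetical, introduced only for this illustration; they are not part of Keras or TensorFlow.

```python
def split_sizes(data):
    """Return (train_count, test_count) for data laid out like load_data()'s result.

    `data` is expected to be ((train_images, train_labels), (test_images, test_labels)).
    This helper is hypothetical and exists only to illustrate that layout.
    """
    (train_images, train_labels), (test_images, test_labels) = data
    # Every image should have exactly one label.
    if len(train_images) != len(train_labels):
        raise ValueError("training images and labels differ in length")
    if len(test_images) != len(test_labels):
        raise ValueError("test images and labels differ in length")
    return len(train_images), len(test_images)


def load_and_report():
    """Load Fashion MNIST through Keras and print the size of each split."""
    # Imported here so split_sizes() stays usable without TensorFlow installed.
    import tensorflow as tf

    # The first call downloads the dataset and caches it locally.
    data = tf.keras.datasets.fashion_mnist.load_data()
    train_count, test_count = split_sizes(data)
    print(f"training examples: {train_count}, test examples: {test_count}")
```

Calling `load_and_report()` requires TensorFlow and, on first use, a network connection. Fashion MNIST ships with 60,000 training images and 10,000 test images, so those are the counts you should expect to see.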
TFDS builds on this idea, but greatly expands both the number of datasets available and the diversity of dataset types. The list of available datasets is growing all the time, in categories such as:
- Audio
  - Speech and music data
- Image
  - From simple learning datasets like Horses or Humans up to advanced research datasets for uses such as diabetic retinopathy detection
- Object detection ...