Chapter 1. Introduction to TensorFlow 2
TensorFlow has long been one of the most popular open source Python machine learning (ML) libraries. It was developed by the Google Brain team as an internal tool, and in 2015 it was released under the Apache 2.0 License. Since then, it has evolved into an ecosystem full of important assets for model development and deployment. Today it supports a wide variety of APIs and modules designed to handle tasks such as data ingestion and transformation, feature engineering, and model construction and serving.
TensorFlow has become increasingly complex. The purpose of this book is to help simplify the common tasks that a data scientist or ML engineer will need to perform during an end-to-end model development process. This book does not focus on data science and algorithms; rather, the examples here use prebuilt models as a vehicle to teach relevant concepts.
This book is written for readers with basic experience in and knowledge about building ML models. Some proficiency in Python programming is highly recommended. If you work through the book from beginning to end, you will gain a great deal of knowledge about the end-to-end model development process and the major tasks involved, including data engineering, ingestion, and preparation; model training; and serving the model.
The source code for the examples in the book was developed and tested with Google Colaboratory (Colab, for short) and a MacBook Pro running macOS Big Sur, version 11.2.3. The TensorFlow version used is 2.4.1, and the Python version is 3.7.
Improvements in TensorFlow 2
As TensorFlow grows, so does its complexity. The learning curve for new TensorFlow users is steep because there are so many different aspects to keep in mind. How do I prepare the data for ingestion and training? How do I handle different data types? What do I need to consider for different handling methods? These are just some of the basic questions you may have early in your ML journey.
A particularly difficult concept to get accustomed to is lazy execution, which means that TensorFlow doesn’t actually process your data until you explicitly tell it to run the graph you have built. The idea is to speed up performance. You can look at an ML model as a set of nodes and edges (in other words, a graph). When data flows through the graph, only the computations that lie directly on the path from input to output are executed; nodes outside that path are never calculated. The drawback is that if the shape or format of the data does not match between one node and the next, you will only get an error when you compile the model. At that point it is rather difficult to trace which node received the wrong data structure or tensor shape.
Through TensorFlow 1.x, lazy execution was the way to build and train an ML model. Starting with TensorFlow 2, however, eager execution is the default way to build and train a model. This change makes it much easier to debug the code and try different model architectures. Eager execution also makes it much easier to learn TensorFlow, in that you will see any mistakes immediately upon executing each line of code. You no longer need to build an entire graph of your model before you can debug and test whether your input data is in the right shape. This is one of several major features and improvements that make TensorFlow 2 easier to use than previous versions.
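To make the difference concrete, here is a minimal sketch of eager execution. The tensor names and values are my own illustrations, not taken from the book's examples. Each operation returns a concrete value right away, and a shape mismatch is reported on the line that causes it.

```python
import tensorflow as tf

# Operations run immediately in TensorFlow 2 and return concrete values.
features = tf.constant([[1.0, 2.0], [3.0, 4.0]])  # shape (2, 2)
weights = tf.constant([[1.0], [1.0]])             # shape (2, 1)
product = tf.matmul(features, weights)            # shape (2, 1)

print(tf.executing_eagerly())
print(product.numpy())

# A shape mismatch surfaces as soon as the offending line runs,
# not later when a whole graph is compiled.
bad_weights = tf.constant([[1.0, 1.0, 1.0]])      # shape (1, 3): inner dimensions differ
try:
    tf.matmul(features, bad_weights)
    shape_error = None
except (tf.errors.InvalidArgumentError, ValueError) as err:
    shape_error = type(err).__name__
print(shape_error)
```

Because the failure happens at the call site, you can inspect the two tensors in the same session and fix the shape before moving on.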
Keras, created by François Chollet, is a high-level deep learning API. High-level implies that at a lower level there is another framework that actually executes the computation, and this is indeed the case. These low-level frameworks include TensorFlow, Theano, and the Microsoft Cognitive Toolkit (CNTK). The purpose of Keras is to provide easier syntax and coding style for users who want to leverage the low-level frameworks to build deep-learning models.
After Chollet joined Google in 2015, Keras gradually became a keystone of TensorFlow adoption. In 2019, as the TensorFlow team launched version 2.0, it formally adopted Keras as TensorFlow’s first-class citizen API, known as tf.keras, for all future releases. Since then, TensorFlow has integrated tf.keras with many other important modules. For example, it works seamlessly with the tf.io API for reading distributed training data. It also works with the tf.data.Dataset class, used for streaming training data too big to fit into a single computer. This book uses these modules throughout all chapters.
Today TensorFlow users primarily rely on the tf.keras API for building deep models quickly and easily. The convenience of getting the training routine working quickly allows more time to experiment with different model architectures and tuning parameters in the model and training routine.
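As a rough illustration of how little code a tf.keras training routine needs, the sketch below trains a tiny classifier on synthetic data streamed through tf.data.Dataset. The data, layer sizes, and batch size are all assumptions chosen to keep the example small.

```python
import numpy as np
import tensorflow as tf

# Synthetic data (an assumption for illustration): 64 rows, 4 numeric features.
rng = np.random.default_rng(seed=0)
features = rng.normal(size=(64, 4)).astype("float32")
labels = (features.sum(axis=1, keepdims=True) > 0).astype("float32")

# Stream the data to the model in batches of 16.
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(16)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# fit() accepts the dataset directly; no manual batching loop is needed.
history = model.fit(dataset, epochs=2, verbose=0)
print(sorted(history.history.keys()))
```

Swapping in a different architecture means editing only the list of layers, which is what makes quick experimentation practical.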
Reusable Models in TensorFlow
Academic researchers have built and tested many ML models, and these tend to have complicated architectures. It is not practical for most users to learn how to build these models from scratch. Enter the idea of transfer learning, where a model developed for one task is reused to solve another task, in this case one defined by the user. For you as the user, this essentially boils down to transforming your data into the data structure the model expects at its input, and shaping the model output to fit your task.
Naturally, there has been great interest in these models and their potential uses. Therefore, by popular demand, many models have become available in the open source ecosystem. TensorFlow created a repository, TensorFlow Hub, to offer the public free access to these complicated models. If you’re interested, you can try these models without having to build them yourself. In Chapter 4, you will learn how to download and use models from TensorFlow Hub. Once you do, you’ll just need to be aware of the data structure the model expects at input, and add a final output layer that is suitable for your prediction goal. Every model in TensorFlow Hub contains concise documentation that gives you the necessary information to construct your input data.
Another place to retrieve prebuilt models is the tf.keras.applications module, which is part of the TensorFlow distribution. In Chapter 4, you’ll learn how to use this module to leverage a prebuilt model for your own data.
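The following sketch shows the general shape of reusing a tf.keras.applications model. The choice of MobileNetV2, the 96 x 96 input size, and the 10-class output layer are my assumptions. I also pass weights=None so that the example runs without downloading anything; for actual transfer learning you would typically pass weights="imagenet" to get the pretrained weights.

```python
import tensorflow as tf

# Prebuilt architecture without its original classification head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(96, 96, 3),
    include_top=False,
    weights=None,   # assumption: skip the download; use "imagenet" for transfer learning
    pooling="avg",  # collapse the feature maps into one vector per image
)
base.trainable = False  # keep the reused layers fixed

# Add a final output layer suited to your own prediction goal (10 classes here).
inputs = tf.keras.Input(shape=(96, 96, 3))
x = base(inputs, training=False)
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)

print(model.output_shape)  # (None, 10)
```

The only parts you write yourself are the input shape and the final layer, which matches the idea that reuse is mostly a matter of getting the data structure right at input and output.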
Making Commonly Used Operations Easy
All of these improvements in TensorFlow 2 make a lot of important operations easier and more convenient to implement. Even so, building and training an ML model end to end is not a trivial task. This book will show you how to deal with each aspect of the TensorFlow 2 model training process, starting from the beginning. Following are some of these operations.
Open Source Data
A convenient package that works with TensorFlow 2 is the TensorFlow Datasets library. It is a collection of curated open source datasets that are readily available for use. This library contains datasets of images, text, audio, videos, and many other formats. Some are NumPy arrays, while others are in dataset structures. This library also provides documentation for how to use TensorFlow to load these datasets. By distributing a wide variety of open source data alongside its product, the TensorFlow team saves users a lot of the trouble of searching for, integrating, and reshaping training data for a TensorFlow workload. Some of the open source datasets we’ll use in this book are the Titanic dataset for structured data classification and the CIFAR-10 dataset for image classification.
Working with Distributed Datasets
First you have to deal with the question of how to work with training data. Many didactic examples teach TensorFlow using prebuilt training data in its native format, such as a small pandas DataFrame or a NumPy array, which will fit nicely in your computer’s memory. In a more realistic situation, however, you’ll likely have to deal with much more training data than your computer memory can handle. The size of a table read from a SQL database can easily reach into the gigabytes. Even if you have enough memory to load it into a pandas DataFrame or a NumPy array, chances are your Python runtime will run out of memory during computation and crash.
Large tables of data are typically saved as multiple files in common formats such as CSV (comma-separated value) or text. Because of this, you should not attempt to load each file in your Python runtime. The correct way to deal with distributed datasets is to create a reference that points to the location of all the files. Chapter 2 will show you how to use the tf.io API, which gives you an object that holds a list of file paths and names. This is the preferred way to deal with training data regardless of its size and file count.
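A minimal sketch of the idea, assuming the data lives in several CSV parts under one directory. The temporary directory, the part-*.csv naming pattern, and the file contents are hypothetical; tf.io.gfile.glob is one tf.io call that returns the matching paths without opening any of the files.

```python
import os
import tempfile

import tensorflow as tf

# Create three small CSV parts to stand in for a large, split-up table.
data_dir = tempfile.mkdtemp()
for part in range(3):
    path = os.path.join(data_dir, f"part-{part:02d}.csv")
    with open(path, "w") as handle:
        handle.write("x,label\n1.0,0\n2.0,1\n")

# Build a reference to every file that matches the pattern.
# Only the paths are collected; no file content is read here.
file_pattern = os.path.join(data_dir, "part-*.csv")
csv_files = sorted(tf.io.gfile.glob(file_pattern))

print(len(csv_files))  # 3
```

The resulting list of paths is cheap to hold in memory no matter how large the underlying files are.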
The next question is how you intend to pass data to your model for training. This is an important skill, but many popular didactic examples gloss over it by passing the entire NumPy array into the model training routine. Just as with loading large training data, you will encounter memory issues if you try passing a large NumPy array to your model for training.
A better way to deal with this is through data streaming. Instead of passing the entire training data at once, you stream a subset or batch of data for the model to train with. In TensorFlow, the object that delivers these batches is known as a dataset. In Chapter 2, you are also going to learn how to make a dataset from the tf.io object. Dataset objects can be made from all sorts of native data structures. In Chapter 3, you will see how to make a tf.data.Dataset object from CSV files and images.
With the combination of tf.io and tf.data.Dataset, you’ll set up a data handling workflow for model training without having to read or open a single data file in your Python runtime memory.
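Here is one way this workflow might look, sketched under a few assumptions: the files are small CSV parts that I create on the fly, the column names x1, x2, and label are hypothetical, and I use tf.data.experimental.make_csv_dataset as the bridge from file paths to a batched dataset. The book's chapters may build the dataset differently.

```python
import os
import tempfile

import tensorflow as tf

# Two small CSV parts with four rows each (stand-ins for much larger files).
data_dir = tempfile.mkdtemp()
for part in range(2):
    with open(os.path.join(data_dir, f"part-{part}.csv"), "w") as handle:
        handle.write("x1,x2,label\n")
        for row in range(4):
            handle.write(f"{row}.5,{row * 2}.0,{row % 2}\n")

# Step 1: collect file paths only.
csv_files = sorted(tf.io.gfile.glob(os.path.join(data_dir, "part-*.csv")))

# Step 2: build a dataset that reads the files lazily, one batch at a time.
dataset = tf.data.experimental.make_csv_dataset(
    csv_files,
    batch_size=4,
    label_name="label",  # column to split off as the training target
    num_epochs=1,        # read the files once instead of repeating forever
    shuffle=False,
)

# Each iteration yields a dict of feature columns and a label tensor.
rows_seen = 0
for features, labels in dataset:
    rows_seen += int(labels.shape[0])

print(rows_seen)  # 8
```

At no point does the script load a whole file into a DataFrame or array; rows are parsed only as each batch is requested.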
To create meaningful features from which your model can learn patterns, you need to apply data- or feature-engineering tasks to your training data. Depending on the data type, there are different ways to do this.
If you are working with tabular data, you may have different values or data types in different columns. In Chapter 3, you will see how to use TensorFlow’s feature_column API to standardize your training data. It helps you correctly mark which columns are numeric and which are categorical.
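A short sketch of marking columns with the feature_column API, as available in the TensorFlow 2.4 release this book uses. The column names age and sex and the vocabulary are hypothetical, Titanic-style examples. Note that newer TensorFlow releases mark this API as deprecated and steer new code toward Keras preprocessing layers.

```python
import tensorflow as tf

# A numeric column is passed to the model as a number.
age = tf.feature_column.numeric_column("age")

# A categorical column needs its set of allowed values declared.
sex = tf.feature_column.categorical_column_with_vocabulary_list(
    "sex", ["male", "female"]
)

# Wrap the categorical column so it is fed to the model as a one-hot vector.
sex_one_hot = tf.feature_column.indicator_column(sex)

feature_columns = [age, sex_one_hot]
print(len(feature_columns))  # 2
```

The list of columns acts as a declaration of how each field in the table should be interpreted before it reaches the model.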
For image data, you will have different tasks. For example, all of the images in your dataset must have the same dimensions. Further, pixel values are typically normalized or scaled to a range of [0, 1]. For these tasks, tf.keras provides the ImageDataGenerator class, which standardizes image sizes and normalizes pixel values for you.
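To show what these two standardization steps amount to, the sketch below performs them by hand with tf.image.resize and a division by 255. This is deliberately not ImageDataGenerator itself; it is a stand-in that makes the underlying operations visible. The image sizes, the random pixel data, and the 32 x 32 target size are assumptions.

```python
import numpy as np
import tensorflow as tf

# Two fake RGB images with different dimensions and 8-bit pixel values.
rng = np.random.default_rng(seed=0)
raw_images = [
    rng.integers(0, 256, size=(48, 64, 3), dtype=np.uint8),
    rng.integers(0, 256, size=(80, 50, 3), dtype=np.uint8),
]


def standardize(image, size=(32, 32)):
    """Resize an image to a common size and scale its pixels to [0, 1]."""
    resized = tf.image.resize(image, size)  # returns float32 values
    return resized / 255.0


# Once every image has the same shape, the images can be stacked into one batch.
batch = tf.stack([standardize(image) for image in raw_images])
print(batch.shape)  # (2, 32, 32, 3)
```

Without the resize step, tf.stack would fail because the two images have different shapes, which is exactly why uniform dimensions are a requirement.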
TensorFlow Hub makes prebuilt, open source models available to everyone. In Chapter 4, you’ll learn how to use the Keras layers API to access TensorFlow Hub. In addition, tf.keras comes with an inventory of these prebuilt models, which can be called using the tf.keras.applications module. In Chapter 4, you’ll learn how to use this module for transfer learning as well.
There is definitely more than one way you can implement a model using tf.keras. This is because some deep learning model architectures or patterns are more complicated than others. For common use, the symbolic API style, which sets up your model architecture sequentially, is likely to suffice. Another style is the imperative API, where you declare a model as a class, so that each model object you create is an instance of that class. This requires you to understand how class inheritance works (I’ll discuss this in Chapter 6). If your programming background stems from an object-oriented programming language such as C++ or Java, then this API may have a more natural feel for you. Another reason for using the imperative API approach is to keep your model architecture code separate from the remaining workflow. In Chapter 6, you will learn how to set up and use both of these API styles.
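The two styles can be compared side by side. In this sketch the layer sizes, the class name ImperativeModel, and the attribute names are my own choices for illustration; both models describe the same small architecture.

```python
import tensorflow as tf

# Symbolic style: declare the layers in order.
symbolic_model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])


# Imperative style: inherit from tf.keras.Model and write the forward pass.
class ImperativeModel(tf.keras.Model):
    def __init__(self, hidden_units=8, num_classes=3):
        super().__init__()
        self.hidden = tf.keras.layers.Dense(hidden_units, activation="relu")
        self.classifier = tf.keras.layers.Dense(num_classes, activation="softmax")

    def call(self, inputs):
        # The forward pass is ordinary Python, so it can hold arbitrary logic.
        return self.classifier(self.hidden(inputs))


imperative_model = ImperativeModel()

sample = tf.zeros((2, 4))
symbolic_output = symbolic_model(sample)
imperative_output = imperative_model(sample)
print(symbolic_output.shape, imperative_output.shape)
```

The class-based version is longer, but the architecture lives in one self-contained unit that the rest of the training workflow can import and instantiate.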
Monitoring the Training Process
Monitoring how your model is trained and validated across each epoch (that is, one pass over a training set) is an important aspect of model training. Having a validation step at the end of each epoch is the easiest thing you can do to guard against model overfitting, a phenomenon in which the model starts to memorize training data patterns rather than learning the features as intended. In Chapter 7, you will learn how to use various callbacks to save model weights and biases at every epoch. I’ll also walk you through how to set up and use TensorBoard to visualize the training process.
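As a small preview, the sketch below validates at the end of every epoch and uses the ModelCheckpoint callback to save weights each time. The synthetic data, the model, and the checkpoint filename pattern are assumptions. The tiny in-memory array is used only to keep the example short.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(seed=0)
features = rng.normal(size=(80, 4)).astype("float32")
labels = (features.sum(axis=1, keepdims=True) > 0).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Save the weights at the end of every epoch; {epoch:02d} is filled in by Keras.
checkpoint_dir = tempfile.mkdtemp()
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath=os.path.join(checkpoint_dir, "epoch-{epoch:02d}.weights.h5"),
    save_weights_only=True,
)

history = model.fit(
    features,
    labels,
    validation_split=0.25,  # hold out a quarter of the rows for validation
    epochs=2,
    batch_size=16,
    verbose=0,
    callbacks=[checkpoint],
)

saved_files = sorted(os.listdir(checkpoint_dir))
print(saved_files)
print(sorted(history.history.keys()))
```

Comparing the loss and val_loss entries in history.history across epochs is the simplest check for overfitting: training loss that keeps falling while validation loss rises is the warning sign.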
Even though you know how to handle distributed data and files and stream them into your model training routine, what if you find that training takes an unrealistic amount of time? This is where distributed training can help. It requires a cluster of hardware accelerators, such as graphics processing units (GPUs) or Tensor Processing Units (TPUs). These accelerators are available through many public cloud providers. You can also work with one GPU or TPU (not a cluster) for free in Google Colab. In the first part of Chapter 8, you’ll learn how to use Colab together with the tf.distribute.MirroredStrategy class, which simplifies and reduces the hard work of setting up distributed training.
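A minimal sketch of the pattern, assuming a small regression model of my own invention. The key step is that the model is built and compiled inside the strategy's scope. On a machine with no GPU, the strategy simply falls back to a single replica, so the same code still runs.

```python
import tensorflow as tf

# The strategy discovers the available devices on this machine.
strategy = tf.distribute.MirroredStrategy()
print(strategy.num_replicas_in_sync)

# Variables created inside the scope are mirrored across all replicas.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(4,)),
        tf.keras.layers.Dense(8, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# A common convention: scale the batch size with the number of replicas.
per_replica_batch_size = 16
global_batch_size = per_replica_batch_size * strategy.num_replicas_in_sync
print(global_batch_size)
```

After this setup, training is started with the usual fit call; the strategy takes care of splitting each batch across the replicas.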
Besides tf.distribute.MirroredStrategy, the Horovod API from Uber’s engineering team is a considerably more complicated alternative. It’s specifically built to run training routines on a computing cluster. To learn how to use Horovod, you will need to use Databricks, a cloud-based computing platform, to work through the example in the second part of Chapter 8. This will help you learn how to refactor your code to distribute and shard data for the Horovod API.
Serving Your TensorFlow Model
Once you’ve built your model and trained it successfully, it’s time for you to persist, or store, the model so it can be served to handle user input. You’ll see how easy it is to use the tf.saved_model API to save your model.
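The save-and-restore round trip can be sketched as follows. To keep the example tiny, I use a hypothetical tf.Module named Doubler as a stand-in for a trained model; in the TensorFlow version used in this book, a trained tf.keras model can be handed to tf.saved_model.save in the same way. The directory layout with a trailing version number is also my choice here.

```python
import os
import tempfile

import tensorflow as tf


class Doubler(tf.Module):
    """A stand-in for a trained model: multiplies its input by a stored weight."""

    def __init__(self):
        super().__init__()
        self.scale = tf.Variable(2.0)

    @tf.function(input_signature=[tf.TensorSpec(shape=[None], dtype=tf.float32)])
    def __call__(self, x):
        return x * self.scale


# Persist the model; the trailing "1" is a version number directory.
export_dir = os.path.join(tempfile.mkdtemp(), "doubler", "1")
tf.saved_model.save(Doubler(), export_dir)

# Load it back without needing the original class definition.
restored = tf.saved_model.load(export_dir)
result = restored(tf.constant([1.0, 2.0, 3.0]))
print(result.numpy())  # [2. 4. 6.]
```

The saved directory contains both the computation and the variable values, which is what lets a separate serving process load the model later.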
Typically, the model is hosted by a web service. This is where TensorFlow Serving comes into the picture: it’s a framework that wraps your model and exposes it for web service calls via HTTP. In Chapter 9, you will learn how to use a TensorFlow Serving Docker image to host your model.
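From the client's point of view, calling a served model is an ordinary HTTP request with a JSON body. The sketch below only builds the request; it does not send it, since that requires a running server. The host, the model name my_model, and the four-value input row are hypothetical. Port 8501 is the default REST port of TensorFlow Serving.

```python
import json

# Hypothetical deployment details.
host = "localhost"
port = 8501
model_name = "my_model"

# TensorFlow Serving exposes a predict endpoint per model.
url = f"http://{host}:{port}/v1/models/{model_name}:predict"

# Each entry in "instances" is one input example for the model.
payload = json.dumps({"instances": [[1.0, 2.0, 3.0, 4.0]]})

print(url)
print(payload)

# With a server running, an HTTP client would POST the payload to the URL
# and read the model output from the "predictions" field of the JSON response.
```

The shape of each instance must match what the model expects at its input, just as it does during training.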
Improving the Training Experience
Finally, Chapter 10 discusses some important aspects of assessing and improving your model training process. You’ll learn how to use the TensorFlow Model Analysis module to look into the issue of model bias. This module provides an interactive dashboard, called Fairness Indicators, designed to reveal model bias. Using a Jupyter Notebook environment and the model you trained on the Titanic dataset from Chapter 3, you’ll see how Fairness Indicators works.
Another improvement brought about by the tf.keras API is that it makes performing hyperparameter tuning more convenient. Hyperparameters are attributes related to model training routines or model architectures. Tuning them is typically a tedious process, as it involves thoroughly searching over the parameter space. In Chapter 10 you’ll see how to use the Keras Tuner library and an advanced search algorithm known as Hyperband to conduct hyperparameter tuning work.
TensorFlow 2 is a major overhaul of the previous version. Its most significant improvement is designating the tf.keras API as the recommended way to use TensorFlow. This API works seamlessly with tf.data.Dataset for an end-to-end model training process. These improvements speed up model building and debugging so you can experiment with other aspects of model training, such as trying different architectures or conducting more efficient hyperparameter searches. So, let’s get started.