Chapter 1. Introduction to TensorFlow 2
TensorFlow has long been the most popular open-source Python machine learning library. It was developed by the Google Brain team as an internal tool, but in 2015 it was released under an Apache License. Since then, it has evolved into an ecosystem full of important assets for model development and deployment. Today, it supports a wide variety of APIs and modules that are specifically designed to handle tasks such as data ingestion and transformation, feature engineering, and model construction and serving, as well as many more.
TensorFlow has become increasingly complex. The purpose of this book is to help simplify the common tasks that a data scientist or machine learning engineer will need to perform during an end-to-end model development process. This book does not focus on data science and algorithms; rather, the examples here use pre-built models as a vehicle to teach relevant concepts.
This book is written for readers with basic experience in and knowledge about building machine learning models. Some proficiency in Python programming is highly recommended. The content in each chapter typically stands alone; there is little dependency between chapters, so you can jump around as needed. If you do work through the book from beginning to end, you will gain a great deal of knowledge about the end-to-end model development process and the major tasks involved, including data engineering, ingestion, and preparation; model training; and model serving.
Improvements in TensorFlow 2
As TensorFlow grows, so does its complexity. The learning curve for new TensorFlow users is steep, because there are so many different aspects to keep in mind. How do I prepare the data for ingestion and training? How do I handle different data types? What do I need to consider for each handling method? These are just some of the basic questions you may have early in your machine learning journey.
A particularly difficult concept to get accustomed to is lazy execution, which means that TensorFlow doesn’t actually process your data until you explicitly tell it to execute the graph you have built. The idea is to improve performance. You can look at a machine learning model as a set of nodes and edges (in other words, a graph). When you run computations and transform data through the nodes, only the computations that lie directly in the path your data takes through the graph, from input to output, are executed. If the shape and format of the data are not correctly matched between one node and the next, you will not get an error until you compile the model. At that point it is rather difficult to trace where you passed the wrong data structure or tensor shape from one node to the next.
Through TensorFlow 1.x, lazy execution was the way to build and train a machine learning model. Starting with TensorFlow 2, however, eager execution is the default way to build and train a model. This change makes it much easier to debug the code and try different model architectures. Eager execution also makes it much easier to learn TensorFlow, in that you will see any mistakes immediately upon executing each line of code. You no longer need to build an entire graph of your model before you can debug and test whether your input data is in the right shape. This is one of several major features and improvements that make TensorFlow 2 easier to use than the previous version.
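The difference is easiest to see in a few lines of code. The following is a minimal sketch, assuming TensorFlow 2 is installed; the tensor values and the function name affine are made up for illustration. Each operation returns a concrete result as soon as it runs, and a shape mismatch raises an error on the offending line.

```python
import tensorflow as tf

# TensorFlow 2 runs operations eagerly unless told otherwise.
eager_enabled = tf.executing_eagerly()

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])  # shape (2, 2)
b = tf.constant([[1.0], [1.0]])            # shape (2, 1)

# The product is computed right away; no separate graph run is needed.
product = tf.matmul(a, b)
print(product.numpy())

# A shape mismatch fails on this very line, which makes it easy to locate.
bad = tf.constant([[1.0, 1.0, 1.0]])       # shape (1, 3), incompatible with a
shape_error_caught = False
try:
    tf.matmul(a, bad)
except (tf.errors.InvalidArgumentError, ValueError):
    shape_error_caught = True

# Graph execution is still available on request: tf.function traces the
# Python function into a graph the first time it is called.
@tf.function
def affine(x, w):
    return tf.matmul(x, w)

graph_product = affine(a, b)
```

The decorated function gives the same result as the eager call, so you can debug eagerly first and opt into graphs later.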
Keras, created by AI researcher François Chollet, is an open-source high-level deep learning API or framework. It is compatible with multiple machine learning libraries.
High-level implies that at a lower level, there is another framework that actually executes the computation—and this is indeed the case. These low-level frameworks include TensorFlow, Theano, and Cognitive Toolkit (CNTK), developed by Microsoft. The purpose of Keras is to provide easier syntax and coding style for users who want to leverage the low-level frameworks to build deep learning models.
After Chollet joined Google in 2015, Keras gradually became a keystone of TensorFlow adoption. In 2019, as the TensorFlow team launched version 2.0, they formally adopted Keras as TensorFlow’s first-class citizen API, known as tf.keras, for all future releases. Since then, TensorFlow has integrated tf.keras with many other important modules. For example, it works seamlessly with the tf.io module for reading distributed training data. It also works with the tf.data module, used for streaming training data too big to fit into a single computer’s memory. This book uses these modules throughout all chapters.
Today, TensorFlow users primarily rely on the tf.keras API for building deep learning models quickly and easily. The convenience of getting the training routine working quickly allows more time to experiment with different model architectures and tuning parameters in the model and training routine.
Reusable models in TensorFlow
Academic researchers have built and tested many different machine learning models, all of which tend to be very complicated in their architecture. It is not practical for users to learn how to build these models. Enter the idea of transfer learning, where a model developed for one task is reused to solve another task, in this case one defined by the user. This essentially boils down to transforming user data to the proper data structure at model input and output.
Naturally, there has been great interest in these models and their potential uses. Therefore, by popular demand, many models have become available in the open-source ecosystem.
TensorFlow created a repository, TensorFlow Hub, to offer the public free access to these complicated models. If you’re interested, you can try these models without having to build them yourself. In Chapter 4, you will learn how to download and use models from TensorFlow Hub. Once you do, you’ll just need to be aware of the data structure the model expects at input, and add a final output layer that is suitable for your prediction goal. Every model in TensorFlow Hub contains concise documentation that gives you the necessary information to construct your input data.
Another place to retrieve pre-built models is the tf.keras.applications module, which is also part of the TensorFlow distribution. In Chapter 4 you will learn how to use this module to leverage a pre-built model for your own data.
Making table-stakes operations easy
All of these improvements in TensorFlow 2 make a lot of important operations easier and more convenient to implement. Even so, building and training a machine learning model end-to-end is not a trivial task. This book will show you how to deal with each aspect of the TensorFlow 2 model training process, starting from the beginning.
Working with distributed datasets
First, you have to deal with the question of how to work with training data. Many didactic examples teach TensorFlow using pre-built training data in its native format, such as a small pandas DataFrame or a NumPy array, which will fit nicely in your computer’s memory. In a more realistic situation, however, you’ll likely have to deal with much more training data than your computer memory can handle. The size of a table read from a SQL database can easily reach into the gigabytes. Even if you have enough memory to load it into a pandas DataFrame or a NumPy array, chances are your Python runtime will run out of memory during computation and crash.
Large tables of data are typically saved as multiple files in common formats such as CSV or text. Because of this, you should not attempt to load each file into your Python runtime. The correct way of dealing with distributed datasets is to create a reference that points to the location of all the files.
For this reason, Chapter 2 will show you how to use the tf.io module, which gives you an object that holds a list of file paths and names. This is the preferred way to deal with training data regardless of its size and file count.
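As a rough illustration of holding references instead of data, the sketch below writes three tiny CSV part files to a temporary directory and then collects their paths with tf.io.gfile.glob. The file names, column names, and directory layout are invented for this example and are not taken from Chapter 2.

```python
import os
import tempfile

import tensorflow as tf

# Create a few small part files to stand in for a large, sharded table.
data_dir = tempfile.mkdtemp()
for part in range(3):
    part_path = os.path.join(data_dir, f"part-{part}.csv")
    with open(part_path, "w") as f:
        f.write("feature,label\n1.0,0\n2.0,1\n")

# Collect the file paths only; no file content is read into memory here.
file_pattern = os.path.join(data_dir, "part-*.csv")
file_paths = sorted(tf.io.gfile.glob(file_pattern))

for path in file_paths:
    print(path)
```

The resulting list costs the same small amount of memory whether each file holds two rows or two million.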
How do you intend to pass data to your model for training? This is an important skill, but many popular didactic examples approach it by passing the entire NumPy array into the model training routine. Just as with loading large training data, you will encounter memory issues if you try passing a large NumPy array to your model for training.
A better way to deal with this is through data streaming. Instead of passing the entire training data at once, you stream a subset or batch of data for the model to train with. In TensorFlow, this is known as your ‘dataset’. In Chapter 2, you are also going to learn how to make a dataset from the tf.io object. Dataset objects can be made from all sorts of native data structures. In Chapter 3, you will see how to make a tf.data.Dataset object from CSV files and images.
With the combination of tf.io and tf.data, you’ll set up a data handling workflow for model training without having to read or open a single data file in your Python runtime memory.
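To make the streaming idea concrete, here is a minimal sketch that turns a list of CSV file paths into a batched tf.data.Dataset. The file layout, the two-column schema, and the helper name parse_line are assumptions made for this example; the chapters that follow may build their pipelines differently.

```python
import os
import tempfile

import tensorflow as tf

# Write two small part files (five rows each) as stand-ins for real shards.
data_dir = tempfile.mkdtemp()
for part in range(2):
    with open(os.path.join(data_dir, f"part-{part}.csv"), "w") as f:
        f.write("feature,label\n")
        for row in range(5):
            f.write(f"{part * 5 + row}.0,{row % 2}\n")

file_paths = sorted(tf.io.gfile.glob(os.path.join(data_dir, "part-*.csv")))

def parse_line(line):
    # One float feature and one integer label per line (assumed schema).
    feature, label = tf.io.decode_csv(line, record_defaults=[[0.0], [0]])
    return feature, label

dataset = (
    tf.data.Dataset.from_tensor_slices(file_paths)
    # Read each file line by line, skipping its header row.
    .flat_map(lambda path: tf.data.TextLineDataset(path).skip(1))
    .map(parse_line)
    .batch(4)
)

# Only one batch of rows is materialized at a time while iterating.
batch_sizes = [int(features.shape[0]) for features, labels in dataset]
print(batch_sizes)
```

A dataset like this can be passed directly to a tf.keras training routine, which pulls one batch at a time instead of receiving the whole table.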
To create meaningful features from which your model can learn patterns, you need to apply data- or feature-engineering tasks to your training data. Depending on the data type, there are different ways to do this.
If you are working with tabular data, you may have different values or data types in different columns. In Chapter 3, you will see how to use TensorFlow’s feature_column API to standardize your training data. It helps you correctly mark which columns are numeric and which are categorical.
For image data, you will have different tasks. For example, all of the images in your dataset must have the same dimensions. Further, pixel values are typically normalized or scaled to a range of [0, 1]. For these tasks, tf.keras provides the ImageDataGenerator API, which standardizes image sizes and normalizes pixel values for you.
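The two image tasks themselves are simple to state in code. The sketch below performs them by hand with tf.image.resize and a division by 255, as a stand-in for what the generator API automates; the random images, the target size, and the helper name standardize are all assumptions for illustration.

```python
import numpy as np
import tensorflow as tf

# Two fake RGB images with different dimensions and 8-bit pixel values.
raw_images = [
    np.random.randint(0, 256, size=(40, 60, 3), dtype=np.uint8),
    np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8),
]

TARGET_SIZE = [28, 28]  # height and width expected by a hypothetical model

def standardize(image):
    # Resizing returns float32 values that are still in the 0 to 255 range.
    resized = tf.image.resize(image, TARGET_SIZE)
    # Scale pixel values to the range [0, 1].
    return resized / 255.0

# After resizing, the images share a shape and can be stacked into one batch.
standardized = tf.stack([standardize(image) for image in raw_images])
print(standardized.shape, standardized.dtype)
```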
TensorFlow Hub makes prebuilt open-source models available to everyone. In Chapter 4, you’ll learn how to use the KerasLayer API to access TensorFlow Hub. In addition, tf.keras comes with an inventory of these prebuilt models, which can be called using the tf.keras.applications API. In Chapter 4 I’ll show you how to use this API for transfer learning as well.
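The general shape of transfer learning with tf.keras.applications can be sketched as follows. This is an illustration under stated assumptions rather than the Chapter 4 example: MobileNetV2 is chosen arbitrarily, the five output classes and the 96 x 96 input size are hypothetical, and weights=None is used so the sketch runs without downloading anything. In practice you would typically pass weights="imagenet" to reuse what the model has already learned.

```python
import numpy as np
import tensorflow as tf

NUM_CLASSES = 5  # hypothetical number of classes in your own data

# Pre-built architecture without its original classification head.
base_model = tf.keras.applications.MobileNetV2(
    input_shape=(96, 96, 3),
    include_top=False,
    weights=None,  # assumption: use "imagenet" for real transfer learning
)
base_model.trainable = False  # freeze the reused layers

# Add a final output layer suited to the new prediction goal.
model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

# The input must match the data structure the base model expects.
dummy_batch = np.zeros((1, 96, 96, 3), dtype="float32")
predictions = model(dummy_batch)
print(predictions.shape)
```

Only the new Dense layer is trainable here, which is what keeps training on your own data fast.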
There is more than one way to implement a model using tf.keras, because some deep learning model architectures or patterns are more complicated than others. For common use, the symbolic API style, which sets up your model architecture sequentially, is likely to suffice. Another style is the imperative API, where you declare a model as a class, so that each time you call upon a model object, you are creating an instance of that class. This requires you to understand how class inheritance works (I’ll discuss this in Chapter 6). But if your programming background stems from an object-oriented programming language such as C++ or Java, then this API may have a more natural feel for you. Another reason for using the imperative API approach is to keep your model architecture code separate from the remaining workflow. In Chapter 6, you will learn how to set up and use both APIs.
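The two styles can be compared side by side. The following is a minimal sketch, not the Chapter 6 code: it uses tf.keras.Sequential for the symbolic style and a subclass of tf.keras.Model for the imperative style, and the layer sizes and the class name ImperativeModel are arbitrary choices for illustration.

```python
import numpy as np
import tensorflow as tf

# Symbolic style: the architecture is declared as a sequence of layers.
symbolic_model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])

# Imperative style: the architecture lives in a class that inherits
# from tf.keras.Model, and the forward pass is written in call().
class ImperativeModel(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.hidden = tf.keras.layers.Dense(8, activation="relu")
        self.out = tf.keras.layers.Dense(1)

    def call(self, inputs):
        return self.out(self.hidden(inputs))

imperative_model = ImperativeModel()  # an instance of the class

# Both models accept the same input and produce the same output shape.
batch = np.zeros((4, 3), dtype="float32")
symbolic_output = symbolic_model(batch)
imperative_output = imperative_model(batch)
print(symbolic_output.shape, imperative_output.shape)
```

Because the class definition is self-contained, it can live in its own module, separate from the data handling and training code.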
Monitoring the training process
Monitoring how your model is trained and validated across each epoch is an important aspect of model training. Having a validation step at the end of each epoch is the easiest thing you can do to guard against model overfitting, a phenomenon in which the model starts to memorize training data patterns rather than learning the features as intended. In “Monitoring the training process”, you will learn how to use various callbacks to save model weights and biases at every epoch. I’ll also walk you through how to set up and use TensorBoard to visualize the training process.
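As a sketch of what such monitoring looks like, the example below trains a tiny model on random data with a validation step per epoch, a ModelCheckpoint callback that saves weights after every epoch, and a TensorBoard callback that writes logs. The data, model, directory names, and checkpoint file pattern are all assumptions made for illustration.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Random data standing in for real training and validation sets.
rng = np.random.default_rng(0)
x_train = rng.normal(size=(64, 3)).astype("float32")
y_train = rng.normal(size=(64, 1)).astype("float32")
x_val = rng.normal(size=(16, 3)).astype("float32")
y_val = rng.normal(size=(16, 1)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(4, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

work_dir = tempfile.mkdtemp()
checkpoint_dir = os.path.join(work_dir, "checkpoints")
os.makedirs(checkpoint_dir)
log_dir = os.path.join(work_dir, "logs")

callbacks = [
    # Save the weights at the end of every epoch, one file per epoch.
    tf.keras.callbacks.ModelCheckpoint(
        filepath=os.path.join(checkpoint_dir, "epoch-{epoch:02d}.weights.h5"),
        save_weights_only=True,
    ),
    # Write training and validation metrics for TensorBoard to display.
    tf.keras.callbacks.TensorBoard(log_dir=log_dir),
]

history = model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),  # evaluated at the end of each epoch
    epochs=2,
    batch_size=16,
    verbose=0,
    callbacks=callbacks,
)

saved_checkpoints = sorted(os.listdir(checkpoint_dir))
print(history.history["val_loss"], saved_checkpoints)
```

A validation loss that rises while the training loss keeps falling is the usual sign of overfitting, and the per-epoch checkpoints let you go back to the weights from before that point.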
Even though you know how to handle distributed data and files and stream them into your model training routine, what if you find that training takes an unrealistic amount of time? This is where distributed training can help. It requires a cluster of hardware accelerators, such as graphics processing units (GPUs) or tensor processing units (TPUs). These accelerators are available through many public cloud providers. You can also work with one GPU or TPU (not a cluster) for free in Google Colab; I’ll show you how to use this and the tf.distribute.Strategy API to work through the example in the first part of “Distributed training”. This API really simplifies a lot of nuances and reduces the hard work involved in setting up distributed training.
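To give a sense of how little code is involved, here is a minimal sketch using tf.distribute.MirroredStrategy, one of the available strategies. It is an illustration, not the example from the distributed training chapter: the model and the per-replica batch size of 16 are arbitrary, and on a machine without GPUs the strategy simply runs with a single replica.

```python
import tensorflow as tf

# MirroredStrategy replicates the model on every GPU it can find on this
# machine; without GPUs it falls back to a single replica.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Variables must be created inside the strategy scope so that they are
# mirrored across replicas. The Input layer makes the weights get built here.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(3,)),
        tf.keras.layers.Dense(8, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# A common convention: scale the batch size with the number of replicas.
PER_REPLICA_BATCH = 16  # arbitrary value for illustration
global_batch_size = PER_REPLICA_BATCH * strategy.num_replicas_in_sync
print("Global batch size:", global_batch_size)
```

Training then proceeds through the usual fit call on the model; the strategy takes care of splitting each global batch among the replicas.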
A considerably more complicated alternative to the tf.distribute.Strategy API is the Horovod API, open sourced by Uber’s engineering team. It was released before the tf.distribute.Strategy API and is specifically built to run training routines on a computing cluster. To learn how to use Horovod, you will need to use Databricks, a cloud-based computing platform, to work through the example in the second part of “Distributed training”. This will help you learn how to refactor your code to distribute and shard data for the Horovod API.
Serving your TensorFlow model
Once you’ve built your model and trained it successfully, it’s time for you to persist or store the model, so it can be served to handle user input. You’ll see how easy it is to use the tf.saved_model API to save your model.
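The save-and-restore round trip can be sketched in a few lines. To keep the example small and independent of any particular training code, a tiny tf.Module named Scaler stands in for a trained model; the class, its single weight, and the export path are all hypothetical.

```python
import os
import tempfile

import tensorflow as tf

class Scaler(tf.Module):
    """A stand-in for a trained model: multiplies its input by one weight."""

    def __init__(self):
        super().__init__()
        self.weight = tf.Variable(2.0)

    # The input signature fixes what the saved model accepts when served.
    @tf.function(input_signature=[tf.TensorSpec(shape=[None], dtype=tf.float32)])
    def __call__(self, x):
        return x * self.weight

# Save under <name>/<version>, a directory layout commonly used for serving.
export_dir = os.path.join(tempfile.mkdtemp(), "scaler", "1")
tf.saved_model.save(Scaler(), export_dir)

# Load the model back without needing the original Python class.
restored = tf.saved_model.load(export_dir)
restored_output = restored(tf.constant([1.0, 2.0, 3.0])).numpy().tolist()
print(restored_output)
```

The export directory holds both the computation and the weight values, which is what makes the model portable to a serving environment.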
Typically, the model is hosted by a web service. This is where TensorFlow Serving comes into the picture: it’s a framework that wraps your model and exposes it for web service calls over HTTP. In Chapter 9, you will learn how to use a TensorFlow Serving Docker image to host your model.
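From the client’s side, a call to a served model is an ordinary HTTP POST with a JSON body. The sketch below only builds the URL and the request body for TensorFlow Serving’s REST predict endpoint and does not send anything; the host, the model name my_model, and the input values are assumptions, and 8501 is the REST port the official Docker image conventionally exposes.

```python
import json

MODEL_NAME = "my_model"  # hypothetical model name
HOST = "localhost"       # assumes the serving container runs locally
REST_PORT = 8501         # conventional REST port for TensorFlow Serving

# The REST predict endpoint follows the pattern /v1/models/<name>:predict.
predict_url = f"http://{HOST}:{REST_PORT}/v1/models/{MODEL_NAME}:predict"

# Each entry in "instances" is one input example for the model.
instances = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
request_body = json.dumps({"instances": instances})

print(predict_url)
print(request_body)
```

With a server running, this body can be posted using any HTTP client, and the response comes back as JSON containing one prediction per instance.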
Improving the training experience
Finally, Chapter 10 discusses some important aspects of assessing and improving your model training process. I’ll show you how to use the TensorFlow Model Analysis module to look into the issue of model bias. This module provides an interactive dashboard designed to reveal model bias, called the Fairness Indicator. Using a Jupyter Notebook environment and the model you trained on the Titanic dataset from Chapter 3, you’ll see how Fairness Indicator works.
Another improvement brought about by the tf.keras API is that it makes performing hyperparameter tuning more convenient. Hyperparameters are attributes related to model training routines or model architectures. Tuning them is typically a tedious process, as it involves thoroughly searching over the parameter space. In Chapter 10 you’ll see how to use the kerastuner library and an advanced search algorithm known as Hyperband to conduct hyperparameter tuning work.
TensorFlow 2 is a major overhaul of the previous version. It has been greatly improved, mostly by designating the tf.keras API as the recommended way to use TensorFlow. In addition, this API works seamlessly with the tf.io and tf.data APIs for an end-to-end model training process. These improvements speed up model building and debugging so you can experiment with other aspects of model training, such as trying different architectures or conducting more efficient hyperparameter searches. So let’s get started.