Chapter 1. What Is TensorFlow?

Toward the end of 2015, Google released TensorFlow. It started out as just another deep learning library but has grown into a de facto standard for enterprise AI, whereas PyTorch has gained more ground in academia. TensorFlow received a lot of momentum upon its initial release, in no small part because it was released by Google, but also because of its breadth of production-quality functionality.

What Is TensorFlow?

TensorFlow is a numerical library written in C++, with Python as one of its primary APIs. Apart from its linear algebra capabilities, it supports automatic differentiation, a key piece of functionality needed in machine learning.

The main functional unit of TensorFlow is a static execution graph. Each mathematical operation is represented as a node, and data flows between the nodes as tensors along the edges; hence the name TensorFlow. The graph is static because it doesn't change dynamically at runtime, which allows TensorFlow to optimize it before execution (e.g., by reordering operations for performance reasons).

In other words, each node in such a graph is an operation, while an edge represents the flow of data, connecting operations together.

In the next few sections, we’ll cover basic linear algebra operations that are implemented in TensorFlow, including the most basic operations needed to understand how neural networks are implemented on top of a numerical library like TensorFlow. Unless stated otherwise, TensorFlow 2.x code is used in all examples.

Tip

A graph is a mathematical data structure composed of nodes and edges. A node is often drawn as a circle or rectangle, while an edge is drawn as a line. An edge can have a direction, in which case the line is replaced by an arrow; an edge can also be bidirectional, in which case the arrow has two heads. Graphs can be used to model a vast variety of things, such as social networks, biological metabolic pathways, and transport networks, but also the execution of programs.

On Tensors, Dimensions, Ranks, Orders, Matrices, and Vectors

When it comes to the terms dimensionality, rank, and order, you have to distinguish whether the term describes a vector or matrix on the one hand, or a tensor on the other. In the first case, they are mathematical properties of vectors and matrices; in the second, they describe the shape of a data structure called a tensor.

A tensor is a mathematical data structure: a way to organize data, together with a set of rules (defined in linear algebra) for the mathematical operations performed on it.

To be more specific, the shape tells us not only how many dimensions we are dealing with but also quantifies the extent along each dimension. For example, a tensor with shape (2,3,4) has a rank of three, with two rows, three columns, and four layers.
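As a quick illustration in code (a minimal sketch; tf.zeros simply fills a tensor of the requested shape with zeros so that we can inspect it):

import tensorflow as tf

t = tf.zeros((2, 3, 4))  # two rows, three columns, four layers
print(t.shape)           # (2, 3, 4): the extent along each dimension
print(tf.rank(t))        # tf.Tensor(3, ...): the rank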

The dimensionality, rank, or order of a tensor itself defines its expansion in space. Table 1-1 illustrates this.

Table 1-1. Properties of tensors

Name          Dimensionality, rank, or order of tensor  Example
Scalar        zero                                      42
Vector        one                                       [30 36 42]
Matrix        two                                       [[30 36 42], [66 81 96], [102 126 150]]
Cube          three                                     three layers, each a 3×3 matrix with rows [30 36 42]
4D-Hypercube  four                                      the matrix [[30 36 42], [66 81 96], [102 126 150]], arranged in a 3×3 grid

Note

The official TensorFlow documentation uses the term rank exclusively.

Scalars span a one-dimensional vector space since they can unambiguously address any point in it. A one-dimensional vector space is also called a line. This means that with an infinite number of scalar values, you can address any point on a line.

Vectors span a vector space of any dimension: depending on its number of elements, a vector can unambiguously address any point in an arbitrarily high-dimensional vector space. The number of elements in a vector corresponds to the number of dimensions, or the order, of the vector space it occupies. This means that with an infinite number of vectors, you can address any point in any vector space, regardless of how many dimensions it has.

Note

A two-dimensional space is also called a plane. Depending on the theory you follow, our universe has 3, 4 (Einstein, space–time), 5 (Theodor Kaluza), or 11 (Edward Witten, superstring theory) dimensions. But the MNIST dataset, the “hello world” dataset of neural networks, has 784 dimensions, since its images have 28 by 28 pixels.

Matrices span a vector space of any dimension because, depending on the number of columns, they can unambiguously address any point in an arbitrarily high-dimensional vector space. Each row in a matrix corresponds to one point in that space. In other words, a matrix is a collection of points, or vectors, of the same length: the number of rows tells you how many points you have in that space, and the number of columns tells you how many dimensions that space has.

The order of a matrix is the number of rows (usually mentioned first) and columns (usually mentioned last).

The rank of a matrix is the number of linearly independent rows (or columns) and is often confused with the order of a matrix.

Three-dimensional tensors span multiple parallel vector spaces of the same dimensionality. Depending on the number of columns, a three-dimensional tensor can unambiguously address any point in an arbitrarily high-dimensional vector space. Each row corresponds to one point in that space, and each layer can be seen as a separate vector space with the same number of dimensions and points.

We haven’t used four-dimensional tensors for any practical application so far, but we’ve heard you need them to understand Einstein’s theory of general relativity.

Element-Wise Addition

The most basic linear algebra operation is adding two vectors, also known as element-wise addition:

a = tf.constant([1., 2., 2., 3.])
b = tf.constant([4., 5., 5., 6.])
c = a + b

Note that we are using Python as a DSL (domain-specific language) to interface with the TensorFlow backend. The call to tf.constant creates a tensor object that is backed by a C++ data structure. The memory for this tensor is not allocated on the Python interpreter's heap; it's allocated on the heap of the backend process, which is implemented in C++. All operations on this tensor expressed in Python are executed by the C++ backend.
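As a small illustration, you can inspect such a tensor from Python; calling numpy() explicitly copies its value from the backend into a NumPy array:

print(a.device)   # the device the backend placed the tensor on, e.g., ...CPU:0
print(a.dtype)    # <dtype: 'float32'>
print(a.numpy())  # [1. 2. 2. 3.], copied back into the Python process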

We've now created two tensors, and since they have a dimensionality (rank or order) of one, they are vectors. We can apply a function to them; in this case, the addition function. This works thanks to the operator overloading capabilities of the Python language: Python has a built-in addition operator for number data types like int and float, and TensorFlow adds an addition operator for tensors. This way, using the + operator symbol triggers execution of the addition function that TensorFlow brings along.

Therefore, the statement c = a + b triggers execution of the TensorFlow addition function, which applies element-wise addition to the two tensors, defined as:

$$(1, 2, 2, 3) + (4, 5, 5, 6) = (1 + 4, 2 + 5, 2 + 5, 3 + 6) = (5, 7, 7, 9)$$

Or, generally speaking:

$$(x_1, x_2, \dots, x_n) + (y_1, y_2, \dots, y_n) = (x_1 + y_1, x_2 + y_2, \dots, x_n + y_n)$$
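If you print c in TensorFlow 2.x's eager mode, you can confirm the result; the overloaded + operator is equivalent to calling tf.add directly:

print(c)             # tf.Tensor([5. 7. 7. 9.], shape=(4,), dtype=float32)
print(tf.add(a, b))  # the same result via the explicit function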

Element-Wise Multiplication

Going from addition to multiplication is pretty straightforward:

a = tf.constant([1., 2., 2., 3.])
b = tf.constant([4., 5., 5., 6.])
c = a*b

Again, operator overloading is used to apply the element-wise multiplication function to the two tensors.

The generic form of this operation is:

$$(x_1, x_2, \dots, x_n) * (y_1, y_2, \dots, y_n) = (x_1 y_1, x_2 y_2, \dots, x_n y_n)$$

Vector Dot Product

Let’s do something more interesting: computing the vector dot product. Mathematically, this is expressed as:

$$(x_1, x_2, \dots, x_n) \cdot (y_1, y_2, \dots, y_n) = x_1 y_1 + x_2 y_2 + \dots + x_n y_n$$

Note that this resembles a linear combination and can be used to express the linear regression machine learning model. Using TensorFlow, the code looks as follows:

a = tf.constant([1., 2., 2., 3.])
b = tf.constant([4., 5., 5., 6.])
c = tf.tensordot(a, b, axes=1)

This code takes our two tensors (a and b) and applies the dot product function to them. The result is again assigned to tensor c. Please note that the axes argument is mandatory; it defines how far higher-order tensors are collapsed. More on that later. Mathematically, the following is computed:

$$(1, 2, 2, 3) \cdot (4, 5, 5, 6) = 1 \cdot 4 + 2 \cdot 5 + 2 \cdot 5 + 3 \cdot 6 = 42$$

Matrix-Vector Product

If we further climb up the tensor rank ladder, we can add the matrix into the mix. Mathematically, we can compute the dot product between a matrix and a vector as follows:

$$\begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{pmatrix} \cdot \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} x_{11} y_1 + x_{12} y_2 + \cdots + x_{1n} y_n \\ x_{21} y_1 + x_{22} y_2 + \cdots + x_{2n} y_n \\ \vdots \\ x_{m1} y_1 + x_{m2} y_2 + \cdots + x_{mn} y_n \end{pmatrix}$$

In TensorFlow we implement this as follows:

a = tf.constant([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]])
b = tf.constant([1., 2., 3.])
c = tf.tensordot(a, b, axes=1)

Please note that a became a matrix instead of a vector. This doesn't mean we need to change the function we apply in order to compute the matrix-vector dot product. The tensordot function accepts tensors of arbitrary rank and, as long as the operation is mathematically sound, it is executed. One example of mathematical soundness is the requirement that the number of columns in the matrix must match the number of elements in the vector.
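To see what happens when this requirement is violated, consider the following small sketch (the exact exception type raised may vary, so we catch the two common ones):

a = tf.constant([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]])
b = tf.constant([1., 2.])  # only two elements; the matrix has three columns

try:
    tf.tensordot(a, b, axes=1)
except (tf.errors.InvalidArgumentError, ValueError) as e:
    print("not mathematically sound:", e)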

Tip

In the context of tensors, rank equals dimensionality or order. These terms are often confused because they also describe properties of vectors and matrices, where they have different mathematical meanings.

For completeness, let’s have a look at the math for this particular example:

$$\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix} \cdot \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix} = \begin{pmatrix} 1 \cdot 1 + 2 \cdot 2 + 3 \cdot 3 \\ 4 \cdot 1 + 5 \cdot 2 + 6 \cdot 3 \\ 7 \cdot 1 + 8 \cdot 2 + 9 \cdot 3 \end{pmatrix} = \begin{pmatrix} 14 \\ 32 \\ 50 \end{pmatrix}$$

As you can see, taking a matrix-vector dot product is the same as taking a vector-vector dot product, if you consider each row of the matrix to be a vector. Therefore, the result of this computation is a vector that has the same number of elements as there are rows in the matrix.

Matrix-Matrix Product

The matrix-matrix multiplication is defined as follows:

$$\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix} \cdot \begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1p} \\ b_{21} & b_{22} & \cdots & b_{2p} \\ \vdots & & & \vdots \\ b_{n1} & b_{n2} & \cdots & b_{np} \end{pmatrix} =$$

$$\begin{pmatrix} a_{11} b_{11} + a_{12} b_{21} + \cdots + a_{1n} b_{n1} & \cdots & a_{11} b_{1p} + a_{12} b_{2p} + \cdots + a_{1n} b_{np} \\ a_{21} b_{11} + a_{22} b_{21} + \cdots + a_{2n} b_{n1} & \cdots & a_{21} b_{1p} + a_{22} b_{2p} + \cdots + a_{2n} b_{np} \\ \vdots & & \vdots \\ a_{m1} b_{11} + a_{m2} b_{21} + \cdots + a_{mn} b_{n1} & \cdots & a_{m1} b_{1p} + a_{m2} b_{2p} + \cdots + a_{mn} b_{np} \end{pmatrix}$$

Have a look at the individual components on the right-hand side of the equation. If you look at the first column of the result matrix, you'll notice that it corresponds to the matrix-vector product of the first matrix with the first column of the second matrix, treating that column as a vector. In other words, a matrix-matrix product is a set of matrix-vector products, one for each column of the second matrix.

In TensorFlow, the code looks as follows:

a = tf.constant([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]])
b = tf.constant([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]])
c = tf.tensordot(a, b, axes=1)

The result of this code is the following matrix:

[[ 30  36  42]
 [ 66  81  96]
 [102 126 150]]

Please take some random entries of that matrix and see if you can boil it down to the vector dot product of two vectors.
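If you want the computer to do that check for you, here is one way (a small sketch reusing the tensors from above); entry (1, 0) of the result is the dot product of row 1 of a and column 0 of b:

row = a[1, :]   # [4., 5., 6.]
col = b[:, 0]   # [1., 4., 7.]
print(tf.tensordot(row, col, axes=1))    # tf.Tensor(66.0, ...): 4*1 + 5*4 + 6*7
print(tf.tensordot(a, b, axes=1)[1, 0])  # the same entry of the product matrix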

Understanding the Axes Parameter

The axes parameter of the tensordot function is very interesting. It controls how the two tensors are combined to form a new tensor. If axes is set to one, the behavior described in the previous sections is performed. But the tensordot function offers additional semantics: axes controls how far the two input tensors are collapsed, that is, how far the dimensionality of the result is reduced. For example, if axes is zero, no additions take place, only multiplications.
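The following sketch (with small example values made up for illustration) shows the three settings on two 2×2 matrices:

a = tf.constant([[1., 2.], [3., 4.]])
b = tf.constant([[5., 6.], [7., 8.]])

print(tf.tensordot(a, b, axes=0).shape)  # (2, 2, 2, 2): outer product, multiplications only
print(tf.tensordot(a, b, axes=1).shape)  # (2, 2): the ordinary matrix-matrix product
print(tf.tensordot(a, b, axes=2))        # scalar: 1*5 + 2*6 + 3*7 + 4*8 = 70.0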

Implementing Machine Learning Algorithms with TensorFlow

As we now have a good grasp of what is going on behind the scenes when using TensorFlow for linear algebra operations, we can use what we’ve learned to implement some machine learning algorithms. Let’s start with linear regression.

Linear Regression Using TensorFlow

The linear regression model is a simple vector dot product. So let’s implement linear regression. We start by creating the data:

import numpy as np
data = np.array(
    [
        [100,35,35,12,0.32],
        [101,46,35,21,0.34],
        [130,56,46,3412,12.42],
        [131,58,48,3542,13.43]
    ]
)

This is a realistic example, since most of the time you'll get your data in a tabular format like this, as if it had been read from a database. Therefore, we need to separate the training data from the label:

x = data[:,1:-1]       # all rows; drop the first column (not used as a feature) and the last
y_target = data[:,-1]  # all rows; the last column is the target value

We have a NumPy array x containing the training data or features and a NumPy array y_target containing the target value for training.

The weights have to be created. For simplicity, we create a separate bias parameter and reserve the combined version of the model for later:

b = tf.Variable(1, dtype=tf.float64)
w = tf.Variable([1, 1, 1], dtype=tf.float64)

Tip

As you can see, we are using the tf.Variable constructor to create a TensorFlow variable, here of type ResourceVariable, since we are already using TensorFlow 2.x; otherwise the type would be Variable. More on that in Chapter 3.

Using a TensorFlow variable tells TensorFlow that the value it holds will later have to be changed by an optimizer in order to fit the machine learning model to a dataset. Therefore, a TensorFlow variable should be used for all weights. Now let's use the weights to express the linear regression model in TensorFlow:

def linear_model(x):
    return b + tf.tensordot(x,w,axes=1)

Again, the mathematical definition of linear regression is $w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$. Note that we've replaced $w_0$ with b for now; making the bias part of the weight matrix (the combined version we reserved for later) facilitates parallelization of execution on vector processing units (like GPUs), but this is just notation, and mathematically it's the same. As you can see, we don't sum over individual values in the TensorFlow code; we use the vector dot product that linear algebra defines. So, as a recap: if we have the vectors $w = (w_1, w_2, \dots, w_n)$ and $x = (x_1, x_2, \dots, x_n)$, the preceding Python code equals $w_0 + w \cdot x$.

We are now done expressing the linear regression model using TensorFlow. But it would be nice to automatically adjust the parameters so that the model performs best. Right now the model generates meaningless predictions, since we've just initialized all parameters with the value one.
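You can see this by evaluating the model once. With the all-ones initialization from above, the output should look roughly like this:

print(linear_model(x))
# tf.Tensor([  83.  103. 3515. 3649.], shape=(4,), dtype=float64)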

Training a Model Using TensorFlow

The more you see and hear about the topic, the better, so be prepared to go over the code and explanations multiple times.

Before we can train any model, we need to introduce a tool for measuring how well or how badly we are currently doing, given the current values of the weight parameters.

The loss function

Such a tool for measuring our model's performance is called a cost or loss function. For regression, the most prominent loss function is the root mean squared error (RMSE), which is defined as:

$$\mathit{rmse}(y\_predicted, y\_target) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y\_predicted_i - y\_target_i \right)^2}$$

Let's try to understand what's going on here. First of all, we need to define what an error is: the difference between the predicted value and the real value, $y\_predicted_i - y\_target_i$. We square this error for two reasons. First, squaring makes sure the error is always positive. Second, it penalizes larger errors, which forces the optimizer to choose weights that favor a larger number of data points with small errors over a small number of data points with huge errors: $(y\_predicted_i - y\_target_i)^2$. As you can see, $y\_predicted_i$ and $y\_target_i$ are indexed by $i$. This means we look at each individual row of the training table: we take the target value $y\_target_i$, and we compute $y\_predicted_i$ by applying linear_model() to the x values of the same row. Then we sum over all the squared errors, $\sum_{i=1}^{n}$. Finally, we normalize this value by dividing by the number of rows in our training table, $\frac{1}{n}$, and take the square root of the whole term.
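Translated one-to-one into TensorFlow, the formula might look like the following minimal sketch (for illustration only; in a moment we'll use a loss implementation that TensorFlow ships instead):

def rmse(y_predicted, y_target):
    # square the per-row errors, average them, then take the square root
    return tf.sqrt(tf.reduce_mean(tf.square(y_predicted - y_target)))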

In our example, we use a slight variation of this loss function called mean squared logarithmic error, which gives us a slightly better performance on this dataset. Sometimes it is hard to explain the effect of such choices, and trial and error is an accepted practice in the field. The mean squared logarithmic error function is defined as:

$$\mathit{msle}(y\_predicted, y\_target) = \frac{1}{n} \sum_{i=1}^{n} \left( \log(y\_predicted_i + 1) - \log(y\_target_i + 1) \right)^2$$

We don’t have to implement it because we can use an implementation that TensorFlow provides:

loss_object = tf.keras.losses.MeanSquaredLogarithmicError()

Tip

The reason Keras appears in the package path is that Keras has been integrated into TensorFlow and is now its official high-level API.
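To illustrate the call signature: a Keras loss object takes the target values first and the predictions second, and returns a single scalar tensor. For example:

loss_value = loss_object(y_target, linear_model(x))  # y_true first, y_pred second
print(loss_value)  # a scalar tensor; large for our untrained model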

The optimizer and gradient descent

The optimizer is the heart of any machine learning training process. The optimizer takes the loss function and adjusts the weights until the loss function is minimized. We’re not covering the internals of the optimizer at this stage, but the following code creates it:

optimizer = tf.keras.optimizers.Adam()

Then, the following code adjusts the weights:

def train_step(x, y):
    with tf.GradientTape() as tape:
        predicted = linear_model(x)                # forward pass with the current weights
        loss_value = loss_object(y, predicted)     # how far off are we?
    grads = tape.gradient(loss_value, [b, w])      # gradients of the loss w.r.t. b and w
    optimizer.apply_gradients(zip(grads, [b, w]))  # adjust the weights accordingly

Because this is possibly one of the most low-level ways of training a model, we won't cover GradientTape in detail here. In short, GradientTape is TensorFlow's way of recording every operation applied to the watched variables within its scope on a "tape," so that the first derivative of the overall loss with respect to those variables can be computed using automatic differentiation (see Chapter 3, tf.function and AutoGraph, for more detail). The train_step function performs one weight adjustment each time it is called, based on the gradients obtained from GradientTape. With predicted = linear_model(x) we obtain the model's current best guess given the current weight values. Using that guess, we compute the loss between the real value y and the predicted value: loss_value = loss_object(y, predicted). Then, grads = tape.gradient(loss_value, [b, w]) computes the gradients, which tell the optimizer the direction and, appropriately scaled, the tiny amount by which each weight has to be adjusted. Finally, optimizer.apply_gradients(zip(grads, [b, w])) applies those tiny adjustments to the weights.
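If GradientTape is new to you, a minimal standalone example (with toy values chosen purely for illustration) may help:

v = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = v * v                # y = v^2 is recorded on the tape
print(tape.gradient(y, v))   # tf.Tensor(6.0, ...): dy/dv = 2v = 6 at v = 3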

Finally, we call the train_step function 1,000 times:

for epoch in range(1000):
    train_step(x, y_target)

After the train_step function has been called 1,000 times, and the weights have therefore been adjusted in tiny steps 1,000 times as well, the loss or error of the model becomes very small, and the linear regression model is now trained.
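As a quick sanity check, you can evaluate the loss once more after training (the exact number depends on the run, but it should be much smaller than before training):

print(loss_object(y_target, linear_model(x)))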

And that's it! That is the gradient descent algorithm, which is used to train nearly every neural network and lots of traditional (non-deep-learning) machine learning models.

Neural Networks Using TensorFlow

To complete our “Linear Algebra with TensorFlow” journey, let’s create a neural network with TensorFlow. Most of the ingredients we already have.

So let’s have a look at the following code, which implements a single hidden layer neural network:

tf.sigmoid(
   tf.tensordot(
      tf.sigmoid(
         tf.tensordot(
            x,
            w1,
            axes=1)),
      w2,
      axes=1
   )
)
Tip

We are using sigmoid here as a so-called activation function. Each layer in a neural network has an activation function. We're using sigmoid to keep the example as similar as possible to logistic regression. Note, though, that among the most prominent activation functions (sigmoid, tanh, softmax, linear, relu, and leaky_relu), experience suggests using softmax in the output layer for classification problems and linear in the output layer for regression problems. For all other layers, we recommend trying relu first; to increase performance, try leaky_relu next, and then the remaining activation functions, except for sigmoid, softmax, and linear.

To understand this code, let’s separate the two layers:

def layer1(x):
    return tf.sigmoid(tf.tensordot(x,w1,axes=1))

def layer2(x):
    return tf.sigmoid(tf.tensordot(layer1(x),w2,axes=1))

To understand what’s going on, let’s look at the topology of the neural network, which is implemented by the two functions illustrated in Figure 1-1.

Figure 1-1. Neural network design

Layer 2 of this neural network corresponds exactly to a logistic regression function on three inputs. This means weight vector w2 has exactly three elements. Layer 1 contains three logistic regression functions, which need to be applied in parallel, because the computed value of each neuron in layer 1 becomes an input of every neuron in layer 2, hence the term fully connected layer. Let's look at how this is done.

Since layer 2 expects three inputs, the upstream layer 1 needs to create three outputs. Since logistic regression only creates a single output, three logistic regression models—in this case also known as neurons—need to be computed in parallel.

Therefore, weight matrix w1 needs to have nine elements in total: three per neuron (a neuron can also be seen as a node in the network). The ideal data structure for this is a matrix. Let's look at the already trained weight matrix w1:

[
   [ 0.44021805,  0.44021805,  0.44021805],
   [ 0.70261531,  0.70261531,  0.70261531],
   [-6.3130148 , -6.3130148 , -6.3130148 ]
]

If we now apply the computation $x \cdot w1$ expressed by tf.tensordot(x, w1, axes=1), we notice that x is a matrix as well:

[
   [0.00711406, 0.00711406, 0.0024391 ],
   [0.0093499 , 0.00711406, 0.00426843],
   [0.01138249, 0.0093499 , 0.6935188 ],
   [0.01178901, 0.00975642, 0.71994243]
]

Since we know how matrix-matrix multiplication works, it is straightforward to compute the activations (we call the output of neurons activations) of all three neurons of layer 1 for all data points using just a single expression:

tf.sigmoid(tf.tensordot(x,w1,axes=1))

Just for completeness, the result of this computation, given the weights and data shown previously, is:

[
   [0.49818303, 0.49818303, 0.49818303],
   [0.49554206, 0.49554206, 0.49554206],
   [0.01253503, 0.01253503, 0.01253503],
   [0.01063448, 0.01063448, 0.01063448]
]

We can easily cross-check for correctness. Let’s compute the top left element of the activation matrix of layer 1, which is defined as:

$$\mathrm{sigmoid}(x_{11} \, w1_{11} + x_{12} \, w1_{21} + \cdots + x_{1n} \, w1_{n1})$$

This boils down to:

$$\mathrm{sigmoid}(0.00711406 \cdot 0.44021805 + 0.00711406 \cdot 0.70261531 + 0.0024391 \cdot (-6.3130148)) = \mathrm{sigmoid}(-0.0072678893) = \frac{1}{1 + e^{0.0072678893}} = 0.498183035673$$

So we’re lucky: the spot check worked out.
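If you'd rather let the computer do the spot check, a small sketch like the following works (assuming w1 and x hold the values shown above as NumPy arrays):

import numpy as np

z = np.dot(x[0], w1[:, 0])   # first data point times the weights of neuron 0
print(1 / (1 + np.exp(-z)))  # sigmoid(z) = 0.498183..., the top-left activation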

Keras

You've seen that we've defined individual layer functions. Now we're going to use layers to define a neural network: it makes sense to stop working with raw tensors and linear algebra and to step up the ladder a bit, using abstract building blocks, that is, layer components instead of weight matrices.

This is where Keras kicks in. Keras was first released on March 27, 2015, by François Chollet. It consists of a set of predefined layers of all types, which allows for rapidly creating and changing neural networks without worrying about matrices, vectors, their shapes, and the underlying computations. Although other ways exist, Keras lets you define a neural network layer by layer, computation by computation. Therefore, in Keras, even an activation function can be expressed as an individual layer. Let's reproduce the preceding example using Keras:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(3, input_shape=(3,), activation='sigmoid'))
model.add(Dense(1, activation='sigmoid'))

The first thing you notice is that there are no tensors involved in this definition. You start the model definition process by instantiating an object of the Sequential class. Then you add layers. We stick with the fully connected layers we've been using so far; in Keras, these are called Dense. The first argument to the constructor is the number of neurons, so again, layer 1 has three neurons here. Layer 1 also needs to know the shape of the input data; therefore, we specify that we have three input columns (or dimensions). Then we specify the activation function; again, we are using sigmoid here. To follow our previous examples, we add another layer with a single neuron, also applying the sigmoid function. So that's it: our neural network is defined. This completely removes the burden of understanding (and cognitively processing) linear algebra semantics during the definition and execution of neural networks.
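One nice side effect of the layer abstraction is that Keras can count the parameters for us. Note that Dense layers include a bias term per neuron by default, which our hand-rolled weight matrices did not:

model.summary()
# Layer 1: 3 inputs * 3 neurons + 3 biases = 12 parameters
# Layer 2: 3 inputs * 1 neuron  + 1 bias   =  4 parameters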

Now we just need to add some training-specific code and we are done:

from tensorflow.keras.optimizers import SGD

model.compile(optimizer=SGD(learning_rate=0.1),
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.fit(x, y_target, epochs=1000,
          verbose=1)

Here again, we specify the optimizer and the loss function and finally train for 1,000 epochs.

Tip

An epoch, in neural network training, is equivalent to presenting the whole training dataset to the neural network exactly once.

Summary

You’ve learned about two important APIs that TensorFlow provides: the low-level, tensor-based API and the high-level, layer-based API (Keras). In addition, as you’ve learned about how the TensorFlow low-level API works, you’ve also been introduced to the most fundamental concepts of linear algebra. This will greatly help you to understand any neural network code.
