Connectors. (source: Pixabay)

TensorFlow has gathered quite a bit of attention as the new hot toolkit for building neural networks. To the beginner, it may seem the only thing that rivals this interest is the number of different APIs that you can use. In this article, we go over a few of them, building the same neural network each time. We start with low-level TensorFlow math, and then show how to simplify that code with TensorFlow's layer API. We also discuss two libraries built on top of TensorFlow: TFLearn and Keras.

The MNIST database is a collection of handwritten digits. Each is recorded in a 28x28 pixel grayscale image. We build a two-layer perceptron network to classify each image as a digit from zero to nine. The first layer will fully connect the 784 inputs to 64 hidden neurons, using a sigmoid activation. The second layer will connect those hidden neurons to 10 outputs, scaled with the softmax function. The network will be trained with stochastic gradient descent, on minibatches of 64, for 20 epochs. (These values are chosen not because they are the best, but because they produce reasonable results in a reasonable time.)
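As a quick sanity check on the architecture just described, the sketch below counts the inputs and trainable parameters in plain Python. The names (`dense_param_count` and friends) are illustrative only and do not come from any library.

```python
# Sizes taken from the description above.
n_pixels = 28 * 28      # each 28x28 image is flattened into 784 inputs
hidden_size = 64
n_classes = 10


def dense_param_count(n_in, n_out):
    """One weight per (input, output) pair, plus one bias per output."""
    return n_in * n_out + n_out


layer1 = dense_param_count(n_pixels, hidden_size)
layer2 = dense_param_count(hidden_size, n_classes)
total = layer1 + layer2

print(n_pixels, layer1, layer2, total)  # 784 50240 650 50890
```

So this small network already has about fifty thousand numbers to learn, almost all of them in the first layer.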

We'll start by loading the modules and the data, as well as setting up some constants we'll use repeatedly.

import numpy as np
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets('/tmp/data', one_hot=True)

Xtrain = mnist.train.images
ytrain = mnist.train.labels
Xtest = mnist.test.images
ytest = mnist.test.labels

N_PIXELS = 28 * 28
N_CLASSES = 10
HIDDEN_SIZE = 64
EPOCHS = 20
BATCH_SIZE = 64

sess = tf.Session()

Raw TensorFlow

At its heart, TensorFlow is just a tool for assembling and evaluating computational graphs. Thus, the most basic way to use TensorFlow is to set up the calculation by hand.

Let's start by setting up placeholders for the features and labels. These record the shape and datatype of the data to be fed in. Note that the first dimension has size None, which indicates that it can take an arbitrary number of observations.

x = tf.placeholder(tf.float32, [None, N_PIXELS], name="pixels")
y_label = tf.placeholder(tf.float32, [None, N_CLASSES], name="labels")

In the first layer, the input features (pixel intensities) are multiplied by a weight matrix of size N_PIXELS x HIDDEN_SIZE. The weights are stored in a variable, which is a TensorFlow data structure that holds state that can be updated during the training.

A bias term is added to this, and the result is sent through a sigmoid activation function.

W1 = tf.Variable(tf.truncated_normal([N_PIXELS, HIDDEN_SIZE],
                                     stddev=N_PIXELS**-0.5))
b1 = tf.Variable(tf.zeros([HIDDEN_SIZE]))

hidden = tf.nn.sigmoid(tf.matmul(x, W1) + b1)
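To make the arithmetic of this layer concrete, here is a minimal pure-Python sketch of the same computation (matrix multiply, add bias, apply sigmoid) on a tiny hand-picked example. The function `dense_sigmoid` and the toy numbers are our own illustration, not TensorFlow code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense_sigmoid(x, W, b):
    """x: batch of rows (n_obs x n_in), W: n_in x n_out, b: n_out."""
    out = []
    for row in x:
        # Weighted sum of the inputs for each output neuron, plus its bias.
        z = [sum(row[i] * W[i][j] for i in range(len(row))) + b[j]
             for j in range(len(b))]
        out.append([sigmoid(v) for v in z])
    return out


x = [[1.0, 2.0, 3.0]]                          # one observation, three features
W = [[0.5, -1.0], [0.0, 1.0], [-0.5, 0.0]]     # 3 inputs -> 2 hidden neurons
b = [1.0, -1.0]

toy_hidden = dense_sigmoid(x, W, b)
print(toy_hidden)  # [[0.5, 0.5]]
```

Both weighted sums come out to exactly zero here, and the sigmoid of zero is 0.5, which makes the result easy to check by hand.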

The second layer has its own set of weights and biases, sized to give us 10 outputs, one for each class. We don't apply an activation function to this output...

W2 = tf.Variable(tf.truncated_normal([HIDDEN_SIZE, N_CLASSES],
                                     stddev=HIDDEN_SIZE**-0.5))
b2 = tf.Variable(tf.zeros([N_CLASSES]))

y = tf.matmul(hidden, W2) + b2

...because TensorFlow provides a loss function that includes the softmax activation. (Doing it this way allows it to avoid floating-point issues for probabilities close to 0 or 1.) This loss function calculates the cross entropy directly from the logits, the input to the softmax function. The ground truth values will be input as y_label.

loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=y, labels=y_label))
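To see why working directly from the logits helps with floating-point issues, the sketch below compares a naive softmax-then-log calculation with one based on the log-sum-exp trick. This is only an illustration of the idea in plain Python; we are not claiming it mirrors TensorFlow's internal implementation, and both function names are our own.

```python
import math


def naive_cross_entropy(logits, one_hot):
    # Exponentiate first, then normalize and take the log.
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return -sum(t * math.log(e / total) for t, e in zip(one_hot, exps))


def cross_entropy_from_logits(logits, one_hot):
    # log(softmax(z)_i) = z_i - logsumexp(z); shifting by the max keeps
    # every exponent at or below zero, so nothing can overflow.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(t * (z - log_sum_exp) for t, z in zip(one_hot, logits))


label = [1.0, 0.0, 0.0]

small = [2.0, 1.0, 0.1]
print(naive_cross_entropy(small, label))
print(cross_entropy_from_logits(small, label))

large = [1000.0, 0.0, 0.0]
try:
    naive_cross_entropy(large, label)
    naive_failed = False
except OverflowError:
    naive_failed = True
    print("naive version overflowed")
print(cross_entropy_from_logits(large, label))
```

For moderate logits the two versions agree; for extreme logits only the second one still returns a usable number.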

The cross entropy is useful for training because it rewards steps that improve the confidence of predictions, even if they don't change the actual predictions. It can be a bit difficult to understand, so we'll also compute the accuracy, the fraction of predictions we got correct.

accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(y, 1),
                                           tf.argmax(y_label, 1)),
                                  tf.float32))
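The difference between the two measures is easy to demonstrate on made-up numbers. In the sketch below, two hypothetical sets of predicted probabilities pick the same classes, so their accuracy is identical, but the more confident set has a lower cross entropy. For brevity, the labels are given as class indices rather than the one-hot vectors used in the article.

```python
import math


def accuracy(probs, labels):
    # A prediction is the index of the largest probability.
    hits = sum(1 for p, t in zip(probs, labels) if p.index(max(p)) == t)
    return hits / float(len(labels))


def mean_cross_entropy(probs, labels):
    # Average of -log(probability assigned to the true class).
    return -sum(math.log(p[t]) for p, t in zip(probs, labels)) / len(labels)


labels = [0, 1, 1]
hesitant = [[0.55, 0.45], [0.40, 0.60], [0.60, 0.40]]
confident = [[0.90, 0.10], [0.10, 0.90], [0.60, 0.40]]

print(accuracy(hesitant, labels), accuracy(confident, labels))
print(mean_cross_entropy(hesitant, labels), mean_cross_entropy(confident, labels))
```

Both sets get two of the three examples right, yet the cross entropy drops when the correct predictions become more confident, which gives the optimizer something to work with.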

All that's left to do is run the training process. Gradient descent is a simple optimization scheme that updates the value of a parameter based on the gradient of a loss function with respect to that parameter. Because TensorFlow is working from a computational graph, it can work out all the variables that contribute to the loss tensor, and it can figure out how to update those variables to reduce the value of loss. Those update rules are stored in sgd.
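The update rule itself is short enough to write out by hand. The following sketch applies it to a one-parameter toy loss, (w - 3)^2, whose gradient we supply ourselves; TensorFlow derives the gradients from the graph instead. The function name, the toy loss, and the learning rate of 0.1 are all assumptions made for the illustration.

```python
def gradient_descent(grad, w0, learning_rate, steps):
    """Repeatedly step against the gradient: w <- w - learning_rate * grad(w)."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w


# Toy loss (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3.
w_final = gradient_descent(lambda w: 2.0 * (w - 3.0),
                           w0=0.0, learning_rate=0.1, steps=100)
print(w_final)
```

Each step shrinks the distance to the minimum by a constant factor, so after 100 steps the parameter is very close to 3.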

It is up to us to run these update rules a number of times. We've chosen to run for 20 epochs (cycles through the full training data), with randomly chosen batches of 64 training examples for each step. After each epoch, we print out the loss and accuracy of the model on the test data.

sgd = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
sess.run(tf.global_variables_initializer())
inds = np.arange(Xtrain.shape[0])  # an array, so it can be shuffled in place

for i in range(EPOCHS):
    np.random.shuffle(inds)  # new random batches every epoch
    for j in range(0, len(inds), BATCH_SIZE):
        batch = inds[j:j+BATCH_SIZE]
        sess.run(sgd, feed_dict={x: Xtrain[batch],
                                 y_label: ytrain[batch]})

    # Loss and accuracy on the test data after each epoch
    print(sess.run([loss, accuracy], feed_dict={x: Xtest, y_label: ytest}))

[0.25975966, 0.92549998]
[0.21270864, 0.9357]
[0.19168039, 0.9404]
[0.15652786, 0.95499998]
[0.1340636, 0.96210003]
[0.12281641, 0.96340001]
[0.11627591, 0.96520001]
[0.10932469, 0.96569997]
[0.10292228, 0.9684]
[0.10111527, 0.96820003]
[0.095439233, 0.97079998]
[0.098685279, 0.97079998]
[0.091198094, 0.97259998]
[0.087996222, 0.97399998]
[0.089517325, 0.972]
[0.087211363, 0.97350001]
[0.08782991, 0.97359997]
[0.084821992, 0.9756]
[0.082820401, 0.97539997]
[0.08967527, 0.97439998]

This runs in about 20 seconds on an unremarkable laptop, and gets us an accuracy of over 97%.
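The batching logic in the training loop can be sketched on its own, without TensorFlow. The helper `minibatch_indices` below is a hypothetical name of ours; the training-set size of 55,000 is the one reported in the training logs later in this article.

```python
import random


def minibatch_indices(n_samples, batch_size, rng):
    """Shuffle the sample indices, then yield them in consecutive slices."""
    inds = list(range(n_samples))
    rng.shuffle(inds)
    for start in range(0, n_samples, batch_size):
        yield inds[start:start + batch_size]


rng = random.Random(0)
batches = list(minibatch_indices(55000, 64, rng))

steps_per_epoch = len(batches)
last_batch_size = len(batches[-1])
print(steps_per_epoch, last_batch_size, steps_per_epoch * 20)  # 860 24 17200
```

Since 55,000 is not a multiple of 64, the final batch of each epoch holds only 24 examples, and 20 epochs add up to 17,200 gradient steps.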

The Layer API

This was quite a bit of work for a relatively simple network. Each layer required us to set up weight and bias variables of the right shape, do some matrix math, and apply an activation function. That work will basically be the same each time we need a new layer, so the TensorFlow Layers API abstracts that work into a single function call.

We use the same placeholders, x and y_label, as before. Now, we can create the hidden layer with a single line.

hidden = tf.layers.dense(x, HIDDEN_SIZE,
                         activation=tf.nn.sigmoid,
                         use_bias=True,
                         kernel_initializer=tf.truncated_normal_initializer(stddev=N_PIXELS**-0.5))

Because TensorFlow knows the shape of x, it can work out the size of the weight matrix that is needed. With use_bias=True, bias variables are created as well. The activation function can be specified, and the kernel_initializer gives a function to initialize the weight matrix. (The bias is initialized to zero by default.)
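As a rough picture of what such a call has to do behind the scenes, the sketch below infers the kernel shape from the input and fills it from a truncated normal distribution, which redraws any sample lying more than two standard deviations from the mean. The helper names are hypothetical, and this is a toy stand-in rather than the tf.layers implementation.

```python
import random


def truncated_normal(stddev, rng):
    # Redraw until the sample lies within two standard deviations of zero.
    while True:
        value = rng.gauss(0.0, stddev)
        if abs(value) <= 2.0 * stddev:
            return value


def make_dense_params(inputs, units, stddev, rng):
    n_in = len(inputs[0])  # input width is read off the data, not passed in
    kernel = [[truncated_normal(stddev, rng) for _ in range(units)]
              for _ in range(n_in)]
    bias = [0.0] * units   # biases start at zero
    return kernel, bias


rng = random.Random(42)
batch = [[0.0] * 784]      # a single fake flattened image
stddev = 784 ** -0.5       # 1/28, the value used for the hidden layer
kernel, bias = make_dense_params(batch, 64, stddev, rng)

print(len(kernel), len(kernel[0]), len(bias))  # 784 64 64
```

Only the number of output units had to be specified; the 784 came from the input itself.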

The output layer works much the same way, with the exception of no activation being applied. (Once again, we'll use the loss function that applies the softmax activation itself.)

y = tf.layers.dense(hidden, N_CLASSES,
                    activation=None,
                    use_bias=True,
                    kernel_initializer=tf.truncated_normal_initializer(stddev=HIDDEN_SIZE**-0.5))

The loss and accuracy are defined in the same way as before.

loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=y, labels=y_label))
accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(y, 1),
                                           tf.argmax(y_label, 1)),
                                  tf.float32))

We are still responsible for running the minimization process by hand. The code is identical to the previous example, as is the performance.

sgd = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
sess.run(tf.global_variables_initializer())
inds = np.arange(Xtrain.shape[0])  # an array, so it can be shuffled in place

for i in range(EPOCHS):
    np.random.shuffle(inds)  # new random batches every epoch
    for j in range(0, len(inds), BATCH_SIZE):
        batch = inds[j:j+BATCH_SIZE]
        sess.run(sgd, feed_dict={x: Xtrain[batch],
                                 y_label: ytrain[batch]})

    # Loss and accuracy on the test data after each epoch
    print(sess.run([loss, accuracy], feed_dict={x: Xtest, y_label: ytest}))

[0.25065958, 0.92699999]
[0.19859588, 0.94059998]
[0.16099697, 0.95230001]
[0.14838797, 0.95529997]
[0.12807436, 0.96179998]
[0.1130126, 0.9659]
[0.11923214, 0.96280003]
[0.10029678, 0.9691]
[0.094030492, 0.97079998]
[0.093222573, 0.97180003]
[0.093066469, 0.97109997]
[0.090347208, 0.97219998]
[0.084357627, 0.9727]
[0.084034808, 0.97359997]
[0.080353431, 0.97479999]
[0.079443201, 0.97469997]
[0.080090113, 0.97659999]
[0.077046707, 0.97640002]
[0.078963876, 0.9752]
[0.077811748, 0.97640002]

TFLearn

The layer API still requires us to deal with low-level details of the optimization scheme. A number of projects attempt to provide a higher-level syntax, more reminiscent of scikit-learn estimators. One such project is TFLearn (which should not be confused with tensorflow.contrib.learn).

import tflearn

With TFLearn, we don't have to worry about setting up placeholders or variables to hold values. Instead, we create a structure for our input features with input_data.

x = tflearn.input_data(shape=[None, N_PIXELS], name="pixels")

As with the layer API, we can create each layer with a single function call. As before, we must specify the input tensor, the number of neurons, and the activation function. Note that we include the softmax activation function on the output layer in this case.

hidden = tflearn.fully_connected(x, HIDDEN_SIZE, activation="sigmoid")
y = tflearn.fully_connected(hidden, N_CLASSES, activation="softmax")

The tflearn.regression layer abstracts away many of the details of the regression model. Instead of creating our own loss function, accuracy measure, and optimization step, we simply specify that the network should be optimizing "categorical_crossentropy" using a stochastic gradient descent technique.

network = tflearn.regression(y,
                             optimizer=tflearn.SGD(learning_rate=0.5),
                             loss="categorical_crossentropy")
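Applying softmax in the output layer and then a categorical cross entropy on the resulting probabilities is, mathematically, the same quantity as the fused logits-based loss used in the earlier sections. The plain-Python sketch below checks this on one made-up example; the function names are ours, and we assume (as is conventional) that a loss named "categorical_crossentropy" expects probabilities rather than logits.

```python
import math


def softmax(logits):
    m = max(logits)                       # shift for numerical safety
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(probs, one_hot):
    # Operates on probabilities, i.e. the output of a softmax layer.
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


def cross_entropy_from_logits(logits, one_hot):
    # Fused version: works on the raw layer outputs.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(t * (z - log_sum_exp) for t, z in zip(one_hot, logits))


logits = [2.0, 1.0, 0.1]
label = [1.0, 0.0, 0.0]

probs = softmax(logits)
two_step = categorical_crossentropy(probs, label)
fused = cross_entropy_from_logits(logits, label)
print(two_step, fused)
```

The two values agree up to rounding, so moving the softmax into the network does not change what is being optimized.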

Finally, we create a model from this network. This model has the .fit() and .predict() methods that we're used to from scikit-learn. In addition to the training data, the fit method accepts other arguments specifying the details of the optimization scheme. By including a validation set, we get reports on the model's performance on the test data once per epoch.

model = tflearn.DNN(network)
model.fit(Xtrain, ytrain,
          n_epoch=EPOCHS,
          batch_size=BATCH_SIZE,
          validation_set=(Xtest, ytest),
          show_metric=True)

Training Step: 17199 | total loss: 0.40548 | time: 3.261s
| SGD | epoch: 020 | loss: 0.40548 - acc: 0.9569 -- iter: 54976/55000
Training Step: 17200 | total loss: 0.36858 | time: 4.381s
| SGD | epoch: 020 | loss: 0.36858 - acc: 0.9612 | val_loss: 0.08270 - val_acc: 0.9739 -- iter: 55000/55000
--

The performance is very similar to the previous approaches, with a validation cross entropy of about 0.08 and 97% accuracy. There are some small differences due to different initializations of the weights, as well as the random choice of batches, but the underlying algorithm is the same.

Keras

Like TFLearn, Keras provides a high-level API for creating neural networks. It is backend-agnostic, running on top of CNTK and Theano in addition to TensorFlow. Nonetheless, it was recently added to the tensorflow.contrib namespace.

from tensorflow.contrib import keras

In Keras, we start with the model object. This specifies how the layers should be laid out. Here, a Sequential model indicates that the layers are to be connected in order.

model = keras.models.Sequential()

The layers are added to the model in order. We need to specify the input dimension on the first layer, but Keras is able to work out the input dimension to the second layer from the output size of the first.

model.add(keras.layers.Dense(HIDDEN_SIZE, activation='sigmoid', input_dim=N_PIXELS))
model.add(keras.layers.Dense(N_CLASSES, activation='softmax'))
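The way a sequential model can chain layer sizes is simple to mimic. The toy class below is a hypothetical stand-in, not the Keras class: it only tracks kernel shapes, requiring an explicit input size for the first layer and reusing the previous layer's output size afterwards.

```python
class ToySequential(object):
    """Tracks kernel shapes only; a stand-in for illustration, not Keras."""

    def __init__(self):
        self.kernel_shapes = []

    def add_dense(self, units, input_dim=None):
        if self.kernel_shapes:
            # Later layers take their input size from the previous output.
            input_dim = self.kernel_shapes[-1][1]
        elif input_dim is None:
            raise ValueError("the first layer needs an explicit input_dim")
        self.kernel_shapes.append((input_dim, units))


toy = ToySequential()
toy.add_dense(64, input_dim=784)
toy.add_dense(10)
print(toy.kernel_shapes)  # [(784, 64), (64, 10)]
```

Leaving out the input size on the very first layer raises an error in this sketch, since there is nothing to infer it from.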

The compilation step prepares the model for training, recording the loss function, the optimization scheme, and additional metrics to measure.

model.compile(loss='categorical_crossentropy',
              optimizer=tf.train.GradientDescentOptimizer(0.5),
              metrics=['accuracy'])

Then we can fit the model on the training data.

model.fit(Xtrain, ytrain,
          epochs=EPOCHS,
          batch_size=BATCH_SIZE,
          validation_data=(Xtest, ytest))

Train on 55000 samples, validate on 10000 samples
Epoch 20/20
55000/55000 [==============================] - 2s - loss: 0.0382 - acc: 0.9908 - val_loss: 0.0810 - val_acc: 0.9748

Unsurprisingly, the performance is essentially identical to what we’ve seen before.

Conclusion

It's nice to know the power of raw TensorFlow is available, but most of the time, you'll want a more succinct syntax. The TensorFlow layer API simplifies the construction of a neural network, but not the training. TFLearn and Keras offer two choices for a higher-level API that hides some of the details of training. The Keras API is a bit more object oriented than the TFLearn API, but their capabilities are similar. Keras’s adoption into the TensorFlow project suggests a bright future for the project, but TFLearn is going strong itself. In the end, choose the API that works best for you.
