Practical Deep Learning for Cloud and Mobile by Meher Kasam, Siddha Ganju, Anirudh Koul

Chapter 1. Image Classification with Keras

A Note for Early Release Readers

This will be the 2nd chapter of the final book.

If you have comments about how we might improve the content and/or examples in this book, or if you notice missing material, please reach out to the authors at practicaldlbook@gmail.com.

Now that we have started our journey into deep learning, let’s get our hands a little dirty.

If you have skimmed through deep learning literature, you may have come across a barrage of confusing academic explanations laced with scary mathematics. Don’t worry. We will ease you into practical deep learning by showing how easily you can classify images with just a few lines of code.

In this chapter, we will introduce Keras, discuss its place in the deep learning landscape, and then use it to classify a few images using existing state-of-the-art classifiers. We will visually investigate how these classifiers operate by using heatmaps. With these heatmaps, we’ll make a fun project where we classify objects in videos.

Where’s the theory behind this, you might wonder? That will come later. Using this chapter as a foundation, we will delve deeper into the nuts and bolts of convolutional neural networks in the chapters that follow. After all, there’s no better way to learn about and appreciate the components of a system than to dive right in and use them!

Introduction to Keras

Keras is a high-level neural network API designed to provide a simplified abstraction layer above several deep learning libraries such as TensorFlow, Theano, CNTK, PlaidML, MXNet, and more. This abstraction makes it easier and quicker to code deep neural networks with Keras than using the libraries themselves. While beginner-friendly, Keras has enough functionality for quick prototyping and even professional-level, heavy-duty training. In this book, we will primarily use Keras with a TensorFlow backend.
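
You can check which backend your installation is configured to use; Keras reads this setting from a small JSON file (typically at ~/.keras/keras.json) and also exposes it programmatically. A minimal check, assuming a standard installation:

import keras
# Prints the name of the configured backend: 'tensorflow', 'theano', or 'cntk'
print(keras.backend.backend())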

Layers of Abstraction

One can draw parallels between the layers of abstraction in deep learning and those in computer programming. Much like how a computer programmer could write code in machine language (theoretically albeit painfully), assembly language, or higher-level languages, a deep learning practitioner can write training and inference programs using low-level frameworks such as CUDA, libraries like TensorFlow, or high-level frameworks such as Keras. In both cases, greater abstraction means greater ease of use, at the expense of flexibility.

Here are the building blocks for most deep learning libraries running on NVIDIA GPUs. The higher the level, the higher the abstraction.

Figure 1-1. Levels of abstraction for different libraries. Abstraction increases in the direction of the arrows.

Since NVIDIA is the leading provider of GPUs used for deep learning, it supplies the lowest layers of this stack: drivers, CUDA, and cuDNN. Drivers help the operating system interface with the GPU hardware. CUDA, which stands for Compute Unified Device Architecture, provides direct access to the GPU's virtual instruction set and the ability to execute parallel compute kernels. The CUDA Deep Neural Network library, or cuDNN, is built on top of CUDA and provides highly tuned implementations of standard routines and primitives for deep neural networks, such as convolution, pooling, normalization, and activation layers. Deep learning libraries like TensorFlow reuse these primitives and provide the inference engine (i.e., the system that computes predictions from data). Finally, Keras adds another level of abstraction, further reducing the code needed to build such a model.
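
To make the contrast concrete, here is a rough sketch (not taken from the book's code) of the same convolution expressed at two levels: TensorFlow's low-level primitive, which ultimately dispatches to cuDNN on a GPU, versus a single Keras layer. The shapes and filter counts are illustrative.

import tensorflow as tf
from keras.layers import Conv2D

# Low level: you create and shape the filter variable yourself
images = tf.placeholder(tf.float32, [None, 224, 224, 3])
filters = tf.Variable(tf.random_normal([3, 3, 3, 64]))
features = tf.nn.conv2d(images, filters, strides=[1, 1, 1, 1], padding='SAME')

# High level: the Keras layer creates and tracks its weights for you
conv_layer = Conv2D(64, (3, 3), padding='same', activation='relu')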

Higher-level abstractions provide a form of leverage: you can do more with fewer lines of code. Let's test this theory by comparing Keras with TensorFlow on one of the most famous tasks in deep learning: training a handwritten digit classifier (on the MNIST dataset) using a convolutional neural network. Starting from publicly available tutorial code, we stripped out everything except the core code and found that Keras requires roughly half the keystrokes of TensorFlow for the same task, as shown in Table 1-1.

Table 1-1. Example showing lines of code and character count at two abstraction levels. Higher levels of abstraction permit the same work to be accomplished with fewer lines and characters.

Library      Line count   Character count (no spaces)   Avg. character count per line
TensorFlow   31           2162                          70
Keras        22           1018                          46
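
To give a flavor of the Keras side of that comparison, here is a condensed sketch of an MNIST convolutional classifier in the spirit of the public Keras examples; the exact layer sizes and hyperparameters are illustrative.

from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.utils import to_categorical

# Load MNIST and scale pixel values from [0, 255] to [0, 1]
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype('float32') / 255
x_test = x_test.reshape(-1, 28, 28, 1).astype('float32') / 255
y_train, y_test = to_categorical(y_train), to_categorical(y_test)

# A small convolutional network, defined layer by layer
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=1, batch_size=128,
          validation_data=(x_test, y_test))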

In addition to being easier to use, Keras is quite popular within the open-source community. A good measure of an open-source project’s popularity is the number of people who contribute to its codebase. As of March 2018, the following is a comparison of Keras to other libraries on GitHub:

Table 1-2. Stars and contributions to each framework's GitHub repo. It's worth remembering that many contributors to TensorFlow are Googlers, while Keras is a lot more "grassroots," with a diverse contributor base.

Library                         Stars   Contributors
tensorflow/tensorflow           92150   1357
fchollet/keras                  26744   638
BVLC/caffe                      23159   264
Microsoft/CNTK                  13995   173
dmlc/mxnet                      13318   492
pytorch/pytorch                 12835   414
deeplearning4j/deeplearning4j   8472    140
caffe2/caffe2                   7540    176

Since its inception in 2015, Keras has consistently attracted more users, quickly becoming the framework of choice for deep learning after TensorFlow. Due to its large user base and the open-source development community behind it, you can readily find examples of many tasks on GitHub and in other documentation sources, making Keras easy for beginners to learn. It is also versatile, allowing various deep learning backends (such as TensorFlow, CNTK, and Theano) to be used for training, so it does not lock you into a specific ecosystem. This makes Keras ideal for anyone making a foray into deep learning.

Keras in Practice

Predicting an Image’s Category

As we covered in Chapter 1, image classification answers the question “does the image contain X?” where X can be virtually any category or class of objects. The process can be broken down into the following steps:

  1. Load an image

  2. Resize it to 224×224 pixels

  3. Normalize the pixel values (e.g., scale them to the range [−1,1]), a step known as preprocessing

  4. Select a pretrained model

  5. Run the model on the preprocessed image to predict its category

Here’s some sample code for predicting categories of an image, which uses some of the helpful functions that Keras provides in its modules. As you do more coding, you’ll often find that the layer or pre-processing step you need is already implemented in Keras, so remember to read the documentation.

In the GitHub repo, navigate to code/chapter2. All the steps we will be following are also detailed in the Jupyter notebook ‘1_predict_class.ipynb’.

We start by importing all the necessary modules from the Keras and Python packages.

# Pretrained ResNet-50 model with ImageNet-trained weights
from keras.applications.resnet50 import ResNet50
# Helpers to preprocess inputs and decode the model's predictions
from keras.applications.imagenet_utils import preprocess_input, decode_predictions
# Utilities for loading images and converting them to arrays
from keras.preprocessing import image
import numpy as np
import matplotlib.pyplot as plt

Next, we load and display the image that we want to classify.

img_path = '../../sample_images/cat.jpg'
img = image.load_img(img_path, target_size=(224, 224))
plt.imshow(img)
plt.show()
Figure 1-2. Plot showing the contents of the file cat.jpg.

It’s a cat! And that’s what our model should ideally be predicting.

Note

A Brief Refresher on Images

Before we dive into how images are processed, it would be good to take a look at how images store information. At the most basic level, an image is a collection of pixels that are laid out in a rectangular grid. Depending on the type of image, each pixel can consist of 1 to 4 parts (also known as components or channels). With the image we will be using, these components represent the intensities of Red, Green and Blue colors (RGB). They are typically 8 bits in length, so their values range between 0 and 255 (i.e., 2⁸ − 1).

In machine learning, it has been empirically shown that taking an input within an arbitrary range and scaling it to the interval [0,1] or [−1,1] improves a network's chances of learning. This step is commonly referred to as normalization, and it is one of the core steps in preprocessing images to make them suitable for deep learning.
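
As a concrete illustration, pixel values in [0, 255] can be mapped to either interval with simple arithmetic (a sketch; the exact transform each pretrained model expects may differ):

import numpy as np

pixels = np.array([0, 64, 128, 255], dtype=np.float32)
scaled_01 = pixels / 255.0        # maps [0, 255] to [0, 1]
scaled_11 = pixels / 127.5 - 1.0  # maps [0, 255] to [-1, 1]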

We want to replicate the same preprocessing steps that were used during the original training of the pretrained model. Luckily, Keras provides a handy function, preprocess_input, that does this for us. Before feeding any image to Keras, we also want to convert it to a standard format: we resize the image to 224×224 pixels so that the input size is uniform. Additionally, the model has been trained to accept a batch of multiple images rather than one image at a time. Since we only have one image, we create a batch containing just that image. We achieve this by adding an extra dimension (which represents the position of the image within the batch) at the start of our image matrix, as shown in the code below:

# Convert the PIL image to a NumPy array of shape (224, 224, 3)
img_array = image.img_to_array(img)
# Add a batch dimension at the front, giving shape (1, 224, 224, 3)
img_batch = np.expand_dims(img_array, axis=0)

We then send the batch to the Keras preprocessing function.
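
Concretely, that is a single call (the same line appears again in the consolidated listing below):

img_preprocessed = preprocess_input(img_batch)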

The model we will use is ‘ResNet-50’, which belongs to the family of models that won the ImageNet 2015 competition in classification, detection and localization tasks. It also won the MS COCO 2015 competition in detection and segmentation tasks.

First, we need to load the model. Instead of hunting for the model architecture and pretrained weights on the internet, Keras provides access to both in a single function call. The first time you call it, the weights are downloaded from a remote server and cached locally.

The resulting preprocessed image is fed into the model, which gives us probability predictions for each class. Keras also provides the decode_predictions function, which maps these probabilities onto the names of the most relevant categories.

# Load the ResNet-50 architecture along with its ImageNet-trained weights
model = ResNet50()
# Convert to an array, batch it, and apply model-specific preprocessing
img_array = image.img_to_array(img)
img_batch = np.expand_dims(img_array, axis=0)
img_preprocessed = preprocess_input(img_batch)
# Run inference and display the three most probable categories
prediction = model.predict(img_preprocessed)
decode_predictions(prediction, top=3)[0]
[('n02123045', 'tabby', 0.50009364),
 ('n02124075', 'Egyptian_cat', 0.21690978),
 ('n02123159', 'tiger_cat', 0.2061722)]

The predicted categories for this image are various felines. Why doesn’t it simply predict the word ‘cat’ instead? The short answer is that the ResNet-50 model was trained on a granular dataset with many categories and does not include the more general ‘cat’. We will soon investigate this dataset in more detail, but first let’s load another sample image.

img_path = '../../sample_images/dog.jpg'
img = image.load_img(img_path, target_size=(224, 224))
plt.imshow(img)
plt.show()
Figure 1-3. Plot showing the contents of the file dog.jpg.

And again we run our prediction:

model = ResNet50()
img_array = image.img_to_array(img)
img_batch = np.expand_dims(img_array, axis=0)
img_preprocessed = preprocess_input(img_batch)
prediction = model.predict(img_preprocessed)
decode_predictions(prediction, top=3)[0]
[('n02113186', 'Cardigan', 0.71606547),
 ('n02113023', 'Pembroke', 0.26909366),
 ('n02110806', 'basenji', 0.0051731034)]

As expected, we get different breeds of the canine family (and not just the ‘dog’ category).

Note

When using a pretrained model, it is important to know the preprocessing steps that were involved in training that model. The same steps need to be applied to images used for prediction. For example, for a model originally trained in Caffe, preprocessing involves converting images from RGB to BGR and then zero-centering each color channel with respect to the ImageNet dataset without scaling (i.e., subtracting the mean value of each color channel in the ImageNet dataset).
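
In recent versions of Keras, the generic preprocess_input helper exposes these conventions through a mode argument; the following sketch assumes a Keras 2.x installation where that argument is available.

from keras.applications.imagenet_utils import preprocess_input

# 'caffe' mode: convert RGB to BGR, then zero-center each channel using
# ImageNet means, without scaling (the convention ResNet-50 expects)
img_caffe = preprocess_input(img_batch.copy(), mode='caffe')

# 'tf' mode: simply scale pixel values to the range [-1, 1]
img_tf = preprocess_input(img_batch.copy(), mode='tf')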

Analysis

A Model Zoo in Keras

A model zoo is a place where organizations or individuals publish open-source models so that others can use them. These models may be trained with a particular framework (e.g., Caffe, TensorFlow), for a particular task (e.g., classification, detection), or on a particular dataset (e.g., ImageNet, the Street View House Numbers dataset). In short, any model zoo is a collection of models trained under a similar set of constraints.

The tradition of model zoos started with Caffe, one of the first deep learning frameworks, developed at the University of California, Berkeley. Training a deep learning model from scratch on a multimillion-image database requires weeks of training time and a great deal of GPU compute, making it a difficult task for most practitioners. The research community recognized this bottleneck, and the organizations that participated in the ImageNet competition open-sourced their trained models on Caffe's website. Other frameworks soon followed suit.

If you are starting out on a new task, remember to first check if there is already an existing model that could be of assistance.

The Model Zoo in Keras is a collection of various architectures trained using the Keras framework on the ImageNet dataset. We tabulate their details in Table 1-3.

Table 1-3. Details of ImageNet-trained models

Model                 Size     Top-1 Accuracy   Top-5 Accuracy   Parameters    Depth
Inception-ResNet-V2   215 MB   0.804            0.953            55,873,736    572
Xception              88 MB    0.79             0.945            22,910,480    126
Inception-v3          92 MB    0.788            0.944            23,851,784    159
DenseNet-201          80 MB    0.77             0.933            20,242,984    201
ResNet-50             99 MB    0.759            0.929            25,636,712    168
DenseNet-169          57 MB    0.759            0.928            14,307,880    169
DenseNet-121          33 MB    0.745            0.918            8,062,504     121
VGG-19                549 MB   0.727            0.91             143,667,240   26
VGG-16                528 MB   0.715            0.901            138,357,544   23
MobileNet             17 MB    0.665            0.871            4,253,864     88

The column ‘Top-1 Accuracy’ indicates how many times the best guess was the correct answer, and the column ‘Top-5 Accuracy’ indicates how many times at least one out of five guesses was correct. The ‘Depth’ of the network indicates how many layers are present in the network. The ‘Parameters’ column indicates the size of the model: the more parameters, the “heavier” the model is, and the slower it is to make predictions. In this book, you will often see us use ResNet-50 (the most common architecture cited in research papers for high accuracy) and MobileNet (for good balance between speed, size, and accuracy).
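
Swapping between these architectures in Keras is typically a one-line change. For example, to trade ResNet-50's accuracy for MobileNet's speed and size:

from keras.applications.resnet50 import ResNet50
from keras.applications.mobilenet import MobileNet

heavy_model = ResNet50(weights='imagenet')   # higher accuracy, larger and slower
light_model = MobileNet(weights='imagenet')  # smaller and faster, slightly less accurate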

What Does My Neural Network Think?

Now we will perform a fun experiment to try to understand why the neural network made a particular prediction. What part of an image made the network decide that it contained, for example, a cat or a dog? It would be helpful to visualize the decision-making going on within the network, which we can do with a heatmap. A heatmap uses color to identify the areas within an image that prompted a decision: "hot" spots, represented by warmer colors (red, orange, and yellow), highlight the areas with the maximum signal, where the signal indicates how strongly an area of the image contributed to the predicted category.

In the GitHub repo, navigate to code/chapter2. There you will find a handy Jupyter notebook, ‘2_what_does_my_neural_network_think.ipynb’, which walks through the following steps.

First, we will need to install the necessary libraries:

(practicaldl) $ pip3 install keras-vis --user
(practicaldl) $ pip3 install Pillow --user
(practicaldl) $ pip3 install matplotlib --user

We then run the visualization script on a single image to generate the heatmap for it:

(practicaldl) $ python3 visualization.py --process image --path ../sample_images/dog.jpg

You should see a newly created file called dog_output.jpg that shows a side-by-side view of the original image and its heatmap. As you can see from Figure 1-6, the right half of the image indicates the “areas of heat” along with the correct prediction of a ‘Cardigan (Welsh Corgi)’.

Figure 1-6. Original image of a dog and its generated heatmap.
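
If you are curious about what visualization.py might be doing under the hood, the keras-vis library we installed provides a class activation map function, visualize_cam. The following is a minimal sketch under the assumption that the script uses it; the layer lookup mirrors keras-vis's documented usage.

from keras.applications.resnet50 import ResNet50
from keras.preprocessing import image
from vis.utils import utils
from vis.visualization import visualize_cam

model = ResNet50()
img = image.load_img('../sample_images/dog.jpg', target_size=(224, 224))
img_array = image.img_to_array(img)

# Find the index of ResNet-50's final (softmax) layer, named 'fc1000'
layer_idx = utils.find_layer_idx(model, 'fc1000')

# Compute a class activation heatmap for the top predicted class;
# warm regions contributed most to the prediction
heatmap = visualize_cam(model, layer_idx, filter_indices=None,
                        seed_input=img_array)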

Next, we want to visualize the heatmap for frames in a video. For that, we need ffmpeg, an open source multimedia framework. You can find the download binary as well as the installation instructions for your operating system at www.ffmpeg.org.

We will use ffmpeg to split a video into individual frames and then run our visualization script on each of those frames. We must first create a directory to store these frames and use its name in the ffmpeg command.

(practicaldl) $ mkdir kitchen
(practicaldl) $ ffmpeg -i video/kitchen_input.mov -vf fps=25 kitchen/thumb%04d.jpg -hide_banner

We then run the visualization script with the path of the directory containing the frames from the previous step:

(practicaldl) $ python3 visualization.py --process video --path kitchen/

You should see a newly created kitchen_output directory that contains all the heatmaps for the frames from the input directory.

Finally, compile a video from those frames using ffmpeg:

(practicaldl) $ ffmpeg -framerate 25 -i kitchen_output/result_%04d.jpg kitchen_output.mp4

Perfect! Imagine generating heatmaps to analyze the strong points and shortfalls of your trained model or a pretrained model. Don’t forget to post your videos on Twitter with the hashtag #PracticalDL!

Summary

In this chapter, we got a glimpse of the deep learning universe using Keras. It’s an easy-to-use yet powerful framework that we’ll use in the next several chapters. We observed that there is often no need to collect millions of images and use powerful GPUs to train a custom model, because we can use a pretrained model to predict the category of an image. By diving deeper into datasets like ImageNet, we learned the kinds of categories these pretrained models can predict. We also learned about finding these models in model zoos that exist for most frameworks.

In the next chapter, we will explore how we can tweak an existing pretrained model to make predictions on classes of input for which it was not originally intended. As with the current chapter, our approach is geared toward obtaining output without needing millions of images and lots of hardware resources to train a classifier.
