Grid of images after transformations are performed. (source: Tuhin Sharma, used with permission)

Digital marketing is the marketing of products, services, and offerings on digital platforms. Advertising technology, commonly known as "ad tech," is the use of digital technologies by vendors, brands, and their agencies to target potential clients, deliver personalized messages and offerings, and analyze the impact of online spending: sponsored stories on Facebook newsfeeds; Instagram stories; ads that play on YouTube before the video content begins; the recommended links at the end of a CNN article, powered by Outbrain—these all are examples of ad tech at work.

Over the past year, deep learning has seen significant use in digital marketing and ad tech.

In this article, we will delve into one part of a popular use case: mining the Web for celebrity endorsements. Along the way, we’ll see the relative value of deep learning architectures, run actual experiments, learn the effects of data sizes, and see how to augment the data when we don’t have enough.

Use case overview

In this article, we will see how to build a deep learning classifier that predicts the company, given an image containing a logo. This section provides an overview of where this model could be used.

Celebrities endorse a number of products. Quite often, they post pictures on social media showing off a brand they endorse. A typical post of that type contains an image, with the celebrity and some text they have written. The brand, in turn, is eager to learn about the appearance of such postings, and to show them to potential customers who might be influenced by them.

The ad tech application, therefore, works as follows: large numbers of postings are fed to a processor that figures out the celebrity, the brand, and the message. Then, for each potential customer, the machine learning model generates a very specific advertisement based on the time, location, message, brand, customers' preferred brands, and other things. Another model identifies the target customer base. And the targeted ad is now sent.

Figure 1 shows the workflow:

Figure 1. Celebrity brand-endorsement bot workflow. Image by Tuhin Sharma.

As you can see, the system is composed of a number of machine learning models.

Consider the image. The picture could have been taken in any setting. The first goal is to identify the objects and the celebrity in the picture. This is done by object detection models. Then, the next step is to identify the brand, if one appears. The easiest way to identify the brand is by its logo.

In this article, we will look into building a deep learning model to identify a brand by its logo in an image. Subsequent articles will talk about building some of the other pieces of the bot (object detection, text generation, etc.).

Problem definition

The problem addressed in this article is: given an image, predict the company (brand) in the image by identifying the logo.

Data

To build machine learning models, access to high-quality data sets is imperative. In real life, data scientists work with brand managers and agencies to get all possible logos.

For the purpose of this article, we will leverage the FlickrLogo data set. This data set has real-world images from Flickr, a popular photo sharing website. The FlickrLogo page has instructions on how to download the data. Please download the data if you want to use the code in this article to build your own models.

Models

Identifying the brand from its logo is a classic computer vision problem. In the past few years, deep learning has become the state of the art for computer vision problems. We will be building deep learning models for this use case.

Software

In our previous article, we talked about the strengths of Apache MXNet. We also talked about Gluon, the simpler interface on top of MXNet. Both are extremely powerful and allow deep learning engineers to experiment rapidly with various model architectures.

Let's now get to the code.

Libraries

Let's first import the libraries we need for building the models:

import mxnet as mx
import cv2
from pathlib import Path
import os
from time import time
import shutil
import matplotlib.pyplot as plt
%matplotlib inline

Load the data

From the FlickrLogos data sets, let's use the FlickrLogos-32 data set. <flickrlogos-url> is the URL to this data set.

%%capture
!wget -nc <flickrlogos-url> # Replace with the URL to the dataset
!unzip -n ./FlickrLogos-32_dataset_v2.zip

Data preparation

The next step is to create the following data sets:

  1. Train
  2. Validation
  3. Test

The FlickrLogos data set already has train, validation, and test data sets, dividing the images as follows:

  • The train data set has 32 classes, each containing 10 images.
  • The validation data set has 3,960 images, of which 3,000 images have no logos.
  • The test data set has 3,960 images.

While the train images all have logos, the validation and test sets also include images with no logos. We want to build a model that generalizes well, that is, one that predicts correctly on images that weren't used for training (validation and test images).

To make our learning faster, with better accuracy, for the purpose of this article, we will move 50% of the no-logo class from the validation data set to the training set. So, we will make the training data set of size 1,820 (after adding 1,500 no-logo images from validation set) and reduce the validation data set size to 2,460 (after moving out 1,500 no-logo images). In a real-life setting, we will experiment with different model architectures to choose the one that performs well on the actual validation and test data sets.
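
The sizes quoted above follow from simple arithmetic. The short check below is only a sanity test of the numbers in the text; the variable names are illustrative and not part of the article's code.

```python
# Sanity check of the data set sizes quoted in the text.
num_logo_classes = 32
images_per_class = 10
val_total = 3960
val_no_logo = 3000

moved = val_no_logo // 2          # 50% of the no-logo validation images
train_size = num_logo_classes * images_per_class + moved
val_size = val_total - moved

print(train_size, val_size)
```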

Next, define the directory where the data is stored.

data_directory = "./FlickrLogos-v2/"

Now, define the path to the train, test, and validation data sets. For validation, we define two paths: one for the images containing logos and one for the rest of the images without logos.

train_logos_list_filename = data_directory+"trainset.relpaths.txt"
val_logos_list_filename = data_directory+"valset-logosonly.relpaths.txt"
val_nonlogos_list_filename = data_directory+"valset-nologos.relpaths.txt"
test_list_filename = data_directory+"testset.relpaths.txt"

Let's now read the filenames for train, test, and validation (logo and non-logo) from the list just defined.

The list is given in the FlickrLogo data set, which has already categorized the images as train, test, validation with logo, and validation without logo.

# List of train images 
with open(train_logos_list_filename) as f:
    train_logos_filename = f.read().splitlines()
# List of validation images without logos
with open(val_nonlogos_list_filename) as f:
    val_nonlogos_filename = f.read().splitlines()
# List of validation images with logos    
with open(val_logos_list_filename) as f:
    val_logos_filename = f.read().splitlines()
# List of test images 
with open(test_list_filename) as f:
    test_filenames = f.read().splitlines()

Now, move some of the validation images without logos to the set of train images. This set will end up with all the train images and 50% of no-logo images from the validation data set. The validation set will end up with all the validation images that have logos and the remaining 50% of no-logo images.

train_filenames = train_logos_filename + val_nonlogos_filename[0:int(len(val_nonlogos_filename)/2)]
val_filenames = val_logos_filename + val_nonlogos_filename[int(len(val_nonlogos_filename)/2):]

To verify what we’ve done, let's print the number of images in the train, test and validation data sets.

print("Number of Training Images : ",len(train_filenames))
print("Number of Validation Images : ",len(val_filenames))
print("Number of Testing Images : ",len(test_filenames))

The next step in the data preparation process is to set the folder paths in a way that makes model training easy.

We need the folder structure to be like Figure 2.

Figure 2. Folder structure for data. Image by Tuhin Sharma.

The following function helps us create this structure.

def prepare_datasets(base_directory,filenames,dest_folder_name):
    for filename in filenames:
        # Source path of the image inside the extracted data set
        image_src_path = base_directory+filename
        # Same relative path, rooted at the destination folder instead of classes/jpg
        image_dest_path = image_src_path.replace('classes/jpg',dest_folder_name)
        # Create the class folder if it does not exist yet, then copy the image
        dest_directory_path = Path(os.path.dirname(image_dest_path))
        dest_directory_path.mkdir(parents=True,exist_ok=True)
        shutil.copy2(image_src_path, image_dest_path)

Call this function to create the train, validation, and test folders with the images placed under them within their respective classes.

prepare_datasets(base_directory=data_directory,filenames=train_filenames,dest_folder_name='train_data')
prepare_datasets(base_directory=data_directory,filenames=val_filenames,dest_folder_name='val_data')
prepare_datasets(base_directory=data_directory,filenames=test_filenames,dest_folder_name='test_data')
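
To see what the path rewrite above does without copying any files, here is a minimal sketch. It assumes that entries in the list files look like classes/jpg/<brand>/<file>.jpg, which is what the 'classes/jpg' replacement implies; the helper name destination_path and the file name example.jpg are hypothetical.

```python
import os

def destination_path(base_directory, relpath, dest_folder_name):
    """Return where an image would be copied, without touching the disk."""
    src = base_directory + relpath
    return src.replace('classes/jpg', dest_folder_name)

# Hypothetical entry; real entries come from the *.relpaths.txt files.
example = destination_path("./FlickrLogos-v2/",
                           "classes/jpg/adidas/example.jpg",
                           "train_data")
# The parent folder name becomes the class label when the folder is loaded.
class_name = os.path.basename(os.path.dirname(example))
print(example, class_name)
```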

The next step is to define the hyperparameters for the model.

We have 33 classes (32 logos and 1 non-logo). The data size isn't huge, so we will use only one GPU. We will train for 20 epochs and use 40 as the batch size for training.

batch_size = 40
num_classes = 33
num_epochs = 20
num_gpu = 1
ctx = [mx.gpu(i) for i in range(num_gpu)]

Data pre-processing

Once the images are loaded, we need to ensure the images are of the same size. We will crop all the images to 224 * 224 pixels.

We have 1,820 training images, which is really not much data. Is there a smart way to get more data? A resounding yes. An image, when flipped, still means the same thing, at least for logos. A random crop of the logo is also still the same logo.

So, we do not need to add images for the purposes of our training, but instead can transform some of the existing images by flipping them and cropping them. This helps us get a more robust model.

Let's flip each training image horizontally with probability 0.5 and take a random 224 * 224 pixel crop of it.

train_augs = [
    mx.image.HorizontalFlipAug(.5),
    mx.image.RandomCropAug((224,224))
]

For the validation and test data sets, let's center crop to get each image to 224 * 224. All the images in the train, test, and validation data sets will now be of 224 * 224 size.

val_test_augs = [
    mx.image.CenterCropAug((224,224))
]
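
To make the two operations concrete, the following sketch applies a horizontal flip and a center crop to a tiny image stored as a list of rows. It is a plain-Python illustration of the idea, not the MXNet implementation; the function names are made up for this example.

```python
def horizontal_flip(image):
    """Mirror an image (a list of rows) left to right."""
    return [row[::-1] for row in image]

def center_crop(image, size):
    """Cut a size x size window out of the middle of the image."""
    top = (len(image) - size) // 2
    left = (len(image[0]) - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]

# A 4 x 4 "image" whose pixel values encode their position.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
flipped = horizontal_flip(image)
cropped = center_crop(image, 2)
print(flipped[0], cropped)
```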

To perform the transforms we want on images, define the function transform. Given the data and the augmentation type, it performs the transformation on the data and returns the updated data set.

def transform(data, label, augs):
    data = data.astype('float32')
    for aug in augs:
        data = aug(data)
    # from (H x W x c) to (c x H x W)
    data = mx.nd.transpose(data, (2,0,1))
    return data, mx.nd.array([label]).asscalar().astype('float32')

Gluon has a utility function to load image files: mx.gluon.data.vision.ImageFolderDataset. It requires the data to be available in the folder structure illustrated in Figure 2.

The function takes in the following parameters:

  • Path to the root directory where the images are stored
  • A flag to instruct if images have to be converted to greyscale or color (color is the default option)
  • A function that takes the data (image) and its label and transforms them

The following code shows how to transform the image when loading:

train_imgs = mx.gluon.data.vision.ImageFolderDataset(
    data_directory+'train_data',
    transform=lambda X, y: transform(X, y, train_augs))

Similarly, the transformations are applied to the validation and test data sets and are loaded.

val_imgs = mx.gluon.data.vision.ImageFolderDataset(
    data_directory+'val_data',
    transform=lambda X, y: transform(X, y, val_test_augs))
test_imgs = mx.gluon.data.vision.ImageFolderDataset(
    data_directory+'test_data',
    transform=lambda X, y: transform(X, y, val_test_augs))

DataLoader is the built-in utility function to load data from the data set, and it returns mini-batches of data. In the above steps, we have the train, validation, and test data sets defined (train_imgs, val_imgs, and test_imgs, respectively). The num_workers attribute lets us define the number of multi-processing workers to use for data pre-processing.

train_data = mx.gluon.data.DataLoader(train_imgs, batch_size,num_workers=1, shuffle=True)
val_data = mx.gluon.data.DataLoader(val_imgs, batch_size, num_workers=1)
test_data = mx.gluon.data.DataLoader(test_imgs, batch_size, num_workers=1)
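
Conceptually, mini-batching just slices the data set into consecutive chunks. The sketch below shows this in plain Python for 1,820 items and a batch size of 40, assuming the final partial batch is kept; iterate_batches is a hypothetical helper, not part of Gluon.

```python
def iterate_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

samples = list(range(1820))   # stand-ins for the 1,820 training images
batches = list(iterate_batches(samples, 40))
print(len(batches), len(batches[-1]))
```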

Now that the images are loaded, let's take a look at them. Let's write a utility function called show_images that displays the images as a grid:

def show_images(imgs, nrows, ncols, figsize=None):
    """plot a grid of images"""
    # Default to one inch per cell unless the caller supplies a figure size
    if figsize is None:
        figsize = (ncols, nrows)
    _, figs = plt.subplots(nrows, ncols, figsize=figsize)
    for i in range(nrows):
        for j in range(ncols):
            figs[i][j].imshow(imgs[i*ncols+j].asnumpy())
            figs[i][j].axes.get_xaxis().set_visible(False)
            figs[i][j].axes.get_yaxis().set_visible(False)
    plt.show()

Now, display the first 32 images in a grid of 4 rows and 8 columns:

for X, _ in train_data:
    # from (B x c x H x W) to (Bx H x W x c)
    X = X.transpose((0,2,3,1)).clip(0,255)/255
    show_images(X, 4, 8)
    break

Figure 3. Grid of images after transformations are performed. Image by Tuhin Sharma.

Results are shown in Figure 3. Some of the images seem to contain logos, often truncated.

Utility functions for training

In this section, we will define utility functions to do the following:

  • Get the data for the batch being currently processed
  • Evaluate the accuracy of the model
  • Train the model
  • Get the image, given a URL
  • Predict the image's label, given the image

The first function, _get_batch, returns the data and label, given the batch.

def _get_batch(batch, ctx):
    """return data and label on ctx"""
    data, label = batch
    return (mx.gluon.utils.split_and_load(data, ctx),
            mx.gluon.utils.split_and_load(label, ctx),
            data.shape[0])

The function evaluate_accuracy returns the classification accuracy of the model. We have chosen a simple accuracy metric for the purpose of this article. In practice, the accuracy metric is chosen based on the application need.

def evaluate_accuracy(data_iterator, net, ctx):
    acc = mx.nd.array([0])
    n = 0.
    for batch in data_iterator:
        data, label, batch_size = _get_batch(batch, ctx)
        for X, y in zip(data, label):
            acc += mx.nd.sum(net(X).argmax(axis=1)==y).copyto(mx.cpu())
            n += y.size
        acc.wait_to_read()
    return acc.asscalar() / n
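
The metric itself is simple: take the highest-scoring class for each image and count how often it matches the label. Here is a plain-Python sketch of that computation on made-up scores; the function names are illustrative.

```python
def argmax(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=lambda i: scores[i])

def accuracy(batch_scores, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    correct = sum(1 for scores, y in zip(batch_scores, labels)
                  if argmax(scores) == y)
    return correct / len(labels)

# Made-up scores for four images and three classes.
scores = [[0.1, 0.7, 0.2],   # predicted class 1
          [0.8, 0.1, 0.1],   # predicted class 0
          [0.3, 0.3, 0.4],   # predicted class 2
          [0.2, 0.5, 0.3]]   # predicted class 1
labels = [1, 0, 1, 1]
result = accuracy(scores, labels)
print(result)
```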

The next function we will define is the train function. This is by far the biggest function we will create in this article.

Given an existing model, the train, test, and validation data sets, the model is trained for the number of epochs specified. Our previous article contained a more detailed overview of how this function works.

Whenever the best accuracy on the validation data set is found, the model is checkpointed. For each epoch, the train, validation, and test accuracies are printed.

def train(net, ctx, train_data, val_data, test_data, batch_size, num_epochs, model_prefix, hybridize=False, learning_rate=0.01, wd=0.001):

    net.collect_params().reset_ctx(ctx)
    if hybridize:
        net.hybridize()
    loss = mx.gluon.loss.SoftmaxCrossEntropyLoss()
    trainer = mx.gluon.Trainer(net.collect_params(), 'sgd', {
        'learning_rate': learning_rate, 'wd': wd})

    best_epoch = -1
    best_acc = 0.0

    if isinstance(ctx, mx.Context):
        ctx = [ctx]
    for epoch in range(num_epochs):
        train_loss, train_acc, n = 0.0, 0.0, 0.0
        start = time()
        for i, batch in enumerate(train_data):
            data, label, batch_size = _get_batch(batch, ctx)
            losses = []
            with mx.autograd.record():
                outputs = [net(X) for X in data]
                losses = [loss(yhat, y) for yhat, y in zip(outputs, label)]
            for l in losses:
                l.backward()
            train_loss += sum([l.sum().asscalar() for l in losses])
            trainer.step(batch_size)
            n += batch_size

        train_acc = evaluate_accuracy(train_data, net, ctx)
        val_acc = evaluate_accuracy(val_data, net, ctx)
        test_acc = evaluate_accuracy(test_data, net, ctx)
        print("Epoch %d. Loss: %.3f, Train acc %.2f, Val acc %.2f, Test acc %.2f, Time %.1f sec" % (
            epoch, train_loss/n, train_acc, val_acc, test_acc, time() - start
        ))
        if val_acc > best_acc:
            best_acc = val_acc
            if best_epoch!=-1:
                print('Deleting previous checkpoint...')
                os.remove(model_prefix+'-%d.params'%(best_epoch))
            best_epoch = epoch
            print('Best validation accuracy found. Checkpointing...')
            net.collect_params().save(model_prefix+'-%d.params'%(epoch))
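
The checkpointing rule at the end of the function can be isolated from the training loop. The sketch below replays it on a made-up list of validation accuracies; note that the strict comparison means a tie with the current best does not trigger a new checkpoint. The helper name is hypothetical.

```python
def best_checkpoint_epochs(val_accuracies):
    """Return the epochs at which a new checkpoint would be written."""
    best_acc = 0.0
    checkpoints = []
    for epoch, val_acc in enumerate(val_accuracies):
        if val_acc > best_acc:      # strict: a tie does not replace the checkpoint
            best_acc = val_acc
            checkpoints.append(epoch)
    return checkpoints

# Made-up validation accuracies, for illustration only.
history = [0.50, 0.55, 0.55, 0.60, 0.58]
saved = best_checkpoint_epochs(history)
print(saved)
```

The last entry of the returned list is the epoch whose parameter file remains on disk after training.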

The function get_image downloads the image at a given URL, displays it, and returns the local filename. This is used for testing the model's accuracy.

def get_image(url, show=False):
    # download and show the image
    fname = mx.test_utils.download(url)
    img = cv2.cvtColor(cv2.imread(fname), cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, (224, 224))
    plt.imshow(img)
    return fname

The final utility function we will define is classify_logo. Given the model and the URL of an image, the function returns the class of the image (in this case, the brand name) and its associated probability.

def classify_logo(net, url):
    fname = get_image(url)
    with open(fname, 'rb') as f:
        img = mx.image.imdecode(f.read())
    data, _ = transform(img, -1, val_test_augs)
    data = data.expand_dims(axis=0)
    out = net(data.as_in_context(ctx[0]))
    out = mx.nd.SoftmaxActivation(out)
    pred = int(mx.nd.argmax(out, axis=1).asscalar())
    prob = out[0][pred].asscalar()
    label = train_imgs.synsets
    return 'With prob=%f, %s'%(prob, label[pred])
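
The last few lines of the function turn raw network outputs into a probability and a class name. A plain-Python sketch of that step follows; the scores and class names are made up, and the real function gets its class names from the training data set.

```python
import math

def softmax(scores):
    """Convert raw scores to probabilities that sum to 1."""
    shift = max(scores)   # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs and class names, for illustration only.
class_names = ['adidas', 'bmw', 'no-logo']
raw_scores = [1.0, 3.0, 0.5]
probs = softmax(raw_scores)
pred = max(range(len(probs)), key=lambda i: probs[i])
print('With prob=%f, %s' % (probs[pred], class_names[pred]))
```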

Model

Understanding the model architecture is quite important. In our previous article, we built a multi-layer perceptron (MLP). The architecture is shown in Figure 4.

Figure 4. Multi-layer perceptron. Image by Tuhin Sharma.

What would the input layer for an MLP model look like? Our data is 224 * 224 pixels in size.

The most common way to create the input layer from that is to flatten it and create an input layer with 50,176 (224 * 224) neurons, ending up with a simple bit stream as shown in Figure 5.

Figure 5. Flattened input. Image by Tuhin Sharma.

But image data has a lot of spatial information that is lost when such flattening is done. And the other challenge is the number of weights. If the first hidden layer has 30 hidden neurons, the number of parameters in the model will be 50,176 * 30 + 30 bias units. So, this doesn't seem to be the right modeling approach for images.
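
The count is easy to reproduce. The sketch below computes the parameters of that first fully connected layer; the three-channel variant is an assumption that goes beyond the single-channel simplification used in the text.

```python
def dense_parameters(num_inputs, num_outputs):
    """Weights plus one bias per output neuron of a fully connected layer."""
    return num_inputs * num_outputs + num_outputs

inputs_one_channel = 224 * 224
first_layer = dense_parameters(inputs_one_channel, 30)

# Assumption beyond the text: a color image has three channels,
# which triples the number of inputs.
first_layer_rgb = dense_parameters(224 * 224 * 3, 30)
print(inputs_one_channel, first_layer, first_layer_rgb)
```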

Let's now discuss the more appropriate architecture: a convolutional neural network (CNN) for image classification.

Convolutional neural network (CNN)

CNNs are similar to MLPs, in the sense that they are also made up of neurons whose weights we learn. The key difference is that the inputs are images, and the architecture is designed to exploit the properties of images.

CNNs have convolutional layers. The term "convolution" is taken from image processing, and it is described by Figure 6. This works on a small window, called a "receptive field," instead of all the inputs from the previous layer. This allows the model to learn localized features.

Each layer moves a small matrix, called a kernel, over the part of the image fed to that layer. It adjusts each pixel to reflect the pixels around it, an operation that helps identify edges. Figure 6 shows an image on the left, a 3x3 kernel in the middle, and the results of applying the kernel to the top-left pixel on the right. We can also define multiple kernels, representing different feature maps.

Figure 6. Convolutional layer. Image by Tuhin Sharma.

In the example in Figure 6, the input image was 5x5 and the kernel was 3x3. At each position, the computation was an element-wise multiplication between the kernel and the image patch under it, followed by a sum of the products. The output was 5x5.

To understand this, we need to understand two parameters at the convolution layer: stride and padding.

Stride controls how the kernel (filter) moves along the image.

Figure 7 illustrates the movement of the kernel from the first pixel to the second.

Figure 7. Kernel movement. Image by Tuhin Sharma.

In Figure 7, the stride is 1.

When a 5x5 image is convolved with a 3x3 kernel at a stride of 1, we get a 3x3 output, because the kernel fits in only three positions along each axis. Now consider the case where we add zero padding around the image. The 5x5 image is now surrounded with 0. This is illustrated in Figure 8.

Figure 8. Zero padding. Image by Tuhin Sharma.

This padded image, when convolved with a 3x3 kernel, results in a 5x5 output.

So, for the computation shown in Figure 6, it had a stride of 1 and padding of size 1.
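
The relationship between input size, kernel size, stride, and padding can be checked with a small sketch. The code below is a naive plain-Python version written for this illustration (square inputs, stride 1 in the sliding window); as in most deep learning libraries, the kernel is applied without flipping.

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Output width for an n-wide input and a k-wide kernel."""
    return (n + 2 * padding - k) // stride + 1

def convolve2d(image, kernel, padding=0):
    """Stride-1 sliding-window sum of element-wise products, with zero padding."""
    n, k = len(image), len(kernel)
    size = n + 2 * padding
    padded = [[0] * size for _ in range(size)]
    for r in range(n):
        for c in range(n):
            padded[r + padding][c + padding] = image[r][c]
    out = size - k + 1
    return [[sum(padded[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in range(out)] for r in range(out)]

image = [[1] * 5 for _ in range(5)]    # 5x5 image of ones
kernel = [[1] * 3 for _ in range(3)]   # 3x3 kernel of ones
no_pad = convolve2d(image, kernel)
with_pad = convolve2d(image, kernel, padding=1)
print(len(no_pad), len(with_pad))
```

With padding, the corner outputs are smaller than the central ones because part of the kernel overlaps the zeros around the image.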

A CNN works with drastically fewer weights than the corresponding MLP. Say we use 30 kernels, each with 3x3 elements. Each kernel has 3x3 = 9 weights plus 1 bias, or 10 parameters, which gives 300 parameters for 30 kernels. Contrast this with the roughly 1.5 million weights (50,176 * 30) for the MLP in the previous section.
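
The same count can be written as a small function. This sketch reproduces the single-channel figure from the text; the three-channel case is an assumption added here to show how the count grows for color inputs.

```python
def conv_parameters(kernel_size, in_channels, num_kernels):
    """Each kernel has kernel_size^2 weights per input channel, plus one bias."""
    return num_kernels * (kernel_size * kernel_size * in_channels + 1)

single_channel = conv_parameters(3, 1, 30)
# Assumption beyond the text: for a three-channel color input,
# each kernel spans all channels.
three_channel = conv_parameters(3, 3, 30)
print(single_channel, three_channel)
```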

The next layer is typically a sub-sampling layer. Once we have identified the features, this sub-sampling layer simplifies the information. A common method is max pooling, which outputs the greatest value from each localized region of the output from the convolutional layer (see Figure 9).

Figure 9. Max pooling. Image by Tuhin Sharma.

You can see that it reduces the output size, while preserving the maximum activation in every localized region.
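
As a rough illustration of max pooling, the sketch below pools a made-up 4x4 feature map with a 2x2 window and a stride of 2. It is a plain-Python version written for this example, not the Gluon layer.

```python
def max_pool2d(feature_map, pool_size=2, stride=2):
    """Keep the largest value in each pool_size x pool_size window."""
    rows = (len(feature_map) - pool_size) // stride + 1
    cols = (len(feature_map[0]) - pool_size) // stride + 1
    return [[max(feature_map[r * stride + i][c * stride + j]
                 for i in range(pool_size) for j in range(pool_size))
             for c in range(cols)] for r in range(rows)]

# Made-up 4x4 activations, for illustration only.
feature_map = [[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 1],
               [3, 4, 5, 8]]
pooled = max_pool2d(feature_map)
print(pooled)
```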

A good resource for more information on CNNs is the online book, Neural Networks and Deep Learning. Another good resource is Stanford University's CNN course.

Now that we have learned the basics of what a CNN is, let’s implement one for our problem using Gluon.

The first step is to define the architecture:

cnn_net = mx.gluon.nn.Sequential()
with cnn_net.name_scope():
    #  First convolutional layer
    cnn_net.add(mx.gluon.nn.Conv2D(channels=96, kernel_size=11, strides=(4,4), activation='relu'))
    cnn_net.add(mx.gluon.nn.MaxPool2D(pool_size=3, strides=2))
    #  Second convolutional layer
    cnn_net.add(mx.gluon.nn.Conv2D(channels=192, kernel_size=5, activation='relu'))
    cnn_net.add(mx.gluon.nn.MaxPool2D(pool_size=3, strides=(2,2)))
    # Flatten and apply fully connected layers
    cnn_net.add(mx.gluon.nn.Flatten())
    cnn_net.add(mx.gluon.nn.Dense(4096, activation="relu"))
    cnn_net.add(mx.gluon.nn.Dense(num_classes))
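
It can help to trace how the image size changes through these layers and how many parameters result. The sketch below does this by hand, assuming a 224 * 224 color input with three channels, no padding, and floor rounding of the window arithmetic (believed to be the layer defaults, but treat that as an assumption).

```python
def out_size(n, k, stride=1, padding=0):
    """Spatial size after a convolution or pooling window (floor division)."""
    return (n + 2 * padding - k) // stride + 1

size = 224
size = out_size(size, 11, stride=4)   # first convolution
size = out_size(size, 3, stride=2)    # first max pooling
size = out_size(size, 5)              # second convolution
size = out_size(size, 3, stride=2)    # second max pooling
flattened = 192 * size * size         # 192 feature maps after the second convolution

conv1 = 96 * (11 * 11 * 3 + 1)        # 3 input channels (color image)
conv2 = 192 * (5 * 5 * 96 + 1)
dense1 = flattened * 4096 + 4096
dense2 = 4096 * 33 + 33               # 33 output classes
total = conv1 + conv2 + dense1 + dense2
print(size, flattened, total)
```

Under these assumptions, almost all of the parameters sit in the first fully connected layer.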

Now that the model architecture is defined, let's initialize the weights of the network. We will use the Xavier initializer.

cnn_net.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)

Once the weights are initialized, we can train the model. We will call the same train function defined earlier and pass the required parameters for the function.

train(cnn_net, ctx, train_data, val_data, test_data, batch_size, num_epochs,model_prefix='cnn')

Epoch 0. Loss: 53.771, Train acc 0.77, Val acc 0.58, Test acc 0.72, Time 224.9 sec
Best validation accuracy found. Checkpointing...
Epoch 1. Loss: 3.417, Train acc 0.80, Val acc 0.60, Test acc 0.73, Time 222.7 sec
Deleting previous checkpoint...
Best validation accuracy found. Checkpointing...
Epoch 2. Loss: 3.333, Train acc 0.81, Val acc 0.60, Test acc 0.74, Time 222.5 sec
Deleting previous checkpoint...
Best validation accuracy found. Checkpointing...
Epoch 3. Loss: 3.227, Train acc 0.82, Val acc 0.61, Test acc 0.75, Time 222.4 sec
Deleting previous checkpoint...
Best validation accuracy found. Checkpointing...
Epoch 4. Loss: 3.079, Train acc 0.82, Val acc 0.61, Test acc 0.75, Time 222.0 sec
Deleting previous checkpoint...
Best validation accuracy found. Checkpointing...
Epoch 5. Loss: 2.850, Train acc 0.82, Val acc 0.61, Test acc 0.76, Time 222.7 sec
Deleting previous checkpoint...
Best validation accuracy found. Checkpointing...
Epoch 6. Loss: 2.488, Train acc 0.82, Val acc 0.61, Test acc 0.76, Time 222.1 sec
Epoch 7. Loss: 1.943, Train acc 0.82, Val acc 0.61, Test acc 0.76, Time 221.3 sec
Epoch 8. Loss: 1.395, Train acc 0.82, Val acc 0.61, Test acc 0.76, Time 223.6 sec
Epoch 9. Loss: 1.146, Train acc 0.82, Val acc 0.61, Test acc 0.76, Time 222.5 sec
Epoch 10. Loss: 1.089, Train acc 0.82, Val acc 0.61, Test acc 0.76, Time 221.5 sec
Epoch 11. Loss: 1.078, Train acc 0.82, Val acc 0.61, Test acc 0.76, Time 220.7 sec
Epoch 12. Loss: 1.078, Train acc 0.82, Val acc 0.61, Test acc 0.76, Time 221.1 sec
Epoch 13. Loss: 1.075, Train acc 0.82, Val acc 0.61, Test acc 0.76, Time 221.3 sec
Epoch 14. Loss: 1.076, Train acc 0.82, Val acc 0.61, Test acc 0.76, Time 221.3 sec
Epoch 15. Loss: 1.076, Train acc 0.82, Val acc 0.61, Test acc 0.76, Time 220.4 sec
Epoch 16. Loss: 1.075, Train acc 0.82, Val acc 0.61, Test acc 0.76, Time 221.3 sec
Epoch 17. Loss: 1.074, Train acc 0.82, Val acc 0.61, Test acc 0.76, Time 221.8 sec
Epoch 18. Loss: 1.074, Train acc 0.82, Val acc 0.61, Test acc 0.76, Time 221.8 sec
Epoch 19. Loss: 1.073, Train acc 0.82, Val acc 0.61, Test acc 0.76, Time 220.9 sec

We asked the model to run for 20 epochs. Typically, we train for many epochs and pick the model at the epoch where the validation accuracy is the highest. Here, after 20 epochs, we can see from the log just shown that the model's best validation accuracy was in epoch 5. After that, the model doesn't seem to have learned much. Probably, the network was saturated and learning took place very slowly. We’ll try out a better approach in the next section, but first we’ll see how our current model performs.

Collect the parameters of the epoch that had the best validation accuracy and assign it as our model parameters:

cnn_net.collect_params().load('cnn-%d.params'%(5),ctx)

Let’s now check how the model performs on new data. We’ll get an easy-to-recognize image from the Web (Figure 10) and see what the model predicts.

img_url = "http://sophieswift.com/wp-content/uploads/2017/09/pleasing-ideas-bmw-cake-and-satisfying-some-bmw-themed-cakes-crustncakes-delicious-cakes-128x128.jpg"
classify_logo(cnn_net, img_url)

'With prob=0.081522, no-logo'

Figure 10. BMW logo. Image by Tuhin Sharma.

The model’s prediction is poor. It predicts that the image has no logo, with a probability of about 8%. The prediction is wrong and the probability is quite weak.

Let’s try one more test image (see Figure 11) to see whether accuracy is any better.

img_url = "https://dtgxwmigmg3gc.cloudfront.net/files/59cdcd6f52ba0b36b5024500-icon-256x256.png"
classify_logo(cnn_net, img_url)

'With prob=0.075301, no-logo'

Figure 11. Foster’s logo. Image by Tuhin Sharma.

Yet again, the model’s prediction is wrong and the probability is quite weak.

We don't have much data, and the model training has saturated, as just seen. We can experiment with more model architectures, but we won’t overcome the problems of small data sets and trainable parameters much greater than the number of training images. So, how do we get around this problem? Can't deep learning be used if there isn't much data?

The answer to that is transfer learning, discussed next.

Transfer learning

Consider this analogy: you want to pick up a new foreign language. How does the learning happen?

You would take a simple conversation, for example:

Instructor: How are you doing?
You: I am good. How about you?

And you will try to learn the equivalent of this in the new language.

Because of your proficiency in English, you don't start learning a new language from scratch (even if it seems that you do). You already have the mental map of a language, and you try to find the corresponding words in the new language. Therefore, in the new language, while your vocabulary might still be limited, you will still be able to converse because of your knowledge of the structure of conversations in English.

Transfer learning works the same way. Highly accurate models are built on data sets where a lot of data is available. A common data set that you will come across is the ImageNet data. It has more than a million images. Researchers from around the world have built many different state-of-the-art models using this data. The resulting models, each comprising a model architecture and weights, are freely available on the internet.

And starting from that pre-trained model, we will train the model for our problem. In fact, this is quite the norm. Almost invariably, the first model one would build for a computer vision problem would employ a pre-trained model.

In many cases, like our example, this might be all one can do—if restricted for data.

The typical practice is to keep many of the early layers fixed, and train only the last layers. If data is quite limited, only the classifier layer is re-trained. If data is moderately abundant, the last few layers are re-trained.

This works because a convolutional neural network learns higher level representation at each successive layer; the learning it has done at many of the early layers is held in common by all image classification problems.

Let's now use a pre-trained model for logo detection.

MXNet has a model zoo with a number of pre-trained models.

We will use a popular pre-trained model called ResNet. The ResNet paper provides a lot of details on the model structure. A simpler explanation can be found in this article.

Let's first download the pre-trained model:

from mxnet.gluon.model_zoo import vision as models

pretrained_net = models.resnet18_v2(pretrained=True)

Since our data set is small, we will re-train only the output layer. We randomly initialize the weights for the output layer:

finetune_net = models.resnet18_v2(classes=num_classes)
finetune_net.features = pretrained_net.features
finetune_net.output.initialize(mx.init.Xavier(magnitude=2.24))

We now call the same train function as before:

train(finetune_net, ctx, train_data, val_data, test_data, batch_size, num_epochs,model_prefix='ft',hybridize = True)

Epoch 0. Loss: 1.107, Train acc 0.83, Val acc 0.62, Test acc 0.76, Time 246.1 sec
Best validation accuracy found. Checkpointing...
Epoch 1. Loss: 0.811, Train acc 0.85, Val acc 0.62, Test acc 0.77, Time 243.7 sec
Deleting previous checkpoint...
Best validation accuracy found. Checkpointing...
Epoch 2. Loss: 0.722, Train acc 0.86, Val acc 0.64, Test acc 0.78, Time 245.3 sec
Deleting previous checkpoint...
Best validation accuracy found. Checkpointing...
Epoch 3. Loss: 0.660, Train acc 0.87, Val acc 0.66, Test acc 0.79, Time 243.4 sec
Deleting previous checkpoint...
Best validation accuracy found. Checkpointing...
Epoch 4. Loss: 0.541, Train acc 0.88, Val acc 0.67, Test acc 0.80, Time 244.5 sec
Deleting previous checkpoint...
Best validation accuracy found. Checkpointing...
Epoch 5. Loss: 0.528, Train acc 0.89, Val acc 0.68, Test acc 0.80, Time 243.4 sec
Deleting previous checkpoint...
Best validation accuracy found. Checkpointing...
Epoch 6. Loss: 0.490, Train acc 0.90, Val acc 0.68, Test acc 0.81, Time 243.2 sec
Deleting previous checkpoint...
Best validation accuracy found. Checkpointing...
Epoch 7. Loss: 0.453, Train acc 0.91, Val acc 0.71, Test acc 0.82, Time 243.6 sec
Deleting previous checkpoint...
Best validation accuracy found. Checkpointing...
Epoch 8. Loss: 0.435, Train acc 0.92, Val acc 0.70, Test acc 0.82, Time 245.6 sec
Epoch 9. Loss: 0.413, Train acc 0.92, Val acc 0.72, Test acc 0.82, Time 247.7 sec
Deleting previous checkpoint...
Best validation accuracy found. Checkpointing...
Epoch 10. Loss: 0.392, Train acc 0.92, Val acc 0.72, Test acc 0.83, Time 245.3 sec
Deleting previous checkpoint...
Best validation accuracy found. Checkpointing...
Epoch 11. Loss: 0.377, Train acc 0.92, Val acc 0.72, Test acc 0.83, Time 244.5 sec
Epoch 12. Loss: 0.335, Train acc 0.93, Val acc 0.72, Test acc 0.84, Time 244.2 sec
Epoch 13. Loss: 0.321, Train acc 0.94, Val acc 0.73, Test acc 0.84, Time 245.0 sec
Deleting previous checkpoint...
Best validation accuracy found. Checkpointing...
Epoch 14. Loss: 0.305, Train acc 0.93, Val acc 0.73, Test acc 0.84, Time 243.4 sec
Deleting previous checkpoint...
Best validation accuracy found. Checkpointing...
Epoch 15. Loss: 0.298, Train acc 0.93, Val acc 0.73, Test acc 0.84, Time 243.9 sec
Epoch 16. Loss: 0.296, Train acc 0.94, Val acc 0.75, Test acc 0.84, Time 247.0 sec
Deleting previous checkpoint...
Best validation accuracy found. Checkpointing...
Epoch 17. Loss: 0.274, Train acc 0.94, Val acc 0.74, Test acc 0.84, Time 245.1 sec
Epoch 18. Loss: 0.292, Train acc 0.94, Val acc 0.74, Test acc 0.84, Time 243.9 sec
Epoch 19. Loss: 0.306, Train acc 0.95, Val acc 0.73, Test acc 0.84, Time 244.8 sec

The model starts right away with a higher accuracy. Typically, when data is limited, we train for only a few epochs and pick the model at the epoch where the validation accuracy is the highest.

Here, epoch 16 has the best validation accuracy. Since the training data is limited, and the model kept on training, it has started to overfit. We can see that after epoch 16, while training accuracy is increasing, validation accuracy has begun to decrease.
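
Picking the best epoch from a training log can be automated. The sketch below parses the last five lines of the log shown above and selects the epoch with the highest validation accuracy; the regular expression is written for this log format only.

```python
import re

log_lines = [
    "Epoch 15. Loss: 0.298, Train acc 0.93, Val acc 0.73, Test acc 0.84, Time 243.9 sec",
    "Epoch 16. Loss: 0.296, Train acc 0.94, Val acc 0.75, Test acc 0.84, Time 247.0 sec",
    "Epoch 17. Loss: 0.274, Train acc 0.94, Val acc 0.74, Test acc 0.84, Time 245.1 sec",
    "Epoch 18. Loss: 0.292, Train acc 0.94, Val acc 0.74, Test acc 0.84, Time 243.9 sec",
    "Epoch 19. Loss: 0.306, Train acc 0.95, Val acc 0.73, Test acc 0.84, Time 244.8 sec",
]

# Capture the epoch number and the validation accuracy from each line.
pattern = re.compile(r"Epoch (\d+)\..*Val acc ([\d.]+)")
records = []
for line in log_lines:
    match = pattern.search(line)
    if match:
        records.append((int(match.group(1)), float(match.group(2))))

best_epoch, best_val = max(records, key=lambda rec: rec[1])
print(best_epoch, best_val)
```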

Let's collect the parameters from the checkpoint saved at epoch 16 and use them as the final model.

# The model's parameters are now set to the values at the 16th epoch
finetune_net.collect_params().load('ft-%d.params'%(16),ctx)

Evaluating the predictions

For the same images that we used earlier to evaluate the predictions, let's see the prediction of the new model.

img_url = "http://sophieswift.com/wp-content/uploads/2017/09/pleasing-ideas-bmw-cake-and-satisfying-some-bmw-themed-cakes-crustncakes-delicious-cakes-128x128.jpg"
classify_logo(finetune_net, img_url)

'With prob=0.983476, bmw'

Figure 12. BMW logo. Image by Tuhin Sharma.

We can see that the model is able to predict BMW with 98% probability.

Let's now try the other image we tested earlier.

img_url = "https://dtgxwmigmg3gc.cloudfront.net/files/59cdcd6f52ba0b36b5024500-icon-256x256.png"
classify_logo(finetune_net, img_url)

'With prob=0.498218, fosters'

While the prediction probability isn't good, a tad lower than 50%, Foster's still gets the highest probability amongst all the logos.

Improving the model

To improve the model, we need to fix the way we constructed the training data set. Each individual logo had 10 training points. But as part of distributing the no-logo images from validation to training, we moved 1,500 images to training as no logo. This introduces a significant data set bias. This is not a good practice. The following are some options to fix this:

  • Weight the cross-entropy loss.
  • Don't include the no-logo images in the training data set. Build a model that assigns low class probabilities to all logos if none appears in a test/validation image.
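
As a rough sketch of the first option, class weights can be derived from the class counts. The scheme below, which weights each class by total / (number of classes * count), is one common heuristic and an assumption here, not something the article prescribes; the class names logo_0 to logo_31 are placeholders.

```python
def balanced_class_weights(class_counts):
    """Weight each class by total / (num_classes * count), so rare classes count more."""
    total = sum(class_counts.values())
    num_classes = len(class_counts)
    return {name: total / (num_classes * count)
            for name, count in class_counts.items()}

# 32 logo classes with 10 images each, plus 1,500 no-logo images (from the text).
counts = {"logo_%d" % i: 10 for i in range(32)}
counts["no-logo"] = 1500
weights = balanced_class_weights(counts)
ratio = weights["logo_0"] / weights["no-logo"]
print(ratio)
```

Each per-example loss would then be multiplied by the weight of its true class, so that a mistake on a logo image costs far more than a mistake on a no-logo image.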

But remember that even with transfer learning and data augmentation, we have only 320 logo images, which is quite few for building highly accurate deep learning models.

Conclusion

In this article, we learned how to build image recognition models using MXNet. Gluon is ideal for rapid prototyping. Moving from prototyping to production is also quite easy with hybridization and symbol export. With a host of pre-trained models available on MXNet, we were able to get very good models for logo detection quickly. A very good resource for learning more about the underlying theory is Stanford's CS231n course.
