
Practical Deep Learning for Cloud and Mobile by Meher Kasam, Siddha Ganju, Anirudh Koul


Chapter 4. 15 Minutes to Fame: Up and Running with Cloud APIs

A Note for Early Release Readers

This will be the 8th chapter of the final book.

If you have comments about how we might improve the content and/or examples in this book, or if you notice missing material, please reach out to the authors at practicaldlbook@gmail.com.

As we discussed in the previous chapter, Bob just purchased a house and has found a nice, fancy, modern sofa. He’s quite comfortable in his new home and is constantly buying things for decoration. He likes to order his goods online, which then get delivered to his doorstep. However, he recently had a package stolen off his porch and has been paranoid ever since. He decides he needs to do something about this.

There are several expensive home security systems on the market to choose from. However, being a tinkerer, he wants to set up his own security system. His needs are simple. He wants to take photos of people in front of his door, and send a notification to his phone with the picture. That way, he can send them to the local police in the event of another theft. He devises a simple rig consisting of a motion sensor and a cheap camera. He sets it up in front of his door. It’s doing its job well in capturing images of people visiting his house. Every time the motion sensor triggers, the camera takes a picture and sends him an SMS with that picture. Unfortunately, cars and even a strong breeze end up triggering the motion sensor quite frequently, leading to him being bombarded all day with unnecessary notifications. All he wants is a simple way to tell if a frame contains a person. He figures out that deep learning must provide a solution to his conundrum.

Sure, he could train his own neural network by providing it with manually labeled data from the camera. After training the neural network, he’d have to set up a server that hosts the model and performs predictions. This server could reside at his house on a desktop. He’d also be responsible for maintaining the server: upgrading software (including security updates), managing outages, handling hardware failures, and so on. Alternatively, he could host a VM in the cloud, which would reduce some of this effort. Either way, he’d additionally have to build a secure web API to perform predictions. This is much more work than he wants to sign up for. All he needs is a ready-to-use service that gives him his predictions.

Hearing about this, Alice suggests that he look into prebuilt cloud recognition APIs: send a photo, get a prediction back, all with just a few lines of code. All the big names, Amazon, Google, IBM, and Microsoft, provide a similar set of computer vision APIs that tag images, detect and recognize faces, read text, and sometimes even handwriting. Some of them even let you train your own classifier without writing a single line of code. How cool is that!

In the background, researchers are constantly working to improve the state of the art in computer vision. Might as well make good use of their blood, sweat, and tears (and electricity bill)!

In this chapter, we will explore several cloud-based visual recognition APIs, comparing them both quantitatively and qualitatively, hopefully making it easier for you to choose the one that best suits your application. And if they still don’t match your needs, we’ll investigate how to train your own classifier, without a single line of code. So, in the immortal words of the Black Eyed Peas, let’s get it started!

Visual Recognition APIs: An Overview


Clarifai

Clarifai was the winner of the 2013 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) classification task. Started by Matthew Zeiler, a graduate student at New York University, it was one of the first visual recognition API companies out there.


Fun fact: while building a classifier to detect NSFW (Not Safe For Work) images, the Clarifai team found it important to understand and debug what the CNN was learning, in order to reduce false positives. This led them to invent a visualization technique that exposes which images stimulate feature maps at any layer in the CNN. As they say, necessity is the mother of invention.

What’s unique about this API? It offers multilingual tagging in 20 languages, visual similarity search among previously uploaded photographs, a face-based multicultural appearance classifier, a photograph aesthetic scorer, a focus scorer, and embedding-vector generation to help you build your own reverse image search. Through their public API, the image tagger supports 11,000 concepts.

Figure 4-1. Sample of Clarifai’s results.

Microsoft Cognitive Services

With the creation of the 152-layer CNN called ResNet-152 in 2015, Microsoft won seven tasks at the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), the COCO Image Captioning Challenge, as well as the Emotion Recognition in the Wild challenge, ranging from classification and detection (localization) to image description. Much of this research was translated into cloud APIs. Originally launched as Project Oxford from Microsoft Research in 2015, the offering was renamed Cognitive Services in 2016. It’s a comprehensive set of more than 50 APIs spanning vision, natural language processing, speech, search, knowledge graph linkage, and more. Historically, many of the same libraries powered divisions like Xbox and Bing, but they are now exposed to external developers. Viral applications showcasing creative uses of these APIs include how-old.net (How Old Do I Look?), Mimicker Alarm (which requires you to make a particular facial expression in order to defuse the morning alarm), and CaptionBot.ai.

What’s unique about this API? It offers image captioning, handwriting understanding, and headwear recognition. Owing to its many enterprise customers, Cognitive Services does not use customers’ image data to improve its services.

Figure 4-2. Sample of Microsoft’s results.

Google Cloud Vision

Google, the home of AI and TensorFlow, provided the winning entry at the 2014 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) with the 22-layer GoogLeNet, which eventually paved the way for the now-staple Inception architectures. Building on the Inception models, Google released a suite of Vision APIs in December 2015. In the world of deep learning, large amounts of data are a definite advantage for improving one’s classifier, and Google has a lot of consumer data. For example, with learnings from Google Street View, one should expect relatively good performance on real-world text-extraction tasks, like reading billboards.

What’s unique about this API? For human faces, it provides the most detailed facial keypoints, including ‘roll’, ‘tilt’, and ‘pan’, to accurately localize facial features. The APIs also return web images similar to the given input. A simple way to try out the system’s performance without writing code is to upload photographs to Google Photos and search through the tags.

Figure 4-3. Sample of Google’s results.

Amazon Rekognition

Yes, that last word is not a spelling mistake. The Amazon Rekognition API is largely based on Orbeus, a Sunnyvale, California-based startup acquired by Amazon in late 2015. Founded in 2012, Orbeus had a chief scientist with winning entries in the ILSVRC 2014 detection challenge, and the same APIs powered PhotoTime, a famous photo-organization app. The services are available as part of the Amazon Web Services (AWS) offerings. While most companies offer photo-analysis APIs, Amazon is doubling down on video-recognition offerings.

What’s unique about this API? License-plate recognition, video-recognition APIs, and better end-to-end integration examples of the Rekognition APIs with AWS offerings like Kinesis Video Streams, Lambda, and others. Also, Amazon’s API is the only one that can tell whether the subject’s eyes are open or closed.

IBM Watson Visual Recognition

Under the Watson brand, IBM’s visual recognition offering started in early 2015. After IBM purchased AlchemyAPI, a Denver-based startup, its AlchemyVision technology was used to power the Visual Recognition APIs. Like the others, IBM offers custom training of your own classifiers. Surprisingly, Watson does not yet offer optical character recognition.

Figure 4-4. Sample of IBM Watson’s results.


Algorithmia

Algorithmia is a marketplace for hosting algorithms as APIs in the cloud. Founded in 2013, this Seattle-based startup hosts both its own in-house algorithms and those created by others (whose creators earn revenue based on the number of calls). In our experience, this API did tend to have the slowest response time.

What’s unique about this API? A colorization service (for black-and-white photos), image stylization, and image similarity, plus the ability to run these services on-premises or on any cloud provider.

Figure 4-5. Sample of Algorithmia’s style transfer results.

With so many offerings, it can be overwhelming to choose a service, and there are many reasons to pick one over another. For most developers, the biggest factors are accuracy and price. Accuracy is the big promise of the deep learning revolution, and many applications require it consistently. Price might be an additional consideration, and you might also choose a provider simply because your company already has a billing account with them. Speed of the API response matters too, especially if a user is waiting on the other end. Fortunately, since these API calls can be abstracted, it’s easy to switch between providers.

Visual Recognition APIs: A Comparison

To aid your decision making, let’s compare these APIs head-to-head. We’ll examine service offerings, cost, and accuracy of each.

Service Offerings

Let’s examine what services are being offered by each cloud provider:

The services compared below span Algorithmia, Amazon Rekognition, Clarifai, Microsoft Cognitive Services, Google Cloud Vision, and Watson Visual Recognition:

  • Image tagging

  • Face recognition

  • Emotion recognition

  • Logo recognition

  • Landmark recognition

  • Celebrity recognition

  • Multilingual tagging

  • Image description

  • Thumbnail generation

  • Content moderation

  • Custom training

  • Mobile custom models

  • Free tier (5,000 requests per month for most providers; 1,000 requests per month for Google Cloud Vision)

That’s a mouthful of services already up and running, ready to be used in your application. Since everyone likes numbers, and hard data makes decisions easier, it’s time to analyze these services on two factors: cost and accuracy.


Cost

Since money doesn’t grow on trees (yet), it’s important to analyze the economics of using off-the-shelf APIs. Taking the heavy-duty example of querying these APIs at about one query per second (QPS) for one full month (roughly 2.6 million requests per month), here’s a comparison of the different providers, sorted by estimated cost (as of February 1, 2018):

Figure 4-6. Comparing costs for the different APIs.

While this is an extreme scenario for most developers, it would be a pretty realistic load for a large corporation. We will eventually compare these prices against running your own service in the cloud, to make sure you get the most bang for your buck in your scenario.

That said, many developers will incur negligible charges, because cloud providers often have a free tier of 5,000 calls per month (except Google Cloud Vision, which gives only 1,000 free calls per month) and charge roughly $1 per 1,000 calls thereafter.
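As a back-of-the-envelope check on the 1 QPS figure above, here's the arithmetic (the $1-per-1,000-calls price and 5,000-call free tier are the rough numbers from the text, not any provider's exact rate card):

```python
# Rough monthly cost of a sustained 1 query-per-second workload.
QPS = 1
SECONDS_PER_MONTH = 60 * 60 * 24 * 30
requests_per_month = QPS * SECONDS_PER_MONTH  # 2,592,000 (~2.6 million)

FREE_TIER = 5000        # free calls per month (1,000 for Google Cloud Vision)
PRICE_PER_1000 = 1.00   # rough $1 per 1,000 calls thereafter

billable_calls = max(0, requests_per_month - FREE_TIER)
monthly_cost = billable_calls / 1000 * PRICE_PER_1000
print(requests_per_month, round(monthly_cost, 2))  # 2592000 2587.0
```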


Accuracy

In a world ruled by marketing departments claiming their organizations are the market leaders, how do you judge which tool is actually the best? What we need are common metrics to compare these service providers on external datasets. We will use the following datasets to test specific scenarios:

  • COCO for image tagging

  • COCO-Text for text extraction

For image tagging, we’ll use the 2017 validation dataset and calculate accuracy as the number of correctly predicted tags relative to the total number of tags in the ground truth. A predicted tag matches the ground truth when one of the following two conditions is satisfied:

  1. The predicted tag is an exact match to a tag in the ground truth.

  2. The predicted tag is closely related to a tag in the ground truth. Examples include synonyms, gendered versions of words, or words that share a specific class. We obtain these relationships using the similarity function in Word2Vec: if a predicted tag has a similarity score of 0.3 or more to a tag in the ground truth, we consider it a match. You may want to play with the threshold to get the best results for your needs.
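The matching rule above can be sketched in a few lines. The similarity function is pluggable; in practice you would back it with the similarity method of a pretrained Word2Vec model (e.g., via gensim's KeyedVectors), and the function names here are our own illustration:

```python
SIMILARITY_THRESHOLD = 0.3  # tune for your needs

def tag_matches(predicted_tag, ground_truth_tags, similarity):
    """True if the predicted tag exactly matches, or is semantically
    close to, any ground-truth tag."""
    if predicted_tag in ground_truth_tags:
        return True
    return any(similarity(predicted_tag, gt) >= SIMILARITY_THRESHOLD
               for gt in ground_truth_tags)

def tagging_accuracy(predicted_tags, ground_truth_tags, similarity):
    """Fraction of ground-truth tags matched by at least one prediction."""
    matched = sum(1 for gt in ground_truth_tags
                  if any(tag_matches(p, [gt], similarity)
                         for p in predicted_tags))
    return matched / len(ground_truth_tags)
```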

Figure 4-7. Comparing Image Classification Accuracy

Note that Algorithmia runs the base Inception model, which is limited to 1,000 tags, while most of the others have a larger taxonomy.

For OCR, we use the word error rate (WER) metric on the COCO-Text dataset, 2017 validation split. To keep things simple, we ignore the position of each word and only check whether it is present at all.
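This simplified, position-insensitive variant can be implemented directly (note this is our metric as described above, not the classic edit-distance WER):

```python
def word_error_rate(ground_truth_words, predicted_words):
    """Fraction of ground-truth words missing from the OCR output,
    ignoring word order and position."""
    truth = [w.lower() for w in ground_truth_words]
    predicted = {w.lower() for w in predicted_words}
    errors = sum(1 for w in truth if w not in predicted)
    return errors / len(truth)

print(word_error_rate(['STOP', 'AHEAD'], ['stop']))  # 0.5
```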

Figure 4-8. Word error rate for the different APIs.

As always, all the code that we have used to do our experiments is hosted on GitHub.

The results of our analysis depend significantly on the dataset we chose, as well as our metrics. Depending on your dataset (which is in turn influenced by your use case), as well as your metrics, your results may vary. Additionally, the service providers are constantly improving their services in the background. As a consequence, these results are not set in stone. You are welcome to use our scripts on GitHub to run on your own dataset.

Get Up and Running with Cloud APIs

Calling these cloud services requires minimal code. At a high level: get an API key, load the image, specify the intent, make a POST request with the proper encoding (e.g., Base64 for the image), and receive the results. Most cloud providers offer SDKs and sample code showcasing how to call their services, along with pip-installable Python packages that simplify things further. If you are calling Amazon Rekognition, we highly recommend using their pip package.

Let’s reuse our thrilling image to test-run these services!

First, we will try Microsoft Cognitive Services. Get an API key and replace it in the code below. The first 5,000 calls are free, more than enough for our experiments.

import http.client, urllib.parse

def cognitiveservices_tagimage(filename):
    headers = {
        'Content-Type': 'application/octet-stream',
        'Ocp-Apim-Subscription-Key': 'REPLACE_API_KEY_HERE',
    }
    params = urllib.parse.urlencode(
        {'language': 'en', 'visualFeatures': 'Description'})
    endpoint = '/vision/v1.0/analyze?%s' % params
    conn = http.client.HTTPSConnection('westus.api.cognitive.microsoft.com')
    with open(filename, 'rb') as image_file:
        conn.request('POST', endpoint, image_file.read(), headers)
    response = conn.getresponse()
    print(response.read().decode())


{
    "description": {
        "tags": ["person", "indoor", "sitting", "food", "table", "little", "small", "dog",
        "child", "looking", "eating", "baby", "young", "front", "feeding", "holding",
        "playing", "plate", "boy", "girl", "cake", "bowl", "woman", "kitchen",
        "standing", "birthday", "man", "pizza"],
        "captions": [{
            "text": "a little girl sitting at a table with a dog",
            "confidence": 0.84265453815486435
        }]
    },
    "requestId": "1a32c16f-fda2-4adf-99b3-9c4bf9e11a60",
    "metadata": {
        "height": 427,
        "width": 640,
        "format": "Jpeg"
    }
}

“A little girl sitting at a table with a dog” - pretty close! There are other options to generate more detailed results including a probability along with each tag.


While the ImageNet dataset is primarily tagged with nouns, many of these services go further and return verbs like ‘eating’, ‘sitting’, and ‘jumping’, as well as adjectives like ‘red’. Chances are these might not be appropriate for your application, and you may want to filter them out. One option is to check each tag’s linguistic type against Princeton’s WordNet, which is available in Python through the Natural Language Toolkit (NLTK). Additionally, you might want to filter out scene words like ‘indoor’ and ‘outdoor’ (often returned by Clarifai and Cognitive Services).
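A minimal sketch of such a filter follows. The is_noun predicate is pluggable; with NLTK installed you could back it with WordNet, e.g. lambda w: any(s.pos() == 'n' for s in wordnet.synsets(w)). The stop-tag list here is just an example:

```python
STOP_TAGS = {'indoor', 'outdoor'}  # scene words you may want to drop

def filter_tags(tags, is_noun):
    """Keep only noun tags, dropping scene words, verbs, and adjectives."""
    return [t for t in tags
            if t.lower() not in STOP_TAGS and is_noun(t)]
```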

Now, let’s test the same image with the Google Cloud Vision API. Get an API key from their website and use it in the code below. And rejoice, because the first 1,000 calls are free!

import base64, http.client, json

def googlecloud_tagimage(filename):
    with open(filename, 'rb') as image_file:
        encoded_string = base64.b64encode(image_file.read()).decode()

    endpoint = '/v1/images:annotate?key=REPLACE_API_KEY_HERE'
    request_body = {
        'requests': [{
            'image': {'content': encoded_string},
            'features': [{'type': 'LABEL_DETECTION', 'maxResults': 10}]
        }]
    }
    conn = http.client.HTTPSConnection('vision.googleapis.com')
    conn.request('POST', endpoint, json.dumps(request_body),
                 {'Content-Type': 'application/json'})
    response = conn.getresponse()
    print(response.read().decode())


{
  "responses": [{
    "labelAnnotations": [{
      "mid": "/m/0bt9lr",
      "description": "dog",
      "score": 0.951077,
      "topicality": 0.951077
    }, {
      "mid": "/m/06z04",
      "description": "skin",
      "score": 0.9230451,
      "topicality": 0.9230451
    }, {
      "mid": "/m/01z5f",
      "description": "dog like mammal",
      "score": 0.88359463,
      "topicality": 0.88359463
    }, {
      "mid": "/m/01f5gx",
      "description": "eating",
      "score": 0.7258142,
      "topicality": 0.7258142
    }, {
      "mid": "/m/02xl47d",
      "description": "dog breed group",
      "score": 0.6326424,
      "topicality": 0.6326424
    }, {
      "mid": "/m/039xj_",
      "description": "ear",
      "score": 0.5933325,
      "topicality": 0.5933325
    }, {
      "mid": "/m/01lrl",
      "description": "carnivoran",
      "score": 0.5818591,
      "topicality": 0.5818591
    }, {
      "mid": "/m/02wbm",
      "description": "food",
      "score": 0.55364263,
      "topicality": 0.55364263
    }, {
      "mid": "/m/0ytgt",
      "description": "child",
      "score": 0.54263353,
      "topicality": 0.54263353
    }, {
      "mid": "/m/0krfg",
      "description": "meal",
      "score": 0.53619653,
      "topicality": 0.53619653
    }]
  }]
}

Wasn’t that a little too easy? State-of-the-art results in 5 minutes, without spending 5 years on a PhD!


While these services return tags and image captions with probabilities, it’s up to the developer to decide on a threshold. Usually, 60% and 40% are good thresholds for image tags and image captions, respectively. Similarly, in your business logic, you may want to say “This image contains” if the probability is over 80%, and “This image may contain” if it’s lower.
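Wired into business logic, those thresholds might look like this (the cutoffs are the rules of thumb from the text; tune them for your application):

```python
TAG_THRESHOLD = 0.6        # rule-of-thumb display threshold for image tags
CONFIDENT_THRESHOLD = 0.8  # above this, assert; below, hedge

def describe_tag(tag, probability):
    """Turn a (tag, probability) pair into user-facing text, or None."""
    if probability < TAG_THRESHOLD:
        return None  # too uncertain to surface at all
    if probability > CONFIDENT_THRESHOLD:
        return 'This image contains %s.' % tag
    return 'This image may contain %s.' % tag
```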

Train your Own Classifier

Chances are these services were not quite sufficient to meet the requirements for your use case. Say you’ve sent a photograph to one of these services, and it responded with the tag “dog”. You might be more interested in identifying the breed of the dog. Of course, you can follow Chapter 1 to train your own classifier. But wouldn’t it be more awesome if you didn’t have to write a single line of code? Help is on the way.

A few of these cloud providers give you the ability to train your own classifier by simply using a drag-and-drop interface. While providing a pretty user interface, they are using transfer learning under the hood. Cognitive Services Custom Vision, Google AutoML, Clarifai, and IBM Watson all provide you the option to train your own custom classifier.

Let’s take Custom Vision (CustomVision.ai) from Microsoft as an example:

Figure 4-9. UI for CustomVision.ai.

Here are the high level steps:

  1. Create a project: Choose a domain that best describes your use case. For most purposes, “General” would be optimal. For more specialized scenarios, you might want to choose a relevant domain. As an example, if you have an e-commerce website with photos of products against a pure white background, you might want to select the “Retail” domain. If you intend to run your model on a mobile phone eventually, you might want to select the “Compact” version of the model instead. They are smaller in size with a slight loss in accuracy.

    Figure 4-10. Creating a new project in Custom Vision.
  2. Upload: For each category, upload images and tag them. It’s important to upload at least 30 photographs per category.

    Figure 4-11. Screenshot showing the upload images screen on CustomVision.ai. Here we upload over 30 images of Maltese dogs and tag them appropriately.
  3. Train: Hopefully, in about 3 minutes, you should have a spanking new classifier ready.

    Figure 4-12. Screenshot displaying the ‘Train’ button on the top right corner of the CustomVision.ai page.
  4. Analyze: Analyze the performance of the model. Check the precision and recall of your model. By default, the system sets the threshold at 90% confidence and gives you the precision and recall metrics at that value. If you want higher precision, increase the confidence threshold, at the expense of reduced recall. You can see an example output in Figure 4-13.

  5. Ready to go: Now you have a production-ready API endpoint which you can call from your own application!

To highlight the effect of the amount of training data on model quality, let’s train a dog-breed classifier. We’ll use the Stanford Dogs dataset, a collection of more than 100 dog categories. For simplicity, we’ll hand-pick 10 breeds that each have at least 200 images available. With 10 classes, a random classifier would have a 1/10 = 10% chance of correctly identifying an image; we should easily be able to beat this number. Let’s observe the effect of training on datasets of different sizes:

30 training images/class versus 200 training images/class
Figure 4-13. Relative precision and recall for our sample training set with 200 images per class.

Since we haven’t uploaded a test set, the performance figures reported here are computed on the full dataset using the common k-fold cross-validation technique: the data is randomly divided into k parts, (k-1) parts are used for training, and the remaining part is used for testing. This is performed several times, each time with a different subset of images, and the averaged results are reported here.
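The splitting procedure just described looks like this (a hand-rolled sketch for illustration; libraries such as scikit-learn provide a KFold helper for the same job):

```python
import random

def k_fold_splits(items, k, seed=42):
    """Yield (train, test) pairs: each of the k parts takes a turn as
    the held-out test set while the others are used for training."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```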

It is incredible that even with 30 images per class, the classifier’s precision is greater than 90%. Even more surprisingly, training took under 30 seconds.

Not only that, we can dig in and investigate the performance of each class. Classes with high precision may be visibly more distinct, while low precision for a class indicates it may look similar to another class.

Figure 4-14. Some of the possible tags returned by the API.

This short and convenient approach is not without its downsides, as we will see in the following section. We will also be discussing mitigation strategies to help leverage this rather useful tool.

Top reasons why your classifier does not work satisfactorily

  1. Not enough data: If you find that the accuracy is not quite sufficient for your needs, you might need to train the system with more data. Of course, 30 images per class gets you started, but for a production-quality application, more is better; 200 images per class is usually recommended.

  2. Non-representative training data: Often, images on the internet are far too clean: shot in studio lighting against clean backgrounds, with the subject close to the center of the frame. The images your application sees day to day may not be well represented by them. It’s really important to train your classifier with real-world images for the best performance.

  3. Unrelated domain: Under the hood, Custom Vision is running transfer learning. This makes it really important to choose the right domain when creating the project. As an example, if you are trying to classify X-ray images, transfer learning from an ImageNet based model might not yield as accurate results. For cases like that, you might want to train your own classifier, as shown in Chapter 1 and Chapter 2 (though this will probably take more than 3 minutes).

  4. Using it for regression: In machine learning, there are two common categories of problems: classification and regression. Classification predicts one or more classes for an input, whereas regression predicts a numerical value given an input; for example, predicting house prices. Custom Vision is primarily a classification system, so using it to count objects by tagging images with the number of objects is the wrong approach and will lead to unsatisfactory results.

  5. Counting objects is a type of regression problem. It can be done by localizing each instance of the object in an image (i.e., object detection) and counting the occurrences. Another example of a regression problem is predicting the age of a person from their profile photo. We will tackle both problems in future chapters.

  6. Classes are too similar: If your classes look too similar and rely heavily on fine-grained details for distinction, the model might not perform as well. For example, a US 5 dollar note and a US 20 dollar note have very similar high-level features; it’s at the lower-level details that they are really distinct. Similarly, it might be easy to distinguish between a Chihuahua and a Siberian Husky, but harder to distinguish between an Alaskan Malamute and a Siberian Husky. A fully retrained CNN, as demonstrated in Chapter 2, should perform better than this Custom Vision-based system in such cases.


A great feature of Custom Vision is that if the model is unsure about any image it encounters via its API endpoint, the web UI will show you those images for manual tagging. You can review and manually tag new images on a periodic basis and continuously improve the quality of your model. These images tend to improve the classifier the most, for two reasons: first, they represent real-world usage; second, and more importantly, they affect the model more than images it can already classify easily. This practice of prioritizing labels for the examples the model is least certain about is known as active learning.

In this section, we have discussed a few ways to improve our model’s accuracy. In the real world, though, accuracy is not the be-all and end-all of a user’s experience: how quickly you respond to a request also matters a lot. In the following section, we will cover a couple of ways to improve performance without sacrificing quality.

Performance Tuning

A photograph taken by a modern cell phone can have a resolution as high as 4000×3000 pixels and weigh upwards of 4 MB. Depending on the network quality, it can take a few seconds to upload such an image to the service, and each additional second makes the wait more frustrating for your user. Could we make this faster?

There are two ways to reduce the size of your image:

  1. Reducing resolution (or resizing)

  2. Compression


Resizing

Most convolutional neural networks take an input image of size 224×224 pixels, so much of a cell phone photo’s resolution is unnecessary for a CNN. It makes sense to downsize the image before sending it over the network, rather than sending a large image and downsizing it on the server. Since most services already downscale the incoming image to their required size, a good rule of thumb is to resize so that the shorter edge is 448 pixels.
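Applying that rule of thumb, the target dimensions are easy to compute (the actual resampling would then be done with any image library, such as Pillow's Image.resize):

```python
def resize_dimensions(width, height, target_short_edge=448):
    """New dimensions with the shorter edge reduced to the target,
    preserving aspect ratio; smaller images are left untouched."""
    short_edge = min(width, height)
    if short_edge <= target_short_edge:
        return width, height
    scale = target_short_edge / short_edge
    return round(width * scale), round(height * scale)

print(resize_dimensions(4000, 3000))  # (597, 448)
```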


While the resizing strategy works great for image classification tasks, it does not work as well for OCR, where it’s essential to send higher-resolution images. The rule of thumb for OCR: the higher the resolution, the higher the accuracy. To recognize text effectively, OCR engines need the text to be taller than a minimum height (a good rule of thumb is more than 20 pixels). Hence, as you resize the image down, you might hit a breaking point where characters suddenly stop being recognized.


Compression

The other tool in our arsenal is compression. Most image libraries perform lossy compression when saving a file, and even a little compression goes a long way toward reducing the size of an image while minimally impacting its quality. Compression does introduce noise, but CNNs are usually robust enough to deal with some of it. A good rule of thumb is to compress by 50%. This does wonders for OCR, where resizing can have an adverse impact.

Sometimes people say that the best things in life don’t come easy. We believe this chapter proves otherwise. In the following section, we will take a look at how some tech industry titans use Cloud APIs for AI to drive some very compelling scenarios.

Case Studies: Cloud APIs used across Industries


Uber

Uber uses Microsoft Cognitive Services to identify each of its 7+ million drivers in a couple of milliseconds. Imagine the sheer scale at which Uber operates its feature called “Real-Time ID Check,” which verifies that the current driver is indeed the registered driver by prompting them to take a selfie, either randomly or every time they are assigned a new rider. The selfie is compared with the driver’s photo on file, and only if the face models match is the driver allowed to continue. This security feature builds accountability by ensuring the safety of passengers and that the driver’s account is not compromised. It can even detect changes in the selfie, such as a hat, beard, or sunglasses, and then prompt the driver to take another selfie without the hat or sunglasses.

Figure 4-15. Uber Drivers app prompts the driver to take a selfie to verify the identity of the driver. Source: Microsoft Blog


Giphy
Figure 4-16. Futurama’s iconic ‘Shut up and Take My Money’

Back in 1976, when Dr. Richard Dawkins coined the term ‘meme’, little did he know it would take on a life of its own four decades later. Instead of giving a simple textual reply, we live in a generation where most chat applications suggest an appropriate animated GIF matching the context. Several applications, like Tenor, Facebook Messenger, Swype, and SwiftKey, provide search specifically for memes and GIFs. Most of them search through Giphy, the world’s largest search engine for animated memes, commonly in the GIF format.

GIFs often have text overlaid (like the dialogue being spoken), and sometimes we want to find a GIF with a particular line straight from a movie or TV show. For example, the image above, from the 2010 Futurama episode in which the ‘eyePhone’ (sic) was released, is often used to express excitement about a product or an idea. Understanding the contents makes GIFs more searchable. To make this happen, Giphy uses Google’s Vision API to recognize text and objects, aiding the search for the perfect GIF.

You may realize that tagging GIFs is a hard task, because a person would have to sift through millions of these animations and manually annotate them frame by frame. In 2017, Giphy developed two solutions to automate this process. The first was to detect text within the image; the second was to generate tags based on the objects in the image to supplement the metadata for their search engine. This metadata is stored and searched using Elasticsearch to build a scalable search engine.

For text detection, they ran the OCR service of the Google Vision API on the first frame of each GIF to confirm whether the GIF actually contained text. If the API replied in the affirmative, Giphy would send the subsequent frames, receive their OCR-detected text, and compare the text across frames; that is, determine whether the text was static (remaining the same throughout the duration of the GIF) or dynamic (different text in different frames). For generating class labels corresponding to objects in the image, they had two options, both available in the Google Vision API: label detection and web entities. Label detection, as the name suggests, provides the actual class name of the object. Web entities provides an entity ID (which may be referenceable in the Google Knowledge Graph), a unique web identifier for identical and similar images seen elsewhere on the net. Using these additional annotations increased the click-through rate (CTR) of the new system by 32%. Medium-to-long-tail searches (i.e., less frequent searches) benefited the most, becoming richer with relevant content as the extracted metadata surfaced previously unannotated GIFs that would have otherwise remained hidden. Additionally, this metadata, along with users’ click-through behavior, provides the data to build similarity and deduplication features.
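The static-versus-dynamic decision can be sketched in a few lines. The following is our own reconstruction, not Giphy’s actual code; it assumes the OCR text for each sampled frame has already been collected (for example, from the Vision API’s text detection endpoint):

```python
def classify_gif_text(frame_texts):
    """Given OCR text for sampled frames, decide whether the GIF has no
    text, static text (same in every frame), or dynamic text."""
    texts = [t.strip() for t in frame_texts]
    if not any(texts):
        return "no_text"
    # If every non-empty frame carries the same string, the caption is static.
    distinct = {t for t in texts if t}
    return "static" if len(distinct) == 1 else "dynamic"
```

A GIF whose every frame OCRs to “SHUT UP AND TAKE MY MONEY” would be classified as static text, so the caption can be indexed once rather than per frame.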


OmniEarth is a Virginia-based company that specializes in collecting, analyzing, and combining satellite and aerial imagery with other datasets to track water usage across the country, scalably and at high speed. They are able to scan all 144 million parcels of land in the US within hours. Internally, they use the IBM Watson Visual Recognition API to classify images of land parcels for valuable information, such as how green a parcel is. Combining this classification with other data points, like temperature and rainfall, they can predict how much water was used to irrigate a field.

For residential properties, they infer data points from the image, such as the presence of a pool, trees, or irrigable landscaping, to predict the amount of water usage. They have even predicted where water was being wasted through practices like overwatering, or through leaks. OmniEarth helped the state of California understand water consumption by analyzing over 150,000 parcels of land, and devised an effective strategy to curb water waste.

Figure 4-17. OmniEarth’s dashboard. Source: https://www.eagleview.com/ (OmniEarth is now part of EagleView)


Photobucket is a popular online image and video hosting community where over two million images are uploaded every day. Using Clarifai’s Not Safe for Work (NSFW) model, Photobucket automatically flags unwanted or offensive user-generated content and sends it for further review to their human moderation team. Previously, the human moderation team was able to monitor only about 1% of the incoming content. About 70% of the flagged images turned out to be unacceptable content. Compared to the previous manual efforts, Photobucket identified 700x more unwanted content, thus cleaning up the website and creating a better user experience. This automation also helped discover two child pornography accounts, which were reported to the FBI.
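A moderation pipeline like this boils down to a simple routing decision on the model’s confidence score. Here is a hypothetical sketch, in the spirit of Photobucket’s setup rather than their actual code; the threshold value and the shape of the score are assumptions (Clarifai’s real API returns per-concept confidences):

```python
REVIEW_THRESHOLD = 0.85  # assumed value; in practice tuned against moderator feedback

def route_image(nsfw_score, threshold=REVIEW_THRESHOLD):
    """Auto-approve low-risk images; queue everything else for human review.

    nsfw_score is the model's confidence (0.0-1.0) that the image is NSFW.
    """
    return "human_review" if nsfw_score >= threshold else "auto_approve"
```

The point of the threshold is to concentrate scarce human attention: moderators see only the small, high-risk slice of uploads instead of a random 1% sample.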


E-commerce stores like Staples often rely on organic search engine traffic to drive sales. One way to rank highly in search engines is to put descriptive image tags in the ALT text field of each image. Staples Europe, which serves twelve different languages, found tagging product images and translating keywords to be an expensive proposition, a task traditionally outsourced to human agencies. Fortunately, Clarifai provides tags in twenty languages at a much cheaper rate, saving them five-figure costs. Using these relevant keywords led to an increase in traffic and eventually increased sales through their e-commerce store, thanks to a surge of visitors to the product pages.

InDro Robotics

This Canadian drone company uses Microsoft Cognitive Services to power search and rescue operations, not only during natural disasters, but also to proactively detect emergencies. They use Custom Vision to train models specifically for identifying objects such as boats and life vests in water, and use this information to notify control stations. These drones are able to scan much larger spans of ocean on their own than lifeguards can. This automation alerts lifeguards to emergencies, improving speed of discovery and saving lives in the process.
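The alerting logic on top of such a detector can be sketched simply. This is an illustrative example, not InDro’s actual code; the label names are hypothetical, and we assume the detector (such as an Azure Custom Vision model) returns a list of (label, confidence) pairs per frame:

```python
ALERT_LABELS = {"person_in_water", "life_vest"}  # hypothetical label names

def should_alert(detections, min_confidence=0.7):
    """Return True when any alert-worthy object is detected confidently.

    detections is a list of (label, confidence) pairs from an object detector.
    """
    return any(label in ALERT_LABELS and confidence >= min_confidence
               for label, confidence in detections)
```

Keeping a minimum confidence threshold avoids flooding the control station with false alarms from waves, debris, and other low-confidence detections.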

Figure 4-18. Detections made by InDro Robotics.

Australia has started using drones from other companies, coupled with inflatable pods, to respond until help arrives. Soon after deployment, these pods saved two teenagers stranded in the ocean. Australia is also utilizing drones to detect sharks so that beaches can be vacated. It’s easy to foresee the tremendous value these automated, custom training services can bring.

Figure 4-19. Drone sees two teenagers stranded. Source [https://www.youtube.com/watch?v=07FA8bAV1-k]
Figure 4-20. Drone releases the inflatable pods that the teenagers cling onto. Source [https://www.youtube.com/watch?v=07FA8bAV1-k]


In this chapter, we explored various cloud APIs for computer vision, first qualitatively comparing the breadth of services offered and then quantitatively comparing their accuracy and price. We saw that with just a short code snippet, you can get started using these APIs in under five minutes. Since one model doesn’t fit all use cases, we trained a custom classifier using a drag-and-drop interface. Finally, we discussed compression and resizing recommendations to speed up image transmission, and how they affect different tasks. To top it all off, we examined how companies across industries use these cloud APIs to build real-world applications. Congratulations on making it this far! In the next chapter, we will see how to deploy our own inference server for custom scenarios.
