Chapter 4. Computer Vision

Some people call this artificial intelligence, but the reality is this technology will enhance us. So instead of artificial intelligence, I think we’ll augment our intelligence.

Ginni Rometty, Executive Chairman of IBM

Take a moment and look up from this book. Examine the room around you and take a quick inventory of what you see. Perhaps a desk, some chairs, bookshelves, and maybe even your laptop. Identifying these items is an effortless process for a human, even a young child.

Speaking of children, it’s quite easy to teach them the difference between multiple objects. Over time, parents show them items or pictures and then repeat the name or description. Show them a picture of an apple, and then repeat the word apple. In the kitchen, hand them an apple, and then repeat the word apple. Eventually, through much repetition, the child learns what an apple is along with its many color and shape variations—red, green, yellow. Over time, we provide information as to what is a correct example and what isn’t. But how does this translate to machines? How can we train computers to recognize patterns visually, as the human brain does?

Training computer vision models is done in much the same way as teaching children about objects. Instead of a person being shown physical items and having them identified, however, the computer vision algorithms are provided many examples of images that have been tagged with their contents. In addition to these positive examples, negative examples are also added to the training. For example, if we’re training for images of cars, we may also include negative examples of airplanes, trucks, and even boats.

Giving computers the ability to see opens up a world of possibilities for many industries—from recognizing human faces for security to automating processes that would take a human days, if not weeks. Let’s take one industry as a quick example—insurance. Using computer vision to quickly, accurately, and objectively receive an automated analysis of images—for instance, of a fender bender or a weather-damaged roof—allows insurance companies to deliver better service to their clients.

Machine learning is pattern recognition through learned examples. Nothing exemplifies this more than computer vision. Just as NLP provides a core functionality of AI in the enterprise, computer vision extends a machine’s ability to recognize and process images.

Capabilities of Computer Vision for the Enterprise

So just what can computer vision do for our applications? Before diving into the details of computer vision and how to use it in the enterprise, let’s examine some of the specific capabilities computer vision can bring to your applications.

Image Classification and Tagging

Computer vision’s most fundamental capability, general image tagging and classification, allows users to understand an image’s content. While you’ll often see the terms image classification and tagging used interchangeably, it’s best to think of classification as assigning an image to one or more categories. In contrast, tagging assigns one or more descriptive words to the image. When an image is processed, various keyword tags or classes are returned, each describing the image with a different confidence level. Based on an application’s needs, these can be used to identify the image contents. For example, you may need to find images with the contents of “male playing soccer outside” or organize images into visual themes such as cars, sports, or fruits.

Words, phrases, and other text are frequently part of images. In our day-to-day lives, we come across street signs, documents, and advertisements. Humans see the text, read it, and comprehend it rather easily. For machines, this is an entirely different challenge. Via optical character recognition (OCR), computers can extract text from an image, enabling a wide range of potential applications. From language translation to mobile apps that assist the visually impaired, computer vision algorithms equip users to pull words out of a picture into readily usable text in applications.

In addition to general image tagging, computer vision can be used for more specific tasks. Some of the more common are the ability to detect logos and food items in images. Another frequent application of computer vision is in facial detection. With training, computer vision algorithms allow developers to recognize faces, sometimes getting even more specialized to detect celebrities.

Object Localization

Another capability of computer vision is object localization. Sometimes your application’s requirements will include not just classifying what is in the image, but also understanding the position of the particular object in relation to everything else. This is where object localization comes into play. Localization finds a specific object’s location within an image, displaying the results as a bounding box around the specified object. Similarly, object detection then identifies a varying number of objects within the image. An example is the ability to recognize faces or vehicles in an image. Figure 4-1 shows an example of object detection with dogs.

There are some challenges associated with object localization, however. Often, objects in an image overlap, making it difficult to ascertain their specific boundaries. Another challenge is visually similar items. When the colors or patterns match their background in an image, it can again be difficult to determine the objects.

Figure 4-1. Object detection
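To make the overlap challenge concrete, one common way to quantify how much two bounding boxes overlap is intersection over union (IoU). The following is a minimal sketch only: the `Box` type and its (left, top, right, bottom) pixel convention are assumptions made for illustration, not the output format of any particular service.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in pixel coordinates (hypothetical format)."""
    left: float
    top: float
    right: float
    bottom: float


def area(box: Box) -> float:
    """Area of a box; boxes with no width or height count as zero."""
    return max(0.0, box.right - box.left) * max(0.0, box.bottom - box.top)


def iou(a: Box, b: Box) -> float:
    """Intersection over union: 0.0 means no overlap, 1.0 means identical boxes."""
    overlap = Box(
        left=max(a.left, b.left),
        top=max(a.top, b.top),
        right=min(a.right, b.right),
        bottom=min(a.bottom, b.bottom),
    )
    intersection = area(overlap)
    union = area(a) + area(b) - intersection
    return intersection / union if union > 0 else 0.0


# Two made-up detections of dogs standing close together.
dog_a = Box(left=0, top=0, right=100, bottom=100)
dog_b = Box(left=50, top=0, right=150, bottom=100)
print(iou(dog_a, dog_b))
```

A high IoU between two detections suggests that the objects overlap heavily (or that the same object was detected twice), which is exactly the situation in which boundaries become hard to separate.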

Custom Classifiers

Most of the time, you don’t need to recognize everything. If you’re looking to identify or classify only a small set of objects, custom classifiers could be the right tool. Most of the large third-party platforms provide some mechanism for building custom visual classifiers, allowing you to train the computer vision algorithms to recognize specific content within your images. Custom classifiers extend general tagging to meet your application’s particular needs to identify your visual content. They primarily exist to gain higher accuracy by reducing the search space of your visual classifier.

At a high level, when creating a custom classifier, you’ll need to have a collection of images that are identified as positive and negative examples. For example, if you were training a custom classifier on fruits, you’d want to have positive training images of apples, bananas, and pears. For negative examples, you could have pictures of vegetables, cheeses, or meats (see Figure 4-2).

Figure 4-2. Creating a custom classifier
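As a rough illustration of how such a training set might be organized before it is handed to a service, the sketch below pairs each image with a label. The file names, the "negative" label, and the `build_manifest` helper are all hypothetical; each platform defines its own upload format.

```python
from typing import Dict, List, Tuple

NEGATIVE_LABEL = "negative"  # hypothetical label for counterexamples


def build_manifest(
    positives: Dict[str, List[str]], negatives: List[str]
) -> List[Tuple[str, str]]:
    """Return (image_path, label) pairs for positive classes and negative examples."""
    manifest: List[Tuple[str, str]] = []
    for label, paths in positives.items():
        if not paths:
            raise ValueError(f"class {label!r} has no training images")
        manifest.extend((path, label) for path in paths)
    manifest.extend((path, NEGATIVE_LABEL) for path in negatives)
    return manifest


# Made-up file names for the fruit classifier described above.
manifest = build_manifest(
    positives={
        "apple": ["fruit/apple_01.jpg", "fruit/apple_02.jpg"],
        "banana": ["fruit/banana_01.jpg"],
        "pear": ["fruit/pear_01.jpg"],
    },
    negatives=["other/carrot_01.jpg", "other/cheddar_01.jpg"],
)
for path, label in manifest:
    print(label, path)
```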

Most computer vision uses deep learning algorithms (discussed in “Deep Learning”), specifically convolutional neural networks (CNNs). If you’re interested in building CNNs from scratch and going more in-depth with deep learning for computer vision, there are a wide variety of resources available. We recommend Andrew Ng’s Deep Learning specialization on Coursera as well as fast.ai. Additionally, the following resources dive more in-depth into relevant computer vision topics:

In the next section, we’ll start to look at how to use computer vision in enterprise applications.

How to Use Computer Vision

When discussing how to use computer vision, we’ll look at an example of the most common output—general tagging. As covered earlier, general tagging in computer vision returns to the user the overall items contained in the subject image. Frequently, depending on the algorithm or service used, the confidence levels are also returned. Based on your prototypes and testing, you can then use these scores to set your own thresholds for applying the returned tags based on your application’s needs.

To demonstrate computer vision’s tagging, we’ll use the IBM Watson Visual Recognition service. Let’s start by sending the image in Figure 4-3 to the API.

Figure 4-3. Fireworks over a harbor

We’ll make a simple request using curl to the API, referencing our image as the image URL parameter:

curl "https://gateway-a.watsonplatform.net/visual-recognition/api/v3/classify?api_key=YOURAPIKEY&url=https://visual-recognition-demo.ng.bluemix.net/images/samples/7.jpg&version=2016-05-20"

Submitting this API request returns the following JSON:

{
  "classes": [
    {
      "class": "harbor",
      "score": 0.903,
      "type_hierarchy": "/shelter/harbor"
    },
    {
      "class": "shelter",
      "score": 0.903
    },
    {
      "class": "Fireworks",
      "score": 0.558
    },
    {
      "class": "waterfront",
      "score": 0.5
    },
    {
      "class": "blue color",
      "score": 0.9
    },
    {
      "class": "steel blue color",
      "score": 0.879
    }
  ],
  "classifier_id": "default",
  "name": "default"
}

As you can see from these JSON results, the API returns a set of classes, each with a confidence score indicating how strongly it believes the image contains that class. If we look at the picture again and compare it with the returned keywords, the essential elements of the image are returned with high confidence. To improve these scores for groups of images, you’d need to train the computer vision model using custom classifiers. We’ll cover this more in an upcoming section, but in essence it means providing the algorithm with many images of both positive and negative examples for training. A useful demo of this service can be found online.
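To show how confidence scores might be turned into usable tags, here is a minimal sketch that parses the response above and keeps only the classes that meet a threshold. The `tags_above` helper and the 0.8 cutoff are assumptions chosen for illustration; the right threshold depends on your own prototypes and testing.

```python
import json

# The response shown above, trimmed to the fields this sketch uses.
response_text = """
{
  "classes": [
    {"class": "harbor", "score": 0.903},
    {"class": "shelter", "score": 0.903},
    {"class": "Fireworks", "score": 0.558},
    {"class": "waterfront", "score": 0.5},
    {"class": "blue color", "score": 0.9},
    {"class": "steel blue color", "score": 0.879}
  ]
}
"""


def tags_above(response: dict, threshold: float) -> list:
    """Return class names whose score meets the threshold, highest score first."""
    kept = [c for c in response["classes"] if c["score"] >= threshold]
    kept.sort(key=lambda c: c["score"], reverse=True)
    return [c["class"] for c in kept]


response = json.loads(response_text)
print(tags_above(response, threshold=0.8))  # 0.8 is an arbitrary example cutoff
```

With a cutoff this strict, "Fireworks" (0.558) would be dropped even though it is clearly in the picture, which is why thresholds are best tuned against your own images rather than fixed in advance.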

Computer Vision on Mobile Devices

Another fascinating application of computer vision is that algorithms may run locally on mobile devices and IoT devices. Machine learning is now able to run on devices without necessarily having to connect to the cloud. Without sending and receiving data from the cloud, processing speed is increased, and security is improved.

There are some pros and cons associated with this approach, however. Let’s start with the positive aspects of running computer vision locally. First, the speed and refresh rate of object detection are significantly faster than sending and receiving data from the cloud. Specifically, not needing to wait for the cloud to respond to a request speeds the process considerably. This is quite important for consumer products, such as virtual reality (VR) and augmented reality (AR), where our eyes are very much used to an immediate reaction. Quick response times also make on-device processing feel more natural in applications that answer a user’s queries and questions. Privacy is another benefit, as all computation is done locally, and no personal or private data is sent to remote servers. If the application needs data analysis, it only has to send the inference result rather than the actual data, thus preserving privacy.

Some examples of areas where running computer vision locally is advantageous include language translation, drones, autonomous vehicles, the self-monitoring of industrial IoT devices, and the ability to diagnose health issues without sending private data to the cloud.

While these are some compelling reasons to use computer vision locally on mobile devices, there are some drawbacks to discuss as well. For mobile applications, battery life and resource allocation are critical issues. Running computer vision locally uses much more computing power and requires optimization, so this is a factor to take into consideration when building your application. Additionally, deploying computer vision applications locally is more challenging, although it is quickly becoming much easier. These days, you can even run a detection and classification system on a Raspberry Pi.

However, there is a hybrid approach that solves some of these issues: using a local model for simple object detection such as “Is that a dog?” and then sending the image to the cloud to determine the breed of dog. Another example of this hybrid approach is detecting a product while a user is in a grocery store, but then sending the data to the cloud to retrieve the actual pricing information.
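The hybrid pattern can be sketched as a cheap local check that gates a more expensive cloud call. Everything named here is hypothetical: `identify_breed`, the 0.7 confidence cutoff, and the two stand-in functions that take the place of a real on-device model and a real cloud API.

```python
from typing import Callable, Optional


def identify_breed(
    image: bytes,
    local_is_dog: Callable[[bytes], float],
    cloud_breed: Callable[[bytes], str],
    min_confidence: float = 0.7,
) -> Optional[str]:
    """Run the on-device check first; call the cloud only when it passes."""
    if local_is_dog(image) < min_confidence:
        return None  # not confident it is a dog, so no network request is made
    return cloud_breed(image)


# Stand-ins for a real on-device model and a real cloud service.
def fake_local_model(image: bytes) -> float:
    return 0.95 if image.startswith(b"dog") else 0.05


def fake_cloud_service(image: bytes) -> str:
    return "Labrador"


print(identify_breed(b"dog-photo-bytes", fake_local_model, fake_cloud_service))
print(identify_breed(b"cat-photo-bytes", fake_local_model, fake_cloud_service))
```

Passing the two models in as functions keeps the gating logic independent of any specific framework or vendor, so either side can be swapped out during prototyping.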

The key to running models locally is model compression. You might need a large model in the cloud if you’re trying to recognize everything in the world, but the model can be much smaller if you’re merely interested in recognizing your family members’ faces. There are now ways to reduce the number of parameters in visual recognition models by orders of magnitude and thus effectively reduce their size.
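This chapter doesn’t single out a particular compression method, so as one illustrative possibility, here is a sketch of magnitude pruning, a widely used idea in which the weights closest to zero are removed. It operates on a plain list of made-up numbers; the `prune_by_magnitude` helper is hypothetical, and real systems prune tensors inside a deep learning framework and typically retrain afterward.

```python
from typing import List


def prune_by_magnitude(weights: List[float], keep_fraction: float) -> List[float]:
    """Zero out all but the largest-magnitude weights, keeping about keep_fraction."""
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")
    keep_count = round(len(weights) * keep_fraction)
    # Rank weight positions by absolute value, largest first.
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    keep = set(ranked[:keep_count])
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, 0.0, -0.2]  # made-up values
pruned = prune_by_magnitude(weights, keep_fraction=0.5)
print(pruned)
print(sum(1 for w in pruned if w != 0.0), "of", len(weights), "weights remain")
```

Note that zeroing weights only shrinks a model on disk or in memory when the result is stored in a sparse or otherwise compressed format.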

Best Practices

Next, we’ll look at some best practices in developing applications with computer vision. We’ll discuss what makes good training images shortly, but a general rule of thumb is that the more high-quality images you can provide for training, the better the results.

The accuracy of custom classifiers depends on the quality of your training data as well as the process. Data from IBM Watson has shown that clients “who closely controlled their training processes have observed greater than 98% accuracy for their use cases.” This accuracy was based on the ground truth for their classification problem and data set.
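To clarify what accuracy against ground truth means, the following sketch compares a classifier’s predictions with known-correct labels for a held-out test set. The labels are made up for illustration, and the `accuracy` helper is our own, not part of any service.

```python
from typing import Sequence


def accuracy(predicted: Sequence[str], ground_truth: Sequence[str]) -> float:
    """Fraction of predictions that match the ground-truth labels."""
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must be the same length")
    if not ground_truth:
        raise ValueError("at least one labeled example is required")
    correct = sum(p == t for p, t in zip(predicted, ground_truth))
    return correct / len(ground_truth)


# Made-up labels for five held-out test images.
truth = ["apple", "apple", "pear", "banana", "negative"]
preds = ["apple", "pear", "pear", "banana", "negative"]
print(accuracy(preds, truth))
```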

Quality Training Images

Let’s now take a quick look at some guidelines for good training images. The images in your training and testing sets should ideally resemble each other in as many ways as possible. For example, be sure the two sets cover a similar range of angles, lighting conditions, distances to the subject, and subject sizes.

Also, make sure the training images represent what the test image will show. For example, if your test images show baskets of apples, then a close-up of a single apple would not be a good training image (Figures 4-4 and 4-5). Just because there’s an apple in the picture doesn’t mean it meets the criteria. Instead, you’d want many varying images of baskets of apples.

Figure 4-4. A single apple on a table (photo courtesy of adrianbartel, Creative Commons License 2.0)
Figure 4-5. Multiple apples in a basket (photo courtesy of Mike Mozart, Creative Commons License 2.0)

Use Cases

Now that we’ve discussed computer vision in some detail, let’s look at some industry examples and use cases.

Satellite Imaging

When a drought in California reached a crisis level in April 2015, Governor Jerry Brown issued the state’s first mandatory water restrictions. All cities and towns were instructed to reduce water usage by 25% in 10 months. Achieving this required measures more effective than just asking residents to use less water. Specifically, the state needed to run ad campaigns targeted at property owners using more water than necessary. Unfortunately, the government didn’t even have water consumption data on such a granular level.

Scientists at OmniEarth came up with the idea of analyzing aerial images to identify these property owners. They first trained IBM Watson’s Visual Recognition service on a set of aerial images containing individual homes with different topographical features, including pools, grass, turf, shrubs, and gravel. They then fed a massive amount of similar aerial imagery to Watson for classification. Partnering with water districts, cities, and counties, the scientists at OmniEarth could then quickly identify precisely which land parcels needed to reduce water consumption, and by how much. For example, they identified swimming pools in 150,000 parcels in just 12 minutes.

Armed with this knowledge, OmniEarth helped water districts make specific recommendations to property owners and governments. Such proposals included replacing a patch or percentage of turf with mulch, rocks, or a more drought-tolerant species, or draining and filling a swimming pool less frequently.

Video Search in Surveillance and Entertainment

The proliferation of cameras in recent years has led to an explosion in video data. Though videos contain numerous insights, these are hard to extract using computers. In many cases, such as home surveillance videos, the only viable solution is still human monitoring. That is, a human sits in an operations center 24/7, watching screens and raising an alert when something happens, or reviewing dozens or even hundreds of hours of past footage to identify key events.

BlueChasm is a company looking to tackle this problem using computer vision. The founder, Ryan VanAlstine, believes that, if the approach is successful, video can be a new form of sensor where traditional detection and human inspection fail. BlueChasm’s product, VideoRecon, can watch and listen to videos, identifying key objects, themes, or events within the footage. It then tags and timestamps those events and returns the metadata to the end user.

The industry that the company plans to focus on first is law enforcement. Ryan explains: “Imagine you work in law enforcement, and you knew there was a traffic camera on a street where a burglary occurred last night. You could use VideoRecon to review the footage from that camera and create a tag whenever it detected a person going into the burgled house. Within a few minutes, you would be able to review all the sections of the video that had been tagged and find the footage of the break-in, instead of having to watch hours of footage yourself.” Once a video is uploaded, IBM Watson’s Visual Recognition is used to analyze the video footage and identify vehicles, weapons, and other objects of interest.

BlueChasm is also looking to help media companies analyze movies and TV shows. It wants to track the episodes, scenes, and moments where a particular actor appears, a setting is shown, or a line of dialog is spoken. This metadata could help viewers find a classic scene from a show they’d like to rewatch by simply typing in some search terms and navigating straight to the relevant episode. More generally, for any organization that manages an extensive video archive, the ability to search and filter by content is an enormous time-saver.

Additional Examples: Social Media and Insurance

Just as NLP was found to be incredibly useful in social media and content discovery, computer vision proves to be similarly beneficial. For example, a company called Ampsy uses computer vision to grab all the historical images shared by an audience on social media to retrieve a list of activities, interests, people, and places for an influencer. Additionally, Ampsy uses custom visual classifiers to train on specific corporate logos. It can then detect all the images in a particular event where the advertiser’s logos are captured. The company then offers its users the capability to search for their logos in the collected images.

Similarly, social media analytics company iTrend uses general tagging to understand the content within images gathered from social media, blogs, and live streaming feeds. The company claims that it can analyze 20–50 times more data than its competition and then provide up to 80% of its findings as actionable insights.

Finally, another industry using computer vision in innovative ways is insurance. Primarily used in processing claims, visual recognition techniques and augmented reality are used by insurance companies to make the claims process much faster, decreasing costs and increasing customer satisfaction. They’re accomplishing this through applications that allow customers to take multiple photos of accident and vehicle damage and send these to the insurer’s servers for processing, eliminating much of the manual human processing that was previously required. According to Accenture, 82% of insurers believe AI-driven automation will be embedded in every aspect of their business over the next five years. Additionally, 35% of insurers report more than 15% in cost savings from this automation over the past two years. This is an industry using AI and computer vision techniques to their fullest potential to improve their business.

Existing Challenges in Computer Vision

In Chapter 2 on NLP, we discussed the challenges of extracting meaningful information from text, and doing the same for images is just as challenging. Let’s forget about recognizing tens of thousands of objects, a complex classification problem, and instead examine a much simpler case of just identifying cats versus dogs in an image.

If you didn’t have experience in machine learning before approaching this problem, you might start by writing an algorithm that separates the animals by color. But what if the pictures are black and white? Maybe you’d then try to separate them by color and texture. But what if it’s a Golden Retriever instead of a Labrador? You get the point—building a successful visual classification system requires more than a rote list of rules, because you can always find exceptions that break them. The correct approach is to develop an algorithm that will automatically generalize to a set of rules based on a large enough data set. Over time, as you train the vision model by providing more positive and negative examples to learn from, the algorithm improves, ultimately providing an incredibly useful method for identifying objects within images.

Similar to our discussion of NLP challenges, several applications of computer vision still pose considerable challenges to implementation. First is facial recognition. While computer vision is very capable of face detection (detecting the presence of faces), face recognition (identifying individuals) does not come easily out of the box. Large data sets for training on individual faces are needed for this to be an effective process.

Another area of concern is detecting details. If you want to classify an image based on a small section of it or the details scattered within it, this tends to be challenging. Some services analyze an entire image when training, so they may struggle on classifications that depend on small details. A solution to this problem is breaking down the image into smaller pieces or zooming into the image’s relevant parts.
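As a sketch of the “smaller pieces” idea, the code below computes crop boxes that cover an image with overlapping tiles, so each tile can be classified on its own. The `tile_boxes` helper, the tile size, and the overlap are assumptions chosen for illustration; the overlap is there so that a small detail sitting on a tile boundary still appears whole in at least one tile.

```python
from typing import List, Tuple

Crop = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def _starts(length: int, tile: int, step: int) -> List[int]:
    """Start offsets along one axis, stopping once a tile reaches the edge."""
    positions = []
    pos = 0
    while True:
        positions.append(pos)
        if pos + tile >= length:
            break
        pos += step
    return positions


def tile_boxes(width: int, height: int, tile: int, overlap: int = 0) -> List[Crop]:
    """Return crop boxes covering the whole image; edge tiles may be smaller."""
    if tile <= 0 or not 0 <= overlap < tile:
        raise ValueError("tile must be positive and overlap must be in [0, tile)")
    step = tile - overlap
    return [
        (left, top, min(left + tile, width), min(top + tile, height))
        for top in _starts(height, tile, step)
        for left in _starts(width, tile, step)
    ]


boxes = tile_boxes(width=1000, height=600, tile=400, overlap=100)
for box in boxes:
    print(box)
```

Each box uses the (left, upper, right, lower) ordering that imaging libraries such as Pillow accept for cropping, so the tiles could be cut out and submitted for classification individually.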

Implementing a Computer Vision Solution

As with the other AI solutions discussed in this book, you’ll always face the decision of build versus buy when determining an implementation. As in previous chapters, we’ll look briefly at some of the popular open source options and the SaaS services available.

As always, you need to try these yourself via testing and prototyping to see which meets your needs, including but not limited to price, accuracy, and the available code libraries.

Regarding open source, two of the primary options for building applications using computer vision are OpenCV and SimpleCV. They both provide access to computer vision, with OpenCV being a library available for many programming languages and SimpleCV a framework incorporating several libraries written in Python.

In addition to IBM Watson Visual Recognition, other companies are providing similar computer vision services as APIs. These include Clarifai, Google Cloud Vision, Azure Cognitive Services, and Amazon Rekognition. Each service has its pros and cons, so we again recommend testing and building prototypes for those that appear to meet your needs to find the proper fit.

Summary

Computers “see” very differently than humans, and almost always within the context of how they are trained. The computer vision topics covered in this chapter have hopefully inspired you to use vision as a powerful tool to increase the utility of your enterprise applications. As we’ve discussed, numerous industries are already taking advantage of computer vision with great success. How might you incorporate computer vision in your applications?

Now that we’ve covered some of the significant use cases of AI in the enterprise (NLP, chatbots, and computer vision), let’s take a look at our data and how it gets processed in AI data pipelines.
