Chapter 4. Vision
We live in a world of objects, but identifying them in pictures today can be challenging. Digital images are represented as arrays of pixels and color values with no data describing the objects that those pixels represent. However, advancements in machine learning on images are removing this barrier by providing powerful tools for extracting meaning and information from these pixels.
The Cognitive Services Vision APIs provide operations that take image data as input and return labeled content you can use in your app, whether it’s text from a menu, the expression on someone’s face, or a description of what’s going on in a video. These same services are used to power Bing’s image search, extract optical character recognition (OCR) text from images in OneNote, and index video in Azure Streams, making them tried and tested at scale.
The Vision category includes six services: Computer Vision, Custom Vision, Face, Form Recognizer, Ink Recognizer, and Video Indexer. We will provide a brief introduction to each.
Computer Vision
Computer Vision provides tools for analyzing images, enabling a long list of insights including detection of objects, faces, color composition, tags, and landmarks. Behind the APIs is a set of deep neural networks trained to perform functions like image classification, scene and activity recognition, celebrity and landmark recognition, OCR, and handwriting recognition.
Many of the computer vision tasks are provided by the Analyze Image API, which supports the most common image recognition scenarios. When you make a call to the different endpoints in the API namespace, the appropriate neural network is used to classify your image. In some cases, this may mean the image passes through more than one model, first to recognize an object and then to extract additional information.
Bundling all these features into one operation means you can make one call and accomplish many tasks. For example, using a picture of a shelf in a supermarket you can identify the packaging types on display, the brands being sold, and even whether the specific products are laid out in the right order (something that is often both time-consuming and expensive to audit manually).
The Analyze Image API attempts to detect and tag various visual features, marking detected objects with a bounding box. The tasks it performs include:
- Tagging visual features
- Detecting objects
- Detecting brands
- Categorizing images
- Describing images
- Detecting faces
- Detecting image types
- Detecting domain-specific content
- Detecting color schemes
- Generating thumbnails
- Detecting areas of interest
The process of working with an API through an SDK is much the same for every API. Using version 5 of the C# SDK, do the following:
- Create a client, specifying your subscription key and endpoint:

  ComputerVisionClient computerVision = new ComputerVisionClient(
      new ApiKeyServiceClientCredentials("<Your Subscription Key>"))
  {
      Endpoint = "<Your Service Endpoint>"
  };
- Choose the features you want to analyze in the image:

  private static readonly List<VisualFeatureTypes> features =
      new List<VisualFeatureTypes>()
  {
      VisualFeatureTypes.Categories, VisualFeatureTypes.Description,
      VisualFeatureTypes.Faces, VisualFeatureTypes.ImageType,
      VisualFeatureTypes.Tags
  };
- Call the API:

  ImageAnalysis analysis = await computerVision.AnalyzeImageAsync(
      "http://example.com/image.jpg", features);
- Extract the response information. Here we extract the caption, but many other features are also returned:

  Console.WriteLine(analysis.Description.Captions[0].Text + "\n");
Tagging Visual Features
Tagging an image is one of the most obvious uses of the Computer Vision service. This functionality provides an easy way to extract descriptors of the image that can be used later by your application. By providing many different tags for each image, you can create complex indexes for your image sets that can then be used, for example, to describe the scene depicted or find images of specific people, objects, or logos in an archive.
To use this feature, you need to upload a still image or provide a link to an image. The API returns a JSON document that contains a list of recognized objects, along with a confidence score for each. For example, an excerpt of the tags from the response for a picture of a home with a lawn (Figure 4-1) will look something like this:
"tags": [ { "name": "tree", "confidence": 0.9999969005584717 }, { "name": "grass", "confidence": 0.9999740123748779 } ]
The names of the objects are easy enough to extract, and you can use the confidence score as a cutoff to define when to apply a tag (or when to show the tag to your users). The threshold choice is up to you and your specific use case. We suggest using a high threshold to avoid false positives and poor matches cluttering up the tags and search results.
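To make the cutoff concrete, here is a minimal sketch that filters the tags returned by the AnalyzeImageAsync call shown earlier. The 0.9 threshold is an arbitrary example value, and the Tags, Name, and Confidence member names reflect the C# SDK's ImageAnalysis model; verify them against the SDK version you install.

  // Keep only high-confidence tags before surfacing them to users.
  // 0.9 is an example cutoff; tune it for your own precision/recall needs.
  const double tagThreshold = 0.9;

  foreach (var tag in analysis.Tags)
  {
      if (tag.Confidence >= tagThreshold)
      {
          Console.WriteLine($"{tag.Name} ({tag.Confidence:P1})");
      }
  }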
When you call the /tag endpoint, tags that could have multiple meanings may include a hint to scope the tag to a usage. When a picture of a cyclist is tagged “riding,” the hint will note that the domain is sport (rather than geography, to avoid confusion with the Ridings, which are areas of Yorkshire), for example.
You may want to add code to convert the image tags into different terms that are more specific to your application before showing them to users, or at least go beyond a basic list structure.
Object Detection
Like the tagging API, the object detection API takes an image or an image URL and returns a JSON document with a list of detected objects, which in this case are accompanied by bounding box coordinates. The coordinates let you understand how objects are related. For example, you can determine if a cup is to the right or the left of a vase. You can also see how many instances of an object there are in a picture: unlike with the tagging API, which just returns “truck” even if there are multiple trucks in a picture, with the object detection API you get the location of each one. There are some limitations to be aware of, however; for example, it’s not possible to detect small objects or objects that are close together.
You call the object detection API via the Analyze Image API by setting the query type to "objects" in the visualFeatures request parameter, or via the standalone /detect endpoint. Here’s an excerpt of the JSON response for one of the objects in Figure 4-2:
"objects": [ { "rectangle": { "x": 1678, "y": 806, "w": 246, "h": 468 }, "object": "vase", "confidence": 0.757, "parent": { "object": "Container", "confidence": 0.759 } }, ]
As with the tagging API, the service returns hints to put classifications in context, in this case showing that a “vase” is a “container.”
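As an illustration of working with the bounding boxes, here is a hedged sketch that checks whether a detected cup sits to the left of a detected vase. It assumes the C# SDK's Objects, ObjectProperty, and Rectangle (X, Y, W, H) members, that VisualFeatureTypes.Objects was included in the requested features, and that System.Linq is available; treat these names as assumptions to verify.

  // Find two detected objects and compare their bounding boxes.
  var cup = analysis.Objects.FirstOrDefault(o => o.ObjectProperty == "cup");
  var vase = analysis.Objects.FirstOrDefault(o => o.ObjectProperty == "vase");

  if (cup != null && vase != null)
  {
      // The cup is "to the left" if its box ends before the vase's box begins.
      bool cupLeftOfVase = cup.Rectangle.X + cup.Rectangle.W <= vase.Rectangle.X;
      Console.WriteLine(cupLeftOfVase
          ? "The cup is to the left of the vase."
          : "The cup is not to the left of the vase.");
  }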
Detecting Brands
The brand detection API is a specialized version of the object detection API that has been trained on thousands of different product logos from around the world and can be used to detect brands in both still images and video.
Like the object detection API, it returns details of the brand detected and the bounding box coordinates indicating where in the image it can be found. For example, you could run both object and brand detection on an image to identify a computer on a table as a Microsoft Surface laptop.
You call the brand detection API with the Analyze Image API, setting the query type to "brands" in the visualFeatures request parameter. Detected brands are returned in the JSON document’s “brands” block. The response for Figure 4-3 looks like this:
"brands": [ { "name": "Microsoft", "confidence": 0.659, "rectangle": { "x": 177, "y": 707, "w": 223, "h": 235 } } ]
Categorizing an Image
The Computer Vision API can also categorize an image. This is a high-level approach, useful for filtering a large image set to quickly determine if an image is relevant and whether you should be using more complex algorithms.
There are 86 different categories, organized in a parent/child hierarchy. For example, you can get an image category of “food_pizza” in the “food_” hierarchy. If you’re building a tool to determine pizza quality to assess whether restaurant franchises are following specifications, any image that doesn’t fit the category because it’s not a pizza can be rejected without spending more time on it.
It’s a quick and easy API to use, and one that is ideal for quickly parsing a large catalog of images, as well as for an initial filter. If you need more powerful categorization tools for images, PDFs, and other documents, consider the Cognitive Search tools covered in Chapter 7. An excerpt from the JSON response returned for the photograph of a crowd of people shown in Figure 4-4 follows the image.
"categories": [ { "name": "people_crowd", "score": 0.9453125 } ]
Describing an Image
Most of the Computer Vision tools return machine-readable information, using JSON documents to deliver results that can then be processed by your code to deliver the results you need. However, you may at times need a more human-oriented response, like text that can be used as a caption. This is ideal for assistive technologies, or for providing the human-readable elements of an image catalog.
You can access the image description feature via either the /analyze endpoint or the standalone /describe endpoint. Descriptions are returned in a JSON document as a list ordered by confidence, with associated tags that can give additional context. Following is an excerpt of the response for a photograph of the New York City skyline (see Figure 4-5):
"description": { "tags": [ "outdoor", "photo", "large", "white", "city", "building", "black", "sitting", "water", "big", "tall", "skyscraper", "old", "boat", "bird", "street", "parked", "river" ], "captions": [ { "text": "a black and white photo of a large city", "confidence": 0.9244712774886765 } ] }
You can use the confidence level to have the tool automatically choose the highest-ranked description if you always want to get a single result, or you may prefer to show users multiple possible descriptions when the confidence levels are lower so that they can pick the most appropriate one manually.
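Here is a minimal sketch of that logic against the Description.Captions list returned earlier; the 0.8 cutoff is an example value, the member names are assumptions from the C# SDK, and System.Linq is required for the ordering.

  // Auto-select the best caption when the model is confident,
  // otherwise show all candidates so a user can choose.
  var captions = analysis.Description.Captions
      .OrderByDescending(c => c.Confidence)
      .ToList();

  if (captions.Count > 0 && captions[0].Confidence >= 0.8)
  {
      Console.WriteLine(captions[0].Text);
  }
  else
  {
      foreach (var caption in captions)
      {
          Console.WriteLine($"{caption.Text} ({caption.Confidence:P1})");
      }
  }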
Detecting Faces
While the Face API offers a suite of more powerful face recognition services, you can get quick access to basic facial analysis capabilities through the Analyze Image API. This detects the faces in an image, along with an indication of age and gender and bounding box coordinates.
Data is returned using the familiar JSON document format, with different responses for single and multiple faces. Your code will need to be able to work with responses with one or more face blocks, because images may contain multiple faces (as in Figure 4-6).
Here is an example of the “faces” block of the JSON response for the picture of two people in Figure 4-6:
"faces": [ { "age": 30, "gender": "Male", "faceRectangle": { "left": 1074, "top": 292, "width": 328, "height": 328 } }, { "age": 28, "gender": "Female", "faceRectangle": { "left": 947, "top": 619, "width": 308, "height": 308 } } ]
You may find this familiar—this API was the basis of the popular “How Old” service.
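A short, hedged sketch of reading those face blocks from the SDK's ImageAnalysis result; the Faces, Age, Gender, and FaceRectangle member names are assumptions to verify, and VisualFeatureTypes.Faces must be among the requested features (as it was in the earlier example).

  foreach (var face in analysis.Faces)
  {
      Console.WriteLine(
          $"{face.Gender}, about {face.Age}, at left={face.FaceRectangle.Left}, top={face.FaceRectangle.Top}");
  }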
Detecting Image Types
Sometimes it’s useful to be able to categorize the type of image that’s being analyzed. The Analyze Image API can detect whether an image is clip art or a line drawing, returning the responses (on a simple 0 to 3 scale) in the imageType field. A value of 0 indicates that the image is not of that type, while a value of 3 indicates a high likelihood that it is clip art or a line drawing (as in Figure 4-7).
A sketch of a rose like the one in Figure 4-7 might return the following image type information in the JSON response:
"imageType": { "clipArtType": 3, "lineDrawingType": 1 }
To detect photographs, use the same API: a 0 return value for both image types is an indication it’s neither clip art nor a line drawing.
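A minimal sketch of that check, assuming the SDK's ImageType, ClipArtType, and LineDrawingType members:

  // Print the raw scores; 0 for both suggests the image is a photograph.
  Console.WriteLine($"clipArtType: {analysis.ImageType.ClipArtType}, " +
                    $"lineDrawingType: {analysis.ImageType.LineDrawingType}");

  bool isPhotograph = analysis.ImageType.ClipArtType == 0 &&
                      analysis.ImageType.LineDrawingType == 0;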
Detecting Domain-Specific Content
While most of the Computer Vision tools are designed for general-purpose image classification, a small set of APIs are trained to work against specific image sets. Currently there are two domain-specific models available: for celebrities and for landmarks. You can use them as standalone categorization tools, or as an extension to the existing toolset.
Like the other APIs, these domain-specific models can be called via REST, using the models/<model>/analyze URI within the Computer Vision namespace. Results are in the standard JSON document format and include a bounding box for the recognized object, the name, and a confidence level.
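Because the SDK helpers for domain models vary by version, here is a hedged raw REST sketch of calling the celebrities model with HttpClient; the v2.0 path segment, the request body shape, and the image URL are assumptions to check against the current Computer Vision documentation, and System.Net.Http and System.Text are required.

  var http = new HttpClient();
  http.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", "<Your Subscription Key>");

  // models/<model>/analyze, here with the celebrities model.
  var uri = "<Your Service Endpoint>/vision/v2.0/models/celebrities/analyze";
  var body = new StringContent(
      "{\"url\":\"http://example.com/celebrity.jpg\"}",
      Encoding.UTF8, "application/json");

  HttpResponseMessage response = await http.PostAsync(uri, body);
  Console.WriteLine(await response.Content.ReadAsStringAsync());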
Detecting the Color Scheme
Image analysis isn’t only useful for detecting people or objects; much of the information in an image can be used in your applications. For example, if you’re looking for anomalies using computer vision, a change in color can be a useful indicator. The color scheme analysis feature in the Analyze Image API extracts the dominant foreground and background colors, as well as a set of dominant colors for an image. It also details the most vibrant color in the image as an accent color. Dominant colors are chosen from a set of 12 possibilities, while the accent is shown as an HTML color code.
The JSON response also contains a Boolean value, isBWImg, that is used to indicate whether an image is in color or black and white. Here is an example excerpt of the JSON response for the sunset image in Figure 4-8:
"color": { "dominantColorForeground": "Brown", "dominantColorBackground": "Black", "dominantColors": [ "Brown", "Black" ], "accentColor": "C69405", "isBWImg": false }
Generating a Thumbnail
Naively cropping an image or reducing the resolution to create thumbnails can lead to the loss of valuable information. The thumbnail API allows you to first identify the area of interest in the image for cropping. The result is a more useful thumbnail for your users. For example, if you start with a photograph of a hummingbird feeding at a flower, the API will generate a thumbnail showing the bird. The response from the service is the binary data of a cropped and resized image you can download and use in your applications.
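A hedged sketch of that call with the C# SDK: GenerateThumbnailAsync, its parameter order, and the smart-cropping flag are assumptions to verify against your SDK version, the image URL is a placeholder, and System.IO is required for saving the result.

  // Request a 100x100 smart-cropped thumbnail and save the returned bytes to disk.
  using (Stream thumbnail = await computerVision.GenerateThumbnailAsync(
             100, 100, "http://example.com/hummingbird.jpg", smartCropping: true))
  using (FileStream file = File.Create("thumbnail.jpg"))
  {
      await thumbnail.CopyToAsync(file);
  }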
Getting the Area of Interest
If you want to highlight an area of the image for further processing rather than cropping it, the area of interest API uses the same underlying algorithm as generating a thumbnail but returns the bounding box coordinates for you to work with.
Extracting Text from Images
The Computer Vision API has three different tools for handling text in images. The first, OCR, is an older model that uses synchronous recognition to extract small amounts of text from images. Using a standard uploaded image, it will recognize text that’s rotated up to 40 degrees from any vertical. It can return coordinates of the bounding boxes around words, so you can reconstruct sentences. However, there are issues with partial recognition and busy images. It’s best used when there’s a small amount of text in an image.
A second API, Recognize Text, is in preview and being deprecated in favor of the newer, more modern Read API; however, it’s still available if you need it. The Read API offers the best performance, but currently only supports English. Designed for text-heavy documents, it can recognize a range of text styles in both printed (PDF) and handwritten documents. The API follows a standard asynchronous process, which can take some time. The initial call returns an operationLocation that is used to construct a URL to retrieve the recognized text.
If a process is running it will return a “running” status code. Once you get “succeeded” you will also receive a JSON object that contains the recognized text as a string along with document analytics. Each separate word has a bounding box, a rotation, and an indicator of whether the recognition has a low confidence score.
To make a call to the Read API using the C# SDK, you first need to instantiate the client:
ComputerVisionClient computerVision = new ComputerVisionClient(
    new ApiKeyServiceClientCredentials("<Your Subscription Key>"),
    new System.Net.Http.DelegatingHandler[] { })
{
    Endpoint = "<Your Service Endpoint>"
};
Then you can start the async process to extract the text. Here we show how to do this with an image URL, but you could also upload an image from a file:
const string imageUrl = "https://example.com/image.jpg";

BatchReadFileHeaders textHeaders = await computerVision.BatchReadFileAsync(
    imageUrl, TextRecognitionMode.Handwritten);
Since this process happens asynchronously, the service will respond with an operation location from which you can extract an OperationId to check the status of your request:
// Extract the OperationId from the operation location returned above
// (the assignment from textHeaders was not shown in the original snippet).
string operationLocation = textHeaders.OperationLocation;

const int numberOfCharsInOperationId = 36;
string operationId = operationLocation.Substring(
    operationLocation.Length - numberOfCharsInOperationId);

// Fetch an initial status so the loop has something to test,
// then poll until the operation finishes or we run out of retries.
ReadOperationResult result = await computerVision.GetReadOperationResultAsync(operationId);

int i = 0;
int maxRetries = 10;

while ((result.Status == TextOperationStatusCodes.Running ||
        result.Status == TextOperationStatusCodes.NotStarted) && i++ < maxRetries)
{
    await Task.Delay(1000); // brief pause between polls
    result = await computerVision.GetReadOperationResultAsync(operationId);
}
Once the service is done, you can display the results. Here we just print the extracted text, but other information such as the bounding box for each word may be included in the response:
var recResults = result.RecognitionResults;

foreach (TextRecognitionResult recResult in recResults)
{
    foreach (Line line in recResult.Lines)
    {
        Console.WriteLine(line.Text);
    }
}
Here is an excerpt from the JSON response for the words recognized in the image of a shopping list in Figure 4-9:
{ "boundingBox": [ 2260, 841, 2796, 850, 2796, 994, 2259, 998 ], "text": "Grocery" },
Custom Vision
For many business-specific use cases, you may find that the general image tagging and object detection services provided by the Computer Vision API are not accurate enough. The Custom Vision API solves this problem by letting you build your own custom classifier based on a relatively small set of labeled images that show the objects, conditions, and concepts you need to recognize. For example, you can use this service for very specific use cases like identifying a circuit board that wasn’t soldered correctly or distinguishing between an infected leaf and a healthy one. You can even export these models to a smartphone and give employees an app that provides real-time feedback.
Custom Vision uses a machine learning technique called transfer learning to fine-tune the generalized models based on the sample images you provide. This process lets you get great performance using only a small number of images (versus the millions used to train the general classifier). For the best results, your training set needs at least 30 to 50 images, ideally with a good range of camera angles, lighting, and backgrounds. These images should match how they will be captured in your production application. If the camera angle or background will be fixed, label common objects that will always be in the shot.
To get started building a Custom Vision model, you first need to choose whether you want a model for detecting objects or classifying the entire image. If your use case is particularly complex, you can create multiple models and layer them to improve discrimination in classes that are easy to confuse (like tomatoes and bell peppers or sandwiches and layer cakes). Where possible, it’s important to have a similar number of images for each tag. You can also add images that you tag as “negative samples,” to tell the classifier that it shouldn’t match any of your tags to these types of images.
After you choose which models you are going to create, you need to provide training data or examples of the objects or classes to the service. You can create the model and upload images through the service’s website, or in code via API calls.
After training (which takes only a few minutes), you can see the precision and recall performance of your model on the website. Precision shows what percentage of classifications are correct. To illustrate this concept, imagine if the model identified 1,000 images as bananas, but only 974 were actually pictures of bananas—that’s 97.4% precision. Recall measures the percentage of all examples of a class that were correctly identified. For example, if you had 1,000 images of bananas but the model only identified 933, the recall would be 93.3%; if the model correctly identified 992 of the 1,000 images of bananas, then the recall would be 99.2%. Figure 4-10 shows the easy-to-read graphic in the Custom Vision portal. Here you can see a breakdown of the overall precision and recall of the model, as well as the performance per tag.
Classifications are determined by a threshold you set on the returned probability for each class. For each image the model analyzes, it will return the predicted classifications (or objects) and a corresponding probability between 0 and 1. The probability is a measure of the model’s confidence that the classification is correct. A probability of 1 means the model is very confident. The service will consider any predictions with a probability greater than the threshold you set as a predicted class. Setting the threshold high favors precision over recall—classifications will be more accurate, but fewer of them will be found. Setting it low will favor recall—most of the classifications will be found, but there will be more false positives. Experiment with this and use the threshold value that best suits your project. Before launching your application, you’ll want to test your model with new images and verify performance.
For challenging data sets or where you need very fine-grained classification, the Advanced Training option in the portal lets you specify how long you want the Custom Vision service to spend training the model. In general, the longer the model trains, the better the performance is. Once you’re happy with the performance of a model, you can publish it as a prediction API from the Performance tab in the portal (or via an API) and get the prediction URL and prediction key to call in your code.
How to Train and Call a Custom Vision Model
The following code snippet shows how to train and call a Custom Vision model using version 1 of the C# SDK. First-time users may also want to walk through the steps on the website.
First, instantiate the client:
CustomVisionTrainingClient trainingApi = new CustomVisionTrainingClient()
{
    ApiKey = "<Your Training Key>",
    Endpoint = "<Your Service Endpoint>"
};
Next, create a new project:
var project = trainingApi.CreateProject("My New Project");
Create the image tags to apply to recognized images:
var japaneseCherryTag = trainingApi.CreateTag(project.Id, "Japanese Cherry");
And load the training images from disk. It is often helpful to put images with different classes in separate folders:
var japaneseCherryImages = Directory.GetFiles(
    Path.Combine("Images", "Japanese Cherry")).ToList();
We’re uploading the images in a single batch:
var imageFiles = japaneseCherryImages.Select(img =>
    new ImageFileCreateEntry(Path.GetFileName(img), File.ReadAllBytes(img))).ToList();

trainingApi.CreateImagesFromFiles(project.Id,
    new ImageFileCreateBatch(imageFiles, new List<Guid>() { japaneseCherryTag.Id }));
Now we can start training the Custom Vision model:
var iteration = trainingApi.TrainProject(project.Id);
Training happens asynchronously, and we will keep querying to find out when the training is complete:
while (iteration.Status == "Training")
{
    Thread.Sleep(1000);
    iteration = trainingApi.GetIteration(project.Id, iteration.Id);
}
Once the iteration is trained, we publish it to the prediction endpoint (you can find the prediction resource ID in the Custom Vision portal under Settings):
var publishedModelName = "<Published Model Name>";
var predictionResourceId = "<Prediction Resource ID>";

trainingApi.PublishIteration(project.Id, iteration.Id,
    publishedModelName, predictionResourceId);
Now we can start making predictions that classify images. First, we need to create a new prediction client:
CustomVisionPredictionClient endpoint = new CustomVisionPredictionClient()
{
    ApiKey = "<Your Prediction Key>",
    Endpoint = "<Your Service Endpoint>"
};
Then we can make a prediction:
// testImage is a Stream containing the image you want to classify.
var result = endpoint.ClassifyImage(project.Id, publishedModelName, testImage);
And we can loop over each prediction and write out the results:
foreach (var c in result.Predictions)
{
    Console.WriteLine($"\t{c.TagName}: {c.Probability:P1}");
}
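To apply the probability threshold discussed earlier, you can filter the predictions before acting on them. This is a minimal sketch: the 0.75 cutoff is an example value to tune, and System.Linq is required.

  const double threshold = 0.75;

  var accepted = result.Predictions.Where(p => p.Probability >= threshold);

  foreach (var prediction in accepted)
  {
      Console.WriteLine($"{prediction.TagName}: {prediction.Probability:P1}");
  }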
Face
The Face API delivers much more detailed information than the simple face recognition feature included in the Computer Vision API. You can also use it to compare two faces or to search by face for images of the same person.
At the heart of the Face API is a set of detection tools to extract the human faces from an image and provide a bounding box to indicate where each face is in the image. It’ll also give you additional information, including details on the pose position, gender, approximate age, emotion, smile intensity, facial hair, and whether or not the person is wearing glasses (and what type). You can even extract a 27-point array of face landmarks, which can be used to give further information about a face.
The API is powerful: up to 64 different faces can be returned per image. The more faces you’re detecting, though, the more time detection takes—especially if you are extracting additional attributes. For large groups it’s better to get the minimum information your app needs, and then run deeper analysis on a face-by-face basis.
The face verification endpoint lets you verify whether two faces belong to the same person or whether a face image belongs to a specific person. This is a useful tool for identifying a user and providing a personalized experience. For offline scenarios, this endpoint is available as a container.
The tools for finding similar faces might seem like those used for face verification, but they don’t operate at the same level. Here a verified target face is compared to an array of candidate faces, helping you track down other instances of that person. Faces returned may or may not be the same person. You can switch between a more accurate matchPerson mode and a less accurate matchFace mode, which only looks for similarities.
In both cases, the API also returns a confidence score that you can use to set the cutoff point for verification or similar faces. You will need to think about what level of confidence is acceptable in your scenario. For example, do you need to err on the side of protecting sensitive information and resources?
The person identification capability is a more generalized case of face verification. For this scenario, a large database of tagged data is used to identify individuals—for example, identifying people in a photo library where a known group of friends can be used to automatically apply tags as the images are uploaded. This can be a large-scale database, with up to a million people in a group and with up to 248 different faces per person. You will need to train the API with your source data, and once trained it can be used to identify individuals in an uploaded image.
If you’ve got a group of faces and no verified images to test against, you have the option of using the face grouping tools to extract similar faces from a set of faces. The results are returned in several groups, though the same person may appear in multiple groups as they are being sorted by a specific trait (for example, a group where all the members are smiling, or one where they all have blond hair).
How to Use the Face API
The following sample code shows how to use the Face API. Don’t forget to substitute in your subscription key and image URL, as well as choosing an Azure Cognitive Services endpoint.
First, we initialize a Face client:
FaceClient faceClient = new FaceClient(
    new ApiKeyServiceClientCredentials("<Your Subscription Key>"),
    new System.Net.Http.DelegatingHandler[] { });

faceClient.Endpoint = "<Your Service Endpoint>";
Now we can detect faces and extract attributes:
IList<DetectedFace> faceList = await faceClient.Face.DetectWithUrlAsync(
    "<Remote Image URL>",
    true,   // return face IDs
    false,  // skip face landmarks
    new List<FaceAttributeType>() { FaceAttributeType.Age, FaceAttributeType.Gender });
Here we extract the age and gender attributes returned by the model:
string attributes = string.Empty;

foreach (DetectedFace face in faceList)
{
    double? age = face.FaceAttributes.Age;
    string gender = face.FaceAttributes.Gender.ToString();
    attributes += gender + " " + age + " ";
}
We can display the face attributes like so:
Console.WriteLine("<Remote Image URL>");
Console.WriteLine(attributes + "\n");
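Building on the faceClient and faceList above, here is a hedged sketch of the face verification call described earlier. VerifyFaceToFaceAsync, FaceId, IsIdentical, and Confidence are names assumed from the Face C# SDK, the detection call must have requested face IDs (as it did above), and the 0.7 cutoff is an example value.

  // Compare the first two detected faces and apply your own confidence cutoff.
  Guid faceId1 = faceList[0].FaceId.Value;
  Guid faceId2 = faceList[1].FaceId.Value;

  var verification = await faceClient.Face.VerifyFaceToFaceAsync(faceId1, faceId2);

  if (verification.IsIdentical && verification.Confidence >= 0.7)
  {
      Console.WriteLine("These two faces belong to the same person.");
  }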
Form Recognizer
Many businesses have mountains of unstructured data sitting in PDFs, images, and paper documents. While these resources may contain the data and insights needed to drive the business forward, they often sit unutilized due to the immense cost and complexity of converting them into structured data. Form Recognizer lowers this barrier and can accelerate your business processes by automating the information extraction steps. Using this service you can turn PDFs or images of forms into usable data at a fraction of the usual time and cost, so you can focus on acting on the information rather than compiling it.
The service uses advanced machine learning techniques to accurately extract text, key/value pairs, and tables from documents. It includes a prebuilt model for reading sales receipts that pulls out key information such as the time and date of the transaction, merchant information, amount of tax, and total cost—and with just a few samples, you can customize the model to understand your own documents. When you submit your input data, the algorithm clusters the forms by type, discovers what keys and tables are present, and associates values to keys and entries to tables. The service then outputs the results as structured data that includes the relationships in the original file. After you train the model, you can test and retrain it and eventually use it to reliably extract data from more forms according to your needs.
As with all the Cognitive Services, you can use the trained model by calling the simple REST APIs or using the client libraries.
Ink Recognizer
Natural user interfaces are the next evolution in the way we interact with computers. A natural interface is one that mimics or aligns with our own natural behavior, relying for example on speech, hand gestures, or handwriting detection. One of the barriers to providing a seamless natural interface for users is understanding and digitizing a person’s writings and drawings. The Ink Recognizer service provides a powerful ready-to-use tool for recognizing and understanding digital ink content. Unlike other services that analyze an image of the drawing, it uses digital ink stroke data as input. Digital ink strokes are time-ordered sets of 2D points (x,y coordinates) that represent the motion of input tools such as digital pens or fingers. The service analyzes this data, recognizes the shapes and handwritten content, and returns a JSON response containing all the recognized entities (as shown in Figure 4-11).
This powerful tool lets you easily create applications with capabilities like converting handwriting to text and making inked content searchable.
Video Indexer
Not every image you’ll want to analyze is a still image. The Video Indexer service provides both APIs and an interactive website to extract information from videos. Once you upload a video, the service will run it through a large number of models to extract useful data such as faces, emotions, and detected objects. This metadata can then be used to index or control playback. Put it all together and you can take a one-hour video, extract the people and topics, add captions, and put in links that start the video playing in the right place.
Video Indexer is a cloud application built on top of the Media Analytics service, Azure Search, and Cognitive Services. You can explore all the features the service has to offer through the website, and you can also automate video processing using the REST APIs.
Some features, like face identification, let you create custom models, but most work in much the same way as the still image analysis tools available through the Cognitive Services.
As the Video Indexer is a mix of services, you’ll need to register and obtain tokens before you can use it. You will need to generate a new access token every hour. Videos are best uploaded to public cloud services like OneDrive or an Azure Blob. Once uploaded, you must provide the location URL to the Video Indexer APIs.
Once you’ve logged in, you can start to process your videos and gain new insights. The insights cover both video and audio. Video insights include detecting faces and individuals, extracting thumbnail images, identifying objects, and extracting text. There are also insights specific to produced videos, such as identifying the opening or closing credits of a show, key frames, and blank frames (see Figure 4-12).
Audio insights include detecting language, transcribing audio (with the option of using custom language models), creating captions (with translation), detecting sounds like clapping (or silence), detecting emotions, and even identifying who speaks which words and generating statistics for how often each person speaks. You can also clean up noisy audio using Skype filters.
The Video Indexer is one of the more complex offerings in the Cognitive Services. It can require a considerable amount of programming to navigate the index object and extract usable data. You can simplify the process by working in the portal or using Microsoft’s own widgets in your applications.