Chapter 4. Using Azure Cognitive Services to Build Intelligent Applications
In the previous chapter we looked at how a cloud service like Azure Machine Learning helps you focus on building and training machine learning models without needing to create an entire machine learning environment from scratch. But not every developer wants to build their own machine learning models, so in this chapter we're going to show you how to use ready-made AI services that you can call like any other API, either out of the box or customized with your own training data.
Using Prebuilt AI
The term “AI” is used very broadly these days and covers many different approaches, but techniques for having computers perform tasks that we used to think only people could do, like understanding and learning from experience, are fundamental. They include “cognitive” tasks like recognizing speech and images to improve customer service, detecting faces in a photo or even using a selfie to authenticate to an app, understanding speech that’s full of the product names and technical terms used in your industry, or synthesizing speech from text.
Want to let your users take a photograph of a menu, translate it into another language, and get photographs showing what their food might look like? How about creating a chatbot that can deliver text and voice chat for customer support but also recognize pictures of your products that customers send in, spot whether the item is broken, and kick off the return process? Those are all powerful AI-powered features that you could build into your existing apps and tools using APIs for these cognitive tasks.
This is a fast-moving area of AI, with new algorithms and techniques being developed all the time that are complex to implement. Using prebuilt but customizable APIs that deliver cognitive tasks as a cloud service gives developers a fast way to take advantage of the business value AI can bring and give their apps a human side, without having to become data scientists. You don’t have to build the model, manage the production environment for a machine learning system—or secure it.
You don’t have to train the models used in Cognitive Services (although you can build custom models in some services). Microsoft delivers pretrained models as services and regularly updates them with improved training sets to ensure that they stay relevant and can work with as wide a range of source materials as possible. New and improved algorithms and models are regularly added to the different services; you may find your app gets more powerful without you needing to make any changes, or there will be new options to work with. In some cases, developers get access to new models as quickly as the teams inside Microsoft.
The Bing Search app for iOS and Android can generate speech that sounds almost exactly like a person speaking; that’s important because research shows it’s much less tiring to listen to results, directions, or something longer like an audiobook with the natural intonations of a human voice and with all the words articulated clearly.
Using deep neural networks to do voice synthesis and prosody (matching the patterns of stress and intonation in speech) together rather than as separate steps produces more natural and fluid speech. This is a relatively new development that was in research labs a couple of years ago, and new research papers are still coming out with refinements. But several months before the Bing team added neural voice synthesis to their mobile app, the Cognitive Services Speech API already included a preview of two neural text-to-speech voices in English, followed by Chinese, German, and Italian voices. Now companies like Progressive Insurance use custom neural voices: the Flo chatbot speaks with the voice of actor Stephanie Courtney, thanks to Cognitive Services.
Even companies with deep expertise in AI turn to these services rather than creating their own implementations. When Uber wanted to ensure the person driving the car was the registered driver who was supposed to show up as your ride, even if they’d cut their hair or changed their glasses since they got their ID photo taken, they used the Face API in Azure Cognitive Services to have drivers take a selfie on their phone. The team at Uber uses machine learning extensively and even builds open source tools for machine learning development. But they chose Cognitive Services because they were able to deliver the new feature in a few lines of code rather than the months of development it would have taken to build face detection into their own platform.
The REST APIs and client SDKs (for languages including .NET, Python, Java, and Node.js) available through Azure Cognitive Services let developers use and customize the latest AI models for computer vision, text and video analytics, speech, and knowledge understanding without needing to implement, train, or host their own models. Cognitive Services can be called from Azure Functions and Azure App Service, or from within Apache Spark, Azure Databricks, Azure Synapse Analytics, and other data processing services if you need to enrich or annotate big data. (They’re also available inside the Power Platform and in Logic Apps for no-code and low-code developers: we’ll be covering how to use those in Chapter 6.)
As cloud services, Cognitive Services work at scale, for thousands or millions of users, reaching 150 countries, from more than 30 Azure regions around the world, with data stored and retained in compliant ways to give users control over their data. (Check out Chapter 9 for the details of what it takes to scale up machine learning services like this.) The APIs run with strict SLAs and are guaranteed to be available at least 99.9% of the time. Services are localized into multiple languages, with some available in over 100 different languages and dialects; speech-to-text, for example, is available in 197 languages and dialects. The services comply with ISO, SOC 2, and HIPAA standards.
But you can also take some of the most useful Cognitive Services and run them locally, by building the trained model right into a smartphone app that uses the AI offload hardware on the phone, or running them in a container inside an IoT device where they can work directly with sensor readings as they’re generated.
That’s ideal for the remote, demanding environments where IoT devices are the most useful and connectivity is slow, expensive, or both. It also addresses questions of data governance: if you’re using image recognition to analyze medical documents for insurance, you don’t have to worry about the compliance issues of taking them outside the hospital network to analyze them in the cloud.
The core Cognitive Services provide skills in the areas of speech, vision, and language, including the Azure OpenAI Service, as well as services for making decisions and detecting anomalies (and you can call multiple services in the same app).
Azure Applied AI Services, which we cover in the next chapter, combine these core services into tools for common scenarios, like understanding video or processing paperwork and documents. Azure Form Recognizer uses vision and language Cognitive Services and business logic to automate dealing with forms, as you can see in Figure 4-1.
Think of Cognitive Services as the building blocks that let any developer build an AI-powered solution: Applied AI Services add task-specific AI models and business logic for common problems like digital asset management, extracting information from documents, and analyzing and reacting to real-time data.
The Core Azure Cognitive Services
There are dozens of different Cognitive Services grouped into the key areas shown in Figure 4-2: speech, text, vision, and decision making, plus OpenAI. We don’t have space to cover them all in detail here;1 instead, we’re going to show you how to use some of the most popular services—but you should find working with any of the APIs and SDKs a similar experience.
Language
Analyze, understand, and translate text with the language APIs (or use them together with the speech services we previously mentioned).
You can turn your FAQ into an interactive chatbot with the Question Answering service, extract not just keywords but the intent of what users are saying or typing with Language Understanding, or translate in near-real time, using the terms that matter in your own business and industry. That includes full document translation, even of complex PDF files, preserving the layout and format.
The Text Analytics API takes raw text and extracts the sentiment behind the text, the key phrases it uses, the language it’s written in, and the entities it refers to. A sentence refers to “Rover”; is it a dog or a car? Is “Mars” the planet or the British candy bar? Entity recognition can find time zones, temperatures, numbers and percentages, places, people, quantities, businesses, dates and times, URLs, and email addresses. There’s a healthcare-specific entity recognition service that can extract medical information from unstructured documents like doctor’s notes and electronic health records, detecting terms that might be a diagnosis, a condition or symptom, the name of a medicine, part of the body, and other important medical concepts.
Tip
Use sentiment analysis to filter comments and reviews from customers to feature on your site, or take the results and feed them into Power BI to generate actionable data. Use the opinion mining option to pull out subjects and opinions to get more details. If a message appears to be a complaint, you can see not only the negative sentiment rating, but also terms like “room” or “handle” and phrases like “was cold” or “broke off in my hand,” allowing you to respond quickly to customer problems.
Key phrases and entities aren’t enough to give you the intent of every phrase or sentence. We all have different ways of talking and typing, using different words to mean the same thing. When someone ordering a pizza through a chat interface asks for it to be delivered to their digs, what do they mean? Could they really want their pizza in a hole?
Language Understanding (LUIS) maps keywords to a list of things you expect your users to be asking for and turns a conversation into a list of ways an app or chatbot can respond.
Adding LUIS to a travel agency chatbot, for example, can narrow the information needed to help a customer. A statement like “I need a flight to Heathrow” will be parsed with the intent “BookTravel” and entities “flight” and “London.” Prefilling those entities into a booking engine starts the booking process, while the chatbot prompts the user for additional information, like dates, class of travel, and the departure airport. You can see how LUIS extracts intent from a text string in Figure 4-3.
LUIS is not a general-purpose machine learning model; to get the most out of it, you have to train it with domain-specific data for your industry, location, and scenarios. A set of prebuilt models for specific domains can help you get started, but they’ll need additional training if you want the best results.
Translator
Microsoft Translator is a cloud-based machine translation service with multilanguage support for translation, transliteration, language detection, and dictionaries: it handles more than 100 languages and dialects. The core service is the Translator Text API, which is used in multiple Microsoft products as well as being available through Cognitive Services. The same API also powers speech translation, and we’ll talk about that next.
The Translator Text API uses deep learning-powered neural machine translation, the technique that has revolutionized machine translation in the last decade, delivering more accurate translations that sound more natural and fluent because it translates words as part of a full sentence rather than only looking at a few words around each target word. The translation engine also makes multiple attempts at a translation, learning from previous translations to refine the result. The result is a more human-sounding translation, particularly with non-Western languages. Check the current list of supported languages.
You access the models through a REST API. While you can force operation through a specific region, Microsoft recommends using the Global option, which dynamically directs calls to any available endpoint, usually the one closest to the request location. The following Python code snippet calls a translation from English to Dutch and Brazilian Portuguese:
import requests, uuid, json

subscription_key = "YOUR_SUBSCRIPTION_KEY"
endpoint = "https://api.cognitive.microsofttranslator.com"
location = "YOUR_RESOURCE_LOCATION"

path = '/translate'
constructed_url = endpoint + path

params = {
    'api-version': '3.0',
    'from': 'en',
    'to': ['nl', 'pt']
}

headers = {
    'Ocp-Apim-Subscription-Key': subscription_key,
    'Ocp-Apim-Subscription-Region': location,
    'Content-type': 'application/json',
    'X-ClientTraceId': str(uuid.uuid4())
}

# You can pass more than one object in body.
body = [{
    'text': 'YOUR TEXT TO TRANSLATE'
}]

request = requests.post(constructed_url, params=params, headers=headers, json=body)
response = request.json()
You can have more than one target language in your translation request, with each translation in the same JSON return. The response contents indicate the detected source language and include a translation block with text and a language identifier for each selected target language:
[ { "translations": [ { "text": "TRANSLATED TEXT IN DUTCH", "to": "nl" }, { "text": "TRANSLATED TEXT IN PORTUGUESE", "to": "pt" } ] } ]
The returned data can be parsed by your choice of JSON library.
For additional translations you can use the Dictionary Lookup API, which will return alternates for the phrase you submit. The JSON data returned will have both the source and translated text, with a back translation to help you check that the translation is correct. The response will also give you details about the word or phrase you’re translating.
You may also want to identify the language that’s being used, so you don’t waste API calls on the wrong language pairing or content that can’t be translated.
Transliteration is useful for, say, converting Japanese or Chinese characters or Cyrillic text to Western script. In the REST request, set a from script and a to script, and put the text you wish to transliterate in the JSON body; the returned JSON will contain the transliterated text, as in the sketch below.
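Here's a minimal sketch of such a transliteration request, following the pattern of the earlier translation snippet; the key, region, and sample text are placeholders, and you'll want to adjust the language and script codes for your own content:

import requests, uuid

subscription_key = "YOUR_SUBSCRIPTION_KEY"
location = "YOUR_RESOURCE_LOCATION"
endpoint = "https://api.cognitive.microsofttranslator.com"

params = {
    'api-version': '3.0',
    'language': 'ja',      # language of the input text
    'fromScript': 'Jpan',  # Japanese script
    'toScript': 'Latn'     # transliterate to Latin script
}
headers = {
    'Ocp-Apim-Subscription-Key': subscription_key,
    'Ocp-Apim-Subscription-Region': location,
    'Content-type': 'application/json',
    'X-ClientTraceId': str(uuid.uuid4())
}
body = [{'text': 'こんにちは'}]

response = requests.post(endpoint + '/transliterate', params=params,
                         headers=headers, json=body).json()
print(response[0]['text'])  # the transliterated text, roughly "Kon'nichiwa"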
Combine the different capabilities to create your own translation service—for example, detecting Japanese text, transliterating it to Western script at the same time as translating it, while displaying any alternate translations that might be available.
The Translator Text APIs are extensible; if you need to tag only a few product names, you can apply markup to those phrases to supply the way they should be translated. But if you need translations to cover industry-specific terms, or language that’s essential to your business, Custom Translator lets you extend the default translation neural network models.
Tip
The main metric for machine translation is Bilingual Evaluation Understudy (BLEU) score: a number from 0 (the worst score) to 1 (the best); this is calculated by comparing a translation done by your model to existing reference translations done by human translators.
Custom Translator supports more than three dozen languages and lets you add words and phrases that are specific to your business, industry, or region. You can build a new translation model using “parallel” documents: pairs of documents that have already been translated so they have the same content in two languages in common formats. The service can also match sentences that are the same content in separate documents; either way, you need at least 10,000 parallel sentences. You can also supply a dictionary of specific words, phrases, and sentences that you always want translated the same way; that’s useful for product names, technical terms that need to match the region, or legal boilerplate.
Training is relatively quick, on the order of a few hours, and can result in a significant improvement in both text and voice translations.
Upload dictionaries, training, tuning, and test documents for each language pair you want to use in the Custom Translator portal, where you can also share access with colleagues working on the same translation project. Or you can upload training data (as shown in Figure 4-4) and leave Custom Translator to build the tuning and test sets when you click “Create model.”
You can also add training data via an API and even use that API to build your own interfaces to the service, either adding it to your own document portal or making submission an automatic step in a document translation workflow.
As well as translating snippets of text, you can translate entire documents and keep the text formatting. Document translation works on PDF, Word, PowerPoint, CSV/Excel, Outlook messages, OpenDocument, HTML, Markdown, RTF, tab-separated, and plain-text files. Files can be up to 40 MB in size, in batches of up to 250 MB, but they can’t be secured with a password or information protection. Store the documents in containers in Azure Blob Storage (we show a suggested cloud architecture for this workflow in Figure 4-5): you can translate individual documents or the entire container.
Warning
All API requests to the Document Translation service need a read-only key for authenticating access and a custom domain endpoint with your resource name, hostname, and Translator subdirectories (https://&lt;NAME-OF-YOUR-RESOURCE&gt;.cognitiveservices.azure.com/translator/text/batch/v1.0). This isn’t the same as the global translator endpoint (api.cognitive.microsofttranslator.com) or the endpoint listed on the Keys and Endpoint page in your Azure portal.
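As a rough sketch (not production code), a batch request to that endpoint might look like the following; the resource name, key, and SAS URLs are placeholders, and it's worth checking the Document Translation reference for the current API version:

import requests

endpoint = "https://<NAME-OF-YOUR-RESOURCE>.cognitiveservices.azure.com/translator/text/batch/v1.0"
key = "YOUR_TRANSLATOR_KEY"

# Each input pairs a source blob container (or document) with one or more targets.
body = {
    "inputs": [{
        "source": {"sourceUrl": "<SAS-URL-OF-SOURCE-CONTAINER>"},
        "targets": [
            {"targetUrl": "<SAS-URL-OF-TARGET-CONTAINER>", "language": "nl"}
        ]
    }]
}

response = requests.post(
    endpoint + "/batches",
    headers={"Ocp-Apim-Subscription-Key": key, "Content-Type": "application/json"},
    json=body
)

# A successful submission returns 202 Accepted; the Operation-Location header
# gives you a URL to poll for the status of the translation job.
print(response.status_code, response.headers.get("Operation-Location"))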
The translations are high quality, but you may want to use them as a first step and have a native speaker improve on them before you use them. If you do use them directly, whether it’s a custom or standard translation, it’s important to let your users know that the content they’re reading used machine translation (and to give them a way to let you know if there are problems with the translation).
Azure OpenAI Service
If you’ve used the GitHub Copilot extension to generate code suggestions or had your grammar corrected while learning a language with Duolingo, you’ve seen OpenAI’s GPT-3 large language model in action. Several Microsoft products already have features based on OpenAI. Dynamics 365 Marketing uses it to suggest content to include in marketing messages. Power BI uses it to let less experienced users describe in natural language what they want to do with their data and get the complex DAX queries written for them (a task with a steep learning curve).
The OpenAI API lets you apply GPT-3 to a wide range of language and code tasks, including extraction, classification, and translation, by sending a few free-text examples of what you want to see (called the prompt). The model analyzes those examples, uses them for pattern matching, and predicts the best text to include in the response, which is also delivered as free text. This technique is known as “in-context” learning. Context can be preserved by including prior responses in the text sent with each API call, so if the interface in your app allows it, users will be able to ask questions of their data iteratively.
Tip
If you want a much deeper understanding of GPT-3, check out another O’Reilly book, GPT-3: Building Innovative NLP Products Using Large Language Models by Sandra Kublik and Shubham Saboo.
It’s useful for content generation: helping a user with creative writing, or generating summaries of articles or conversations, perhaps extracting the gist of a customer support call and creating action items or triaging top issues for human review. It can search through documents to find answers to user questions, matching the intent of a query to how the documents are semantically related, extracting keywords, or generating summaries, either to condense long text generally or to pull out key points. You could create an “I don’t understand this” button for education and training scenarios where OpenAI can rephrase the content in different words to help explain it. Simple ranking and answer extraction doesn’t need the power of OpenAI: save it for more generative, open-ended questions that need that flexibility.
Tip
The GPT-3 models offer the best performance with English, although they have some knowledge of French, German, Russian, and other languages. Although Codex models are most capable in Python, they can generate code in over a dozen languages including JavaScript, Go, Ruby, and SQL. New iterations of the model are regularly being released, so be sure to check the documentation for the latest guidance.
You can choose from four base GPT-3 models (Ada, Babbage, Curie, and Davinci) that can understand and generate natural language, as well as the Codex series of models that can understand and generate code, turning natural language prompts into code or explaining code in natural language. Use Codex for suggesting code that developers will review and refine, or to make your internal APIs and services more accessible to less-proficient developers by explaining how they work and offering on-demand code examples:
- Ada is the fastest GPT-3 model and is a good fit for tasks that don’t require too much nuance, like parsing text and correcting addresses: you could use it to extract patterns like airport codes.
- Davinci is the most capable model. It can perform all the tasks the other models can, often with fewer prompts, and delivers the best results on tasks that require more understanding of the content, like taking bullet points and generating different lengths of content like suggested headlines or marketing messages, or summarizing content in the right tone for specific audiences: you could choose a summary for schoolchildren or ask for a more business or professional tone. But it’s also the largest model and requires a lot more compute power.
You can experiment with different models to see which gives you the best trade-off between speed and capability. You can also choose between different prompt approaches as you move from quick prototyping to creating a customized model that you can scale for production.
To use the service, create an Azure OpenAI resource using the same procedure as any other Azure Cognitive Service. Once the resource has been created, Azure will generate access keys and an endpoint for use from your own code.
To process text, the service first breaks the text down into chunks called tokens. One token is roughly equal to a short word or a punctuation mark.
For example, a word like “doggerel” would be tokenized as “dog,” “ger,” and “el,” while “cat” would be a single token. The number of tokens used in a call determines the cost and response speed of an operation. API calls are limited by the number of tokens, which depends on the length of the input, output, and parameters. Use the OpenAI tokenizer tool to see how text is broken up into tokens.
Unlike other Cognitive Services, the OpenAI models use free text for input and output, using the natural language instructions and examples you provide as a prompt to set the context and predict probable next text.
This code will generate text using the Davinci natural language model; the prompt you include determines which of the three in-context learning techniques is used: zero-shot, few-shot, or one-shot learning.
Don’t think of this as retraining the model in the usual machine learning sense—you’re providing prompts at generation time, not updating the weights in a model. Instead, the models generate predictions about what the best text to return is, based on the context you include in the prompt, so providing different prompts as examples will give you different results:
import os
import openai

# The Azure OpenAI Service uses the openai library pointed at your own resource;
# the endpoint, key, and API version values here are placeholders, so check the
# Keys and Endpoint page for your resource and the docs for the current API version.
openai.api_type = "azure"
openai.api_base = os.getenv("AZURE_OPENAI_ENDPOINT")
openai.api_version = "2022-12-01"
openai.api_key = os.getenv("AZURE_OPENAI_API_KEY")

response = openai.Completion.create(
    engine="text-davinci-001",  # in Azure, this is the name of your model deployment
    prompt="{Customer's Prompt}",
    temperature=0,
    max_tokens=100,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    stop=["\n"]
)
- Zero-shot: You don’t have to give an example in the prompt. For quick prototyping, just state the objective and the model will generate a response. Accuracy and repeatability will depend heavily on your scenario. Models fine-tuned with your own data will let you use zero-shot prompts with greater accuracy.
- Few-shot: You’ll typically need a few more examples in the prompt to demonstrate the format and the level of detail you want in the response, to make the text generated for you more relevant and reliable (see the sketch after this list). There’s a maximum input length, but depending on how long the examples are, you can include up to around a hundred of them (though you may not need that many).
- One-shot: Where you want to show the format for the response but you don’t expect the service to need multiple examples, you can provide just one example.
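To make the few-shot approach concrete, here's a hedged sketch of a prompt with two worked examples, assuming the openai client has been configured as in the earlier snippet; the hotel-review text and labels are invented purely for illustration:

# A hypothetical few-shot prompt: two labeled examples establish the task and
# format, and the model is asked to complete the third line in the same style.
few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: The room was spotless and the staff were lovely.
Sentiment: Positive

Review: The heating was broken and nobody answered the phone.
Sentiment: Negative

Review: Check-in took an hour and the lift was out of order.
Sentiment:"""

response = openai.Completion.create(
    engine="text-davinci-001",   # your deployment name
    prompt=few_shot_prompt,
    temperature=0,
    max_tokens=5,
    stop=["\n"]
)
print(response.choices[0].text.strip())  # expected to come back as "Negative"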
The OpenAI Service is stochastic: even with the same prompts, you won’t necessarily get the same results every time (so if you use this in a chatbot or interface, it should feel fresh rather than predictable). If you ask for multiple results when you send the prompt, you can control the amount of variation in those results with the temperature parameter: the higher the value, the more variation you’ll see.
Warning
You’re not guaranteed to get as many responses as you request: sometimes the response returned may be blank, so you need to check for that and handle it in your code.
Experiment with zero-, one-, and few-shot prompts from different models to see what gets you the best result, and then use the API to submit a fine-tuning job with your prompt and completion examples to get a customized model you can deploy for testing and production.
Warning
Because the OpenAI Service produces text that sounds like a human wrote it, it’s important both to ensure that the content generated is appropriate for the way you’re going to use it and to make sure it can’t be misused. Learn how to create a responsible AI strategy for this in Chapter 7.
Speech
Speech recognition was one of the earliest areas of applied AI research, but it’s only in recent years that deep learning has made it powerful enough to use widely. The very first successful implementation of deep learning instead of the traditional speech recognition algorithms was funded by Microsoft Research, helping to transform the industry. In 2017, a system built by Microsoft researchers outperformed not just individuals but a team of humans, accurately transcribing the recorded phone conversations of the industry standard Switchboard dataset.
The Azure Speech Services cover speech-to-text, text-to-speech, and real-time translation of speech in multiple languages. You can customize speech models for specific acoustic environments, like a factory floor or background road noise, and to recognize and pronounce jargon; we’ll look at how to do that in the next chapter. Or you can recognize specific speakers or even use voice authentication for access and security with speaker identification and speaker verification. Speech services are available through the Speech SDK, the Speech Devices SDK, or REST APIs.
Using the Azure speech recognition tools requires working with the Cognitive Services Speech SDK. The following snippet of code loads a speech recognizer, looking for user intent in their utterances, using LUIS as a backend to the recognition process. Here we’re controlling a basic home automation application, looking to turn a service on and off. The app will take the first submission from the user and use this to drive our hypothetical backend service. Finally, we check if an intent is recognized, or if valid speech is detected, before failing or cancelling operations:
import azure.cognitiveservices.speech as speechsdk

print("Say something...")

intent_config = speechsdk.SpeechConfig(
    subscription="YourLanguageUnderstandingSubscriptionKey",
    region="YourLanguageUnderstandingServiceRegion")

intent_recognizer = speechsdk.intent.IntentRecognizer(speech_config=intent_config)

model = speechsdk.intent.LanguageUnderstandingModel(app_id="YourLanguageUnderstandingAppId")

intents = [
    (model, "HomeAutomation.TurnOn"),
    (model, "HomeAutomation.TurnOff"),
]
intent_recognizer.add_intents(intents)

intent_result = intent_recognizer.recognize_once()

if intent_result.reason == speechsdk.ResultReason.RecognizedIntent:
    print("Recognized: \"{}\" with intent id '{}'".format(intent_result.text, intent_result.intent_id))
elif intent_result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print("Recognized: {}".format(intent_result.text))
elif intent_result.reason == speechsdk.ResultReason.NoMatch:
    print("No speech could be recognized: {}".format(intent_result.no_match_details))
elif intent_result.reason == speechsdk.ResultReason.Canceled:
    print("Intent recognition canceled: {}".format(intent_result.cancellation_details.reason))
Speech-to-text
Transcription used to require hours of time and specialized equipment for a trained human to turn speech into text, using a system that’s more like drawing gestures than normal typing. It was expensive, and even commercial services don’t always reach the 95% accuracy of the best human transcription.
Azure’s speech-to-text tools work with real-time streamed audio data or prerecorded audio files. A single subscription covers all the Cognitive Services speech services, so you get access to translation and text-to-speech alongside the speech-to-text services.
The core speech-to-text service delivers real-time transcriptions using the same technology as Teams and Word, so it’s been proven in a wide range of conditions with many accents and in multiple languages. Turn to Chapter 11 to see how it’s used alongside speech translation in some very large organizations.
While you can specify the language to use, which may give more accurate recognition, the service default is a universal model with automatic language detection that works well in most situations. The list of supported languages is long and continues to grow, covering most European languages, Arabic, Thai, Chinese, and Japanese. Not all languages have the same level of available customization, but even without customizing the language model you’re using, you should be able to get acceptable results in office or home applications.
Speech-to-text is available through a set of SDKs and REST APIs. As the service is primarily intended to be used with streamed data, it’s easiest to use the SDKs, as these give you direct access to audio streams, including device microphones and local audio recording files. The REST APIs are useful for quick speech commands, adding speech controls to mobile apps or the web. If you’ve built custom language understanding models in LUIS, you can use these in conjunction with Speech Services to extract the speaker intent, making it easier to deliver what your user is asking for.
Calls to the SpeechRecognizer are run using asynchronous connections to Azure, handling connections to device microphones in the SDK and recognizing data until a set amount of silence is found. Calls can send either short speech or long utterances for recognition, and the transcribed text is delivered once the asynchronous process is complete. The SDK returns recognized speech as a string, with error handling for failed recognitions.
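If all you need is a transcription, a minimal sketch with the same SDK looks like this (the key and region values are placeholders):

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YourSpeechKey",
                                       region="YourServiceRegion")
# With no audio config specified, the recognizer listens on the default microphone.
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)

print("Say something...")
result = speech_recognizer.recognize_once()  # listens until a pause in speech

if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print("Recognized: {}".format(result.text))
elif result.reason == speechsdk.ResultReason.NoMatch:
    print("No speech could be recognized: {}".format(result.no_match_details))
elif result.reason == speechsdk.ResultReason.Canceled:
    print("Recognition canceled: {}".format(result.cancellation_details.reason))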
Text-to-speech
Speech synthesis is useful for industrial settings where users might not be able to look at a device screen—or might be wearing a HoloLens. It’s also important for accessibility. Increasingly, it’s also used to give products and services a recognizable voice for chatbots and other ways consumers interact with brands.
The text-to-speech services convert text into synthesized speech that’s natural and sounds near human. You can pick from a set of standard and higher-quality “neural” voices, or if you want to express your brand’s personality you can create your own voices.
Currently, more than 75 standard voices are available, in over 45 languages and locales. If you want to experiment with the new neural synthesized voices, you can choose between five options in four languages and locales.
Neural text-to-speech is a powerful new improvement over standard speech synthesis, offering human-sounding inflection and articulation and making computer-generated speech less tiring to listen to. It’s ideal if you’re using speech to deliver long-form content—for example, narrating scenes for the visually impaired or when generating audiobooks from web content. It’s also a useful tool when you’re expecting a lot of human interaction, for high-end chatbots or for virtual assistants. Built using deep neural networks, neural voices synthesize speech and apply patterns of stress and intonation to that speech in a single step, which makes the generated speech sound much more fluent and natural.
Standard speech synthesis supports many more languages, but it’s clearly artificial. You can experiment to find the right set of parameters to give it the feel you want, tuning speed, pitch, and other settings—including adding pauses to give a more natural feel.
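As a rough sketch of speech synthesis with the same SDK (the key, region, and voice name are placeholders; check the documentation for the currently available voices):

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YourSpeechKey",
                                       region="YourServiceRegion")
# An example neural voice name, not a recommendation.
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

# With no audio config specified, output plays through the default speaker.
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async(
    "Your order has shipped and should arrive on Tuesday.").get()

if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Speech synthesized.")
elif result.reason == speechsdk.ResultReason.Canceled:
    print("Synthesis canceled: {}".format(result.cancellation_details.reason))

For the speed, pitch, and pause tuning mentioned above, you’d pass SSML markup to speak_ssml_async instead of plain text.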
To get the most out of text-to-speech, you’ll probably be using it in an app that calls multiple Cognitive Services: perhaps using speech recognition to extract requests from a user, passing them through LUIS to generate intents that can be used in an application, and then delivering responses using neural voices or your own custom voice. If you offer that to people as a chatbot, consider using the Azure Bot Service, which offers an integrated experience for working with those services together.
Use the Speech SDK from C# (using the .NET Standard-based SDK that works on Windows, Linux, macOS, Mono, Xamarin, UWP, and Unity), C++, Java, Go, Python, Objective-C/Swift, or JavaScript to give your applications access to speech-to-text, text-to-speech, speech translation, and intent recognition. Some versions of the SDK support different features, and you may find that more complex operations require using direct access to the Speech Service APIs.
Translation and unified speech
One of the first deep learning services that Microsoft demonstrated, when today’s Cognitive Services were still just projects inside Microsoft Research, was the real-time speech translation tools. Using a modified version of Skype, an English speaker could communicate with a Chinese speaker in real time, using subtitles to translate the conversation.
Now those translation services have gone from research to product to service—for example, in the Microsoft Translator app as shown in Figure 4-6—and speech translation SDKs in Speech Services let you add real-time translation services to your C#, C++, and Java applications. Using neural machine translation techniques rather than the traditional statistical approach, this delivers much higher-quality translations using a large training set of millions of translated sentences.
The speech translation tool uses a four-step process, starting with speech recognition to convert spoken words into text. The transcribed text is then passed through a TrueText engine to normalize the speech and make it more suitable for translation. Next, the text is passed through the machine translation tools using conversation-optimized models, before being delivered as text or processed into voice through the Speech Services text-to-speech tools. The actual translation is done by the Translator Text API, which we covered in detail in “Translator”.
The speech translation tools work in a similar fashion to the standard speech recognition tools, using a TranslationRecognizer object to work with audio data. By default, it uses the local microphone, though you can configure it to use alternative audio sources. To make a translation, you set both source and target languages, using the standard Windows language types (even if your app doesn’t run on Windows).
You’ll need to install the Azure Speech Services SDK to work with the translation tools, storing your key and region details as environment variables. For Python, use:
pip install azure-cognitiveservices-speech
With that in place, set the from and to languages, then run a speech recognizer and convert speech to translated text. The sample code here uses your PC’s microphone to translate from French to Brazilian Portuguese. You can choose multiple target languages, especially if you’re serving a diverse group of users. The translated text can be delivered to a speech synthesizer if necessary:
import os
import azure.cognitiveservices.speech as speechsdk

speech_key, service_region = os.environ['SPEECH__SERVICE__KEY'], os.environ['SPEECH__SERVICE__REGION']
from_language, to_language = 'fr', 'pt-br'

def translate_speech_to_text():
    translation_config = speechsdk.translation.SpeechTranslationConfig(
        subscription=speech_key, region=service_region)

    translation_config.speech_recognition_language = from_language
    translation_config.add_target_language(to_language)

    recognizer = speechsdk.translation.TranslationRecognizer(
        translation_config=translation_config)

    print('Say something...')
    result = recognizer.recognize_once()
    print(get_result_text(reason=result.reason, result=result))

def get_result_text(reason, result):
    reason_format = {
        speechsdk.ResultReason.TranslatedSpeech:
            f'RECOGNIZED "{from_language}": {result.text}\n' +
            f'TRANSLATED into "{to_language}": {result.translations[to_language]}',
        speechsdk.ResultReason.RecognizedSpeech: f'Recognized: "{result.text}"',
        speechsdk.ResultReason.NoMatch: f'No speech could be recognized: {result.no_match_details}',
        speechsdk.ResultReason.Canceled: f'Speech Recognition canceled: {result.cancellation_details}'
    }
    return reason_format.get(reason, 'Unable to recognize speech')

translate_speech_to_text()
Translations are delivered as events, so your code needs to subscribe to the response stream. The streamed data can either be displayed as text in real time, or you can use it to produce a synthesized translation, using neural speech if available. By working with APIs, you can produce in a few lines of code what would have been a large project if implemented from scratch. Similarly, Azure’s cloud pricing model means it’s economical to add speech to applications where you wouldn’t have considered expensive client-side translation services.
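Here's a hedged sketch of that event-driven pattern, reusing the key, region, and language settings from the previous example:

import time

# Reuse speech_key, service_region, from_language, and to_language from above.
translation_config = speechsdk.translation.SpeechTranslationConfig(
    subscription=speech_key, region=service_region)
translation_config.speech_recognition_language = from_language
translation_config.add_target_language(to_language)

recognizer = speechsdk.translation.TranslationRecognizer(
    translation_config=translation_config)

def handle_translation(evt):
    # Fires each time a final recognized phrase (with its translations) is ready.
    for language in evt.result.translations:
        print('[{}] {}'.format(language, evt.result.translations[language]))

recognizer.recognized.connect(handle_translation)
recognizer.start_continuous_recognition()

time.sleep(30)  # keep listening for 30 seconds in this sketch
recognizer.stop_continuous_recognition()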
The custom translation models we looked at as part of the language APIs are also available for translating speech.
Vision
Want to know what’s in an image or a video? The different vision APIs and services can recognize faces, emotions and expressions, objects and famous landmarks, scenes and activities, or text and handwriting. You can get all the power of a fully trained image recognition deep learning network, and then you can customize it to recognize the specific objects you need with only a few dozen examples. Use that to find patterns that can help diagnose plant disease, or classify the world and narrate it to the blind, or generate metadata summaries that can automate image archiving and retrieval.
The vision APIs and technologies in Cognitive Services are the same as those that power Bing image search, OCR text from images in OneNote, and video indexing in Microsoft Stream. They provide endpoints that take image data and return labeled content you can use in your app, whether that’s the text in a menu, the expression on someone’s face, or a description of what’s going on in a video.
As well as the APIs, there are also SDKs for many popular platforms. If you’re using custom machine learning tools and analytical frameworks like Anaconda or Jupyter Notebooks, there’s support for Python. Windows developers can access the Computer Vision service from .NET, JavaScript via Node.js, Android with Java, and iOS using Swift, and there’s Go support for systems programming.
Behind the APIs are a set of deep neural networks trained to perform functions like image classification, scene and activity recognition, celebrity and landmark recognition, OCR, and handwriting recognition.
Many of the computer vision tasks are provided by a single API namespace, Analyze Image, which supports the most common image recognition scenarios. When you make a call to the different endpoints in the API namespace, the appropriate neural network is used to classify your image. In some cases, this may mean the image passes through more than one model, first to recognize an object and then to extract additional information. That way you can use a picture of the shelves in a supermarket to identify not only the packaging types on display but also the brands being sold and even whether the specific products are laid out in the right order on the shelf (something that’s time-consuming and expensive to audit manually).
The Analyze Image API attempts to detect and tag various visual features, marking detected objects with a bounding box. Use the Vision API in the sample Cognitive Services kiosk to experiment with the API, as in Figure 4-7. Those features include:
- Tagging visual features
- Detecting objects
- Detecting brands
- Categorizing images
- Describing images
- Detecting image types
- Detecting domain-specific content
- Detecting color schemes
- Generating thumbnails
- Detecting areas of interest
You can call the Analyze Image endpoint to group many of these tasks together—for example, extracting tags, detecting objects and faces—or you can call those features individually by using their specific endpoints. Other operations, like generating thumbnails, require calling the task-specific endpoint.
For more advanced requirements than simple face recognition in the Computer Vision API, use the separate Face API to compare two faces, to search by face for images of the same person in an archive, or to compare a selfie to a set of stored images to identify someone by their face instead of a password. When you want to understand movements and presence in a physical space, the Spatial Analysis APIs ingest video from CCTV or industrial cameras, detect and track people in the video as they move around, and generate events as they interact with the regions of interest you set in the space. You can use this to count the number of people entering a space, see how quickly they move through an area, or track compliance with guidelines for social distancing and mask wearing.
Warning
It’s particularly important to use face recognition, spatial analysis, and video analysis services responsibly: check out the guidance in Chapter 7 for how to approach this.
To get started with Computer Vision, download the SDK using pip. You’ll also need the pillow image processing library:

pip install --upgrade azure-cognitiveservices-vision-computervision
pip install pillow
With the SDK and required components in place, you can start to write code. First import libraries, and then add your key and endpoint URL, before authenticating with the service:
from azure.cognitiveservices.vision.computervision import ComputerVisionClient
from azure.cognitiveservices.vision.computervision.models import OperationStatusCodes
from azure.cognitiveservices.vision.computervision.models import VisualFeatureTypes
from msrest.authentication import CognitiveServicesCredentials

from array import array
import os
from PIL import Image
import sys
import time

subscription_key = "PASTE_YOUR_COMPUTER_VISION_SUBSCRIPTION_KEY_HERE"
endpoint = "PASTE_YOUR_COMPUTER_VISION_ENDPOINT_HERE"

computervision_client = ComputerVisionClient(endpoint, CognitiveServicesCredentials(subscription_key))
With this in place, you’re now ready to analyze an image. We’ll use an image URL as a start:
remote_image_url = "INSERT_IMAGE_URL_HERE"
Our application will use the object detection feature of the Computer Vision API. Once the image has been processed, it will display details of what has been detected and where. You could use this data to quickly add overlay boxes and captions to an image:
print("===== Detect Objects - remote =====") detect_objects_results_remote = computervision_client.detect_objects(remote_image_url) print("Detecting objects in remote image:") if len(detect_objects_results_remote.objects) == 0: print("No objects detected.") else: for object in detect_objects_results_remote.objects: print("object at location {}, {}, {}, {}".format( \ object.rectangle.x, object.rectangle.x + object.rectangle.w, \ object.rectangle.y, object.rectangle.y + object.rectangle.h))
Tip
Most of the computer vision tools return machine-readable information, but sometimes you need text that can be used as a caption or read out in an audio description. Call this capability either with the /analyze endpoint or the standalone /describe endpoint. Descriptions are returned as a JSON document, ordered by confidence, along with associated tags that can be used for extra context.
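For example, with the client created earlier, a request for captions might look like this sketch (max_candidates just caps how many captions come back):

# Ask the service to describe the same remote image with up to three captions.
description_results = computervision_client.describe_image(remote_image_url, max_candidates=3)

print("Captions for the remote image:")
if len(description_results.captions) == 0:
    print("No description detected.")
else:
    for caption in description_results.captions:
        print("'{}' with confidence {:.2f}%".format(caption.text, caption.confidence * 100))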
When you request tags for an image using Image Analysis, the data returned is a word list you can use to classify images, like making a gallery of all the images in a set that contain car parts, or are taken outdoors. By providing multiple tags for an image, you can create complex indexes for your image sets that can then be used to describe the scene depicted, or find images of specific people, objects, or logos in an archive.
We can add the following snippet to our object recognition code to generate a list of object tags along with their confidence level:
print("===== Tag an image - remote =====") # Call API with remote image tags_result_remote = computervision_client.tag_image(remote_image_url ) # Print results with confidence score print("Tags in the remote image: ") if (len(tags_result_remote.tags) == 0): print("No tags detected.") else: for tag in tags_result_remote.tags: print("'{}' with confidence {:.2f}%".format(tag.name, tag.confidence * 100))
To use this API, you upload a still image or provide an image URL. The API returns a JSON document that contains recognized objects, along with a confidence value you can use as a cutoff to define when to apply a tag (or when to show the tag to your users). Pick a high threshold to avoid false positives and poor matches cluttering up tags and search results.
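For example, a simple cutoff over the tag results from the earlier snippet might look like this; the 0.9 threshold is just an illustrative starting point to tune against your own data:

# Keep only the tags the service is at least 90% confident about.
confident_tags = [tag.name for tag in tags_result_remote.tags if tag.confidence > 0.9]
print("High-confidence tags:", confident_tags)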
Object detection also takes an image or URL; it returns the bounding box coordinates for objects and the relationship between them: whether a “tree” is next to a “house” or a “car” is in front of a “truck.” The brand detection API is a specialized version for product logos. If you want to improve recognition for specific classes of images, you can train a custom vision model: we cover the steps for doing that in the next chapter.
Image categorization is a much higher-level approach than the other image classification tools: it’s useful for filtering a large image set to see if a picture is even relevant and whether you should be using more complex algorithms. Similarly, the Analyze Image API can tell you if an image is a photo, clip art, or line art.
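Here's a hedged sketch of that kind of high-level triage, reusing the client and the VisualFeatureTypes import from the earlier snippets:

# Request just the high-level categories and the image-type analysis.
analysis = computervision_client.analyze_image(
    remote_image_url,
    visual_features=[VisualFeatureTypes.categories, VisualFeatureTypes.image_type])

for category in analysis.categories:
    print("Category '{}' with confidence {:.2f}%".format(category.name, category.score * 100))

# A value of 0 means "not clip art" or "not a line drawing"; higher values mean
# the service is more convinced the image is non-photographic.
print("Clip art type: {}".format(analysis.image_type.clip_art_type))
print("Line drawing type: {}".format(analysis.image_type.line_drawing_type))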
Decision Making
Need to detect problems and get warnings when something starts to go wrong? You can use the Anomaly Detector API for spotting fraud, telling when the sensor in an IoT device is failing, catching changing patterns in services or user activity, detecting an outage as it starts, or even looking for unusual patterns in financial markets. This is the anomaly detection Microsoft uses to monitor dozens of its own cloud services, so it can handle very large-scale data.
Designed to work with real-time or historical time series data, using individual or groups of metrics from multiple sensors, the API determines whether a data point is an anomaly and whether it needs to be delivered as an alert, without you needing to provide labeled data.
If you’re using Python with the anomaly detector, you’ll also need to install the Pandas data analysis library; use pip to install it along with the Azure Anomaly Detector SDK. The code snippet here uses local environment variables to store your keys and endpoint data, so create these before running the application. You’ll also need a CSV file with time series data, and to give your code the path to that data.
This code will analyze a set of time series data with a daily granularity, looking for anomalies in the data. It will then indicate where in the file an anomaly was found, allowing you to pass the data on for further analysis, alerting those responsible for the equipment or service generating the data:
import os
from azure.ai.anomalydetector import AnomalyDetectorClient
from azure.ai.anomalydetector.models import DetectRequest, TimeSeriesPoint, TimeGranularity, AnomalyDetectorError
from azure.core.credentials import AzureKeyCredential
import pandas as pd

SUBSCRIPTION_KEY = os.environ["ANOMALY_DETECTOR_KEY"]
ANOMALY_DETECTOR_ENDPOINT = os.environ["ANOMALY_DETECTOR_ENDPOINT"]
TIME_SERIES_DATA_PATH = os.path.join("./sample_data", "request-data.csv")

client = AnomalyDetectorClient(AzureKeyCredential(SUBSCRIPTION_KEY), ANOMALY_DETECTOR_ENDPOINT)

series = []
data_file = pd.read_csv(TIME_SERIES_DATA_PATH, header=None, encoding='utf-8', parse_dates=[0])
for index, row in data_file.iterrows():
    series.append(TimeSeriesPoint(timestamp=row[0], value=row[1]))

request = DetectRequest(series=series, granularity=TimeGranularity.daily)

print('Detecting anomalies in the entire time series.')

try:
    response = client.detect_entire_series(request)
except AnomalyDetectorError as e:
    print('Error code: {}'.format(e.error.code), 'Error message: {}'.format(e.error.message))
except Exception as e:
    print(e)

if any(response.is_anomaly):
    print('An anomaly was detected at index:')
    for i, value in enumerate(response.is_anomaly):
        if value:
            print(i)
else:
    print('No anomalies were detected in the time series.')
The Personalizer service uses reinforcement learning to pick what product to recommend to online shoppers, what content to prioritize for a specific visitor, or where to place an ad. It can work with text, images, URLs, emails, chatbot responses, or anything where there’s a short list of actions or items to choose from, enough contextual information about the content to use for ranking, and enough traffic for the service to keep learning from. Every time a Personalizer pick is shown, the service gets a reward score between 0 and 1 based on how the shopper or reader reacted: did they click the link, scroll to the end, buy the product, pick something different, or look around and then choose what was offered? That score is used to improve the already-trained model. We’ll see the Personalizer service in action in Chapter 12, where it powers recommendations in an online marketplace.
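As a rough sketch of that rank-and-reward loop with the Personalizer SDK (the endpoint, key, actions, and context features here are invented for illustration):

from azure.cognitiveservices.personalizer import PersonalizerClient
from azure.cognitiveservices.personalizer.models import RankableAction, RankRequest
from msrest.authentication import CognitiveServicesCredentials

client = PersonalizerClient(
    "PASTE_YOUR_PERSONALIZER_ENDPOINT_HERE",
    CognitiveServicesCredentials("PASTE_YOUR_PERSONALIZER_KEY_HERE"))

# A short list of candidate items, each with features the model can learn from.
actions = [
    RankableAction(id="blog-post", features=[{"topic": "ai", "length": "short"}]),
    RankableAction(id="case-study", features=[{"topic": "retail", "length": "long"}]),
]
context = [{"device": "mobile", "timeOfDay": "evening"}]

rank_response = client.rank(RankRequest(actions=actions, context_features=context))
print("Show:", rank_response.reward_action_id)

# Later, report how the user reacted: 1.0 if they engaged, 0.0 if they ignored it.
client.events.reward(rank_response.event_id, value=1.0)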
Content moderation
Whether you want to keep a chat room family friendly or make sure your ecommerce site doesn’t offer products with unfortunate or offensive phrases printed on them, content moderation services can help. The Image and Video Indexer APIs can detect adult or “racy” content that might not be suitable for your audience. There’s also an image moderation tool that can spot images that might be offensive or unpleasant, including using OCR to look for offensive language.
Images and video are uploaded to the service and passed to the Analyze Image API. Two Booleans are returned: isAdultContent and isRacyContent, along with confidence scores.
Start by installing the content moderator library via pip:
pip install --upgrade azure-cognitiveservices-vision-contentmoderator
You can now start to build a service that works with Azure to moderate content on your site. Here we’re providing a list of images to check for identifiable faces:
import os.path
from pprint import pprint
import time
from io import BytesIO
from random import random
import uuid

from azure.cognitiveservices.vision.contentmoderator import ContentModeratorClient
from azure.cognitiveservices.vision.contentmoderator.models import FoundFaces
from msrest.authentication import CognitiveServicesCredentials

CONTENT_MODERATOR_ENDPOINT = "PASTE_YOUR_CONTENT_MODERATOR_ENDPOINT_HERE"
subscription_key = "PASTE_YOUR_CONTENT_MODERATOR_SUBSCRIPTION_KEY_HERE"

client = ContentModeratorClient(
    endpoint=CONTENT_MODERATOR_ENDPOINT,
    credentials=CognitiveServicesCredentials(subscription_key)
)

IMAGE_LIST = [
    "image_url_1",
    "image_url_2"
]

for image_url in IMAGE_LIST:
    print("\nEvaluate image {}".format(image_url))

    print("\nDetect faces.")
    evaluation = client.image_moderation.find_faces_url_input(
        content_type="application/json",
        cache_image=True,
        data_representation="URL",
        value=image_url
    )
    assert isinstance(evaluation, FoundFaces)
    pprint(evaluation.as_dict())
Content moderation isn’t only for images; it can also work with text. Text moderation finds more than adult or racy content: it looks for offensive language, including terms that are misspelled (perhaps deliberately, to evade moderation), and it scans for personally identifiable information (PII) that’s subject to regulation in many jurisdictions. You can also add custom terms: for example, if you don’t want to include posts that mention competing brands. You create the API wrapper in the same way as for images. The more comprehensive Azure Content Moderator service includes custom lists for content that’s often submitted that you don’t need to classify every time and can reject straight away.
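Building on the client and imports from the image moderation snippet above, a hedged sketch of screening a piece of text might look like this; the sample text is invented, and you should check the SDK reference for the full set of options:

# Screen a short piece of text for profanity, possible PII, and classification scores.
sample_text = b"Contact me at someone@example.com, this product is garbage!"

screen = client.text_moderation.screen_text(
    text_content_type="text/plain",
    text_content=BytesIO(sample_text),
    language="eng",
    autocorrect=True,
    pii=True,
    classify=True
)
# The response includes any matched terms, detected PII (such as email addresses),
# and classification scores you can use to decide whether to hold a post for review.
pprint(screen.as_dict())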
Tip
Which Cognitive Services decision models can you customize?
- Metrics Advisor (you must be logged in to an Azure account to open its portal)
- Personalizer: customize your model in the Azure portal under Personalizer.
Wrapping It Up
In this chapter, we’ve looked at what you can achieve with the Azure Cognitive Services with prebuilt or customized models that you call through APIs or SDKs—but we’ve looked at them as separate options, and that may not be what you’ll want to do in a real application. Individual Cognitive Services are powerful, but often you will want to combine multiple Cognitive Services to handle broader scenarios. You can do that yourself in code, but there are some services that developers use together so commonly that Microsoft has bundled them into Applied AI Services. Read on to learn what you can do with them.
1 If you want more details about the different Cognitive Services and how to use them, see the online documentation or check out our previous book, Building Intelligent Apps with Cognitive APIs.