Chapter 5. Speech

Speech recognition has long been one of the more complex problems in computer science, but years of research and recent breakthroughs with deep neural networks have turned it from a research problem into a set of easy-to-use services. The first successful use of deep learning in place of traditional speech recognition algorithms was funded by Microsoft Research. In 2017, a system built by Microsoft researchers outperformed not just individual transcribers but also a more accurate multitranscriber process at transcribing recorded phone conversations.

The Cognitive Services Speech Services are built upon these innovations: they provide a set of pretrained speech APIs that work across multiple speakers and many different languages. Add them to your code and you’re using the same engines that power Microsoft’s own services, from Skype’s real-time translation tools to PowerPoint’s live captioning.

The Speech Services include speech-to-text, text-to-speech, voice identification, and real-time translation capabilities. Combined, these features make it easy to add natural interaction to your apps and let your users communicate in whatever way they find convenient.

The services are available through the Speech SDK, the Speech Devices SDK, or REST APIs. These cloud APIs enable speech-to-text translation in just a few lines of code, making it economical to add these capabilities in applications where client-side translation services would have been considered too ...
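To give a feel for what "a few lines of code" can look like over REST, here is a minimal sketch in Python using only the standard library. It is an illustration under stated assumptions rather than the book's own sample. The endpoint path, the `Ocp-Apim-Subscription-Key` header, and the `DisplayText` response field follow the short-audio speech-to-text REST API as we understand it, so check them against the current service documentation. The function names, the region, and the key are hypothetical placeholders.

```python
import json
import urllib.parse
import urllib.request


def build_recognition_request(region, key, wav_bytes, language="en-US"):
    """Build (but do not send) a speech-to-text REST request.

    Assumption: the URL shape and header names match the short-audio
    REST API; verify them against the current documentation.
    """
    query = urllib.parse.urlencode({"language": language})
    url = (
        f"https://{region}.stt.speech.microsoft.com"
        f"/speech/recognition/conversation/cognitiveservices/v1?{query}"
    )
    headers = {
        # Subscription key for your Speech resource (placeholder value).
        "Ocp-Apim-Subscription-Key": key,
        # Assumes 16 kHz, 16-bit mono PCM audio in a WAV container.
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, data=wav_bytes, headers=headers, method="POST")


def transcribe(region, key, wav_path):
    """Send a short WAV file to the service and return the recognized text."""
    with open(wav_path, "rb") as audio_file:
        request = build_recognition_request(region, key, audio_file.read())
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumption: the simple response format carries the text in "DisplayText".
    return body.get("DisplayText", "")
```

With a valid key and region, a call such as `transcribe("westus", "<your-key>", "hello.wav")` would upload the audio and return the transcript. Keeping request construction separate from the network call makes the sketch easy to inspect without credentials.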
