Chapter 4. Speech Recognition Technology
WE’VE TALKED ABOUT MANY of the crucial voice user interface (VUI) design elements. So far, they’ve been light on the technical details of speech recognition technology itself. This chapter gets more technical, looking under the hood so that you can make sure your VUI design takes into account (and takes advantage of) the technology itself. It will also give you the ability to confidently reference the underlying technology when explaining your design decisions.
To create a VUI, your app must have one key component: automated speech recognition (ASR). ASR refers to the technology by which a user speaks, and their speech is then translated into text.
Choosing an Engine
So how to choose your ASR tool? There are free services as well as those that require licensing fees. Some offer free use for development but require payment for commercial use.
As of this writing, there are two major fee-based speech recognition engines: Google and Nuance. Other options in this space include Microsoft’s Bing and iSpeech.
Free ASR tools include the Web Speech API, Wit.ai, Sphinx (from Carnegie Mellon), and Kaldi. Amazon has its own tool, but at the moment it can be used only when creating skills for the Amazon Echo (which is free).
Wikipedia has a much more detailed list, which you can access at https://en.wikipedia.org/wiki/List_of_speech_recognition_software.
Some companies offer multiple engines, as well; for example, Nuance has different offerings depending on what ...
Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Read now
Unlock full access