Acoustic discs suspended from the roof of London's Royal Albert Hall. (source: Egghead06 on Wikimedia Commons)

In this episode of the O’Reilly Data Show, I spoke with Yishay Carmiel, president of Spoken Labs. As voice becomes a common user interface, the need for accurate and intelligent speech technologies has grown. And although computer vision is a common entry point for deep learning, some of the most interesting commercial applications of deep neural networks are in speech recognition. Carmiel has spent several years building commercial speech applications, and along the way he has witnessed (and helped architect) massive improvements in speech technologies.

Here are some highlights from our conversation:

Speech recognition

Speech recognition is divided into three components. You have the signal part—when you have a speech signal, you're trying to extract speech segments because the signal can be noisy. So, you try to extract speech segments and extract features, like in every machine learning problem. Then we have in speech what we call the acoustic level, where we try to classify these features into different types of sounds. Also, we assume that if a person is saying something, he's saying words, and words come together into sentences. This is the language level—we take these sounds and combine them into words, and then combine the words together into some kind of a sentence.

For all three of these levels, right now, a lot of the algorithms are based on deep learning techniques. That's why if you want to say, "I'm using deep learning for speech," it's not just using a single algorithm; it's using a variety of algorithms across all of these levels.
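The three levels Carmiel describes—signal, acoustic, and language—can be illustrated with a toy sketch. Nothing here comes from the interview: the log-energy feature, the `sound_models` means, and the one-word `lexicon` are all made-up stand-ins for the real (deep-learning-based) components at each level.

```python
import math

def frame_features(signal, frame_len=4):
    """Signal level: slice the waveform into frames and extract one
    feature per frame (here, just log-energy)."""
    frames = [signal[i:i + frame_len] for i in range(0, len(signal), frame_len)]
    return [math.log(sum(x * x for x in f) + 1e-9) for f in frames]

def acoustic_level(features, sound_models):
    """Acoustic level: classify each frame feature as the nearest sound."""
    return [min(sound_models, key=lambda s: abs(sound_models[s] - f))
            for f in features]

def language_level(sounds, lexicon):
    """Language level: collapse repeated sounds and look the sequence
    up as a word."""
    collapsed = [s for i, s in enumerate(sounds) if i == 0 or s != sounds[i - 1]]
    return lexicon.get(tuple(collapsed), "<unk>")

# Hypothetical models: a mean log-energy per sound, and a one-word lexicon.
sound_models = {"h": -2.0, "i": 2.0}
lexicon = {("h", "i"): "hi"}

signal = [0.1, -0.1, 0.1, -0.1, 2.0, -2.0, 2.0, -2.0]  # quiet frame, then loud
feats = frame_features(signal)
sounds = acoustic_level(feats, sound_models)
print(language_level(sounds, lexicon))  # prints "hi"
```

In a real system, each placeholder is replaced by a learned model—spectral features at the signal level, a neural acoustic model, and a neural language model—which is why "deep learning for speech" means a stack of networks rather than a single one.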

… Most speech recognition systems now rely on deep learning. In the speech community, there is a well-known database called Switchboard. Before deep learning, I think the word error rate was around 24%. Recently, IBM published a paper where the word error rate was below 7%. That’s almost a 75% reduction in the word error rate! That’s why nobody is even trying to use the old methodologies.
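Word error rate, the metric behind those Switchboard numbers, is the word-level edit distance (substitutions, insertions, deletions) divided by the length of the reference transcript. A minimal sketch—not the scoring tool used in the paper Carmiel mentions:

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Note that the improvement is relative: going from roughly 24% to below 7% WER cuts the errors by about three quarters, which is why the old methodologies were abandoned so quickly.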

Signal processing, dynamic programming, and search

Current speech recognition systems are very sensitive to noisy data—meaning that if I'm talking to my mobile phone from a very close microphone, the performance is amazing, but if I'm talking and the microphone is around 10 feet from me, the performance degrades rapidly. Among the tools people are using are advanced signal processing techniques. Recently, people have applied deep learning techniques to signal processing models to get better performance.

You need to get a lot of experience in order to understand speech recognition systems. One of the key challenges is that you start with sounds, and you have different sounds that combine into words, that combine into sentences. ... It becomes a very complicated dynamic programming task. … This is related to very large-scale search problems on graphs. … You need to master a lot of techniques in order to build a very good speech recognition system.
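The dynamic-programming search Carmiel alludes to can be illustrated with a toy Viterbi decoder: given per-frame observations, it finds the most likely sequence of hidden sounds. The two-state model below is invented for illustration; real decoders search enormous weighted graphs, but the recurrence is the same.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Dynamic-programming search for the most likely state (sound) sequence."""
    # V[t][s] = (best log-probability of any path ending in s at time t, backpointer)
    V = [{s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), None)
          for s in states}]
    for t in range(1, len(observations)):
        V.append({})
        for s in states:
            prev = max(states,
                       key=lambda p: V[t - 1][p][0] + math.log(trans_p[p][s]))
            V[t][s] = (V[t - 1][prev][0] + math.log(trans_p[prev][s])
                       + math.log(emit_p[s][observations[t]]), prev)
    # Backtrack from the best final state.
    best = max(states, key=lambda s: V[-1][s][0])
    path = [best]
    for t in range(len(observations) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

# Hypothetical two-sound model: "A" tends to emit feature "x", "B" emits "y".
states = ["A", "B"]
start_p = {"A": 0.9, "B": 0.1}
trans_p = {"A": {"A": 0.6, "B": 0.4}, "B": {"A": 0.3, "B": 0.7}}
emit_p = {"A": {"x": 0.8, "y": 0.2}, "B": {"x": 0.1, "y": 0.9}}
print(viterbi(["x", "x", "y"], states, start_p, trans_p, emit_p))  # ['A', 'A', 'B']
```

Scaling this recurrence from two states to the composed graph of sounds, words, and sentences is exactly the large-scale graph search problem described above.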

Natural language processing and understanding

When I'm thinking about speech recognition, my main aim is to understand the meaning of what people are saying. People are interacting with other people or with machines through voice. If I just transform everything to text, I won't be able to do much with that.

When you are building an AI system, you have two levels. You have what we call machine perception, which is building the infrastructure, like speech recognition or some kind of computer vision or video analysis. We also have what we call machine cognition, which is the reasoning. … One of the problems of applying NLP techniques to speech is that most NLP algorithms are trained on text corpora, and people talk differently than they write. Basically, all of the NLP models need to be handcrafted or redesigned for speech recognition systems.

Related resources: