January 2019
Intermediate to advanced
386 pages
11h 13m
English
In speech recognition, we want to output the words being spoken as text. We can do this by learning a time-dependent model that takes in a sequence of audio features, as described in the previous section, and outputs a sequential distribution of possible words being spoken. This model is called the acoustic model.
The acoustic model tries to model the likelihood that a sequence of audio features was generated by a sequence of words or phonemes: P (audio features | words) = P (audio features | phonemes) * P (phonemes | words).
A typical speech recognition acoustic model, before deep learning became popular, would use hidden Markov models (HMMs) to model the temporal variability of speech signals (http://mi.eng.cam.ac.uk/~mjfg/mjfg_NOW.pdf ...