January 2019
Intermediate to advanced
386 pages
11h 13m
English
Speech recognition tries to find the most probable word sequence given the acoustic observations:
transcription = argmax over words of P(words | audio features)
This probability is typically factored into separate models. By Bayes' rule, the normalizing term P(audio features) can be dropped, because it does not depend on the words and so does not change which word sequence scores highest:
P(words | audio features) ∝ P(audio features | words) * P(words)
                          ≈ P(audio features | phonemes) * P(phonemes | words) * P(words)
Each of these probability ...
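The factorization can be illustrated with a minimal Python sketch. It assumes a fixed set of candidate transcriptions whose three factor probabilities are already known. The candidate phrases, the numbers, and the helper names `log_score` and `decode` are hypothetical and chosen only for illustration; a real recognizer would obtain these scores from trained models and search a far larger hypothesis space.

```python
import math

# Hypothetical toy probabilities for two candidate transcriptions.
# The numbers are invented for illustration, not taken from any real model.
candidates = {
    "recognize speech": {
        "acoustic": 0.020,       # P(audio features | phonemes)
        "pronunciation": 0.90,   # P(phonemes | words)
        "language": 0.0010,      # P(words)
    },
    "wreck a nice beach": {
        "acoustic": 0.025,
        "pronunciation": 0.85,
        "language": 0.0001,
    },
}


def log_score(parts):
    """Sum the log-probabilities of the three factors.

    Working in log space turns the product into a sum and avoids
    numerical underflow when many small probabilities are multiplied.
    """
    return (
        math.log(parts["acoustic"])
        + math.log(parts["pronunciation"])
        + math.log(parts["language"])
    )


def decode(candidates):
    """Return the candidate word sequence with the highest combined score."""
    return max(candidates, key=lambda words: log_score(candidates[words]))


best = decode(candidates)
print(best)
```

In this toy setup the second candidate has the slightly higher acoustic score, but the language model term P(words) outweighs it, so the first candidate is selected.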