CHAPTER 24

DETERMINISTIC SEQUENCE RECOGNITION FOR ASR

24.1 INTRODUCTION

In the past few chapters, we have established the basics for understanding the static pattern-classification aspect of speech recognition, which rests on two choices:

  1. Signal representation: in most ASR systems, some function of the local short-term spectrum is used. Typically, this consists of cepstral parameters corresponding to a smoothed spectrum. These parameters are computed every 10 ms or so from a Hamming-windowed speech segment that is 20–30 ms in length. Each of these temporal steps is referred to as a frame.
  2. Classes: in most current systems, the categories that are associated with the short-term signal spectra are phones or subphones, as noted in Chapter 23. In some systems, though, the classes simply consist of implicit categories associated with the training data.
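The framing step in item 1 can be made concrete with a minimal sketch. It assumes a 16 kHz sampling rate, a 25 ms window, and a 10 ms step; these values are illustrative, chosen from the ranges quoted above. The function names `hamming_window` and `frame_signal` are hypothetical, and the cepstral computation that would follow on each frame is omitted.

```python
import math

def hamming_window(n):
    """Return a Hamming window of length n: 0.54 - 0.46*cos(2*pi*k/(n-1))."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]

def frame_signal(samples, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Cut a signal into overlapping Hamming-windowed frames.

    frame_ms and hop_ms are assumed values (25 ms window, 10 ms step).
    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    window = hamming_window(frame_len)
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        segment = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(segment, window)])
        start += hop_len
    return frames

# Synthetic input: one second of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
signal = [math.sin(2.0 * math.pi * 440.0 * t / sample_rate)
          for t in range(sample_rate)]
frames = frame_signal(signal, sample_rate)
```

With these settings each frame holds 400 samples and successive frames start 160 samples apart, so neighboring frames overlap heavily. In a full front end, each windowed frame would then be converted to a feature vector such as cepstral parameters.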

Given these choices, one can use any of the techniques described in Chapter 8 to train deterministic classifiers (e.g., minimum distance, linear discriminant functions, or neural networks) that can classify signal segments into one of the classes. However, as noted earlier, speech recognition includes both pattern classification and sequence recognition; recognition of a string of linguistic units from the sequence of segment spectra requires finding the best match overall, not just locally. This would not be so much of a problem if the local match was always ...
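The frame-level classification mentioned above can be illustrated with a minimal minimum-distance sketch. Everything specific here is an assumption made for illustration: the two-dimensional feature vectors, the class labels, and the prototype values are invented, and the helper names are hypothetical. Real systems would use cepstral vectors of higher dimension and prototypes estimated from training data.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify_frame(features, prototypes):
    """Return the label whose prototype is nearest to this one frame."""
    return min(prototypes, key=lambda label: euclidean(features, prototypes[label]))

def classify_frames(frames, prototypes):
    """Label every frame independently of its neighbors (a purely local decision)."""
    return [classify_frame(f, prototypes) for f in frames]

# Hypothetical class prototypes (one mean vector per class).
prototypes = {
    "aa": [1.0, 0.0],
    "iy": [0.0, 1.0],
    "s": [-1.0, -1.0],
}

# Hypothetical feature vectors for four consecutive frames.
frames = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [-0.7, -1.2]]
labels = classify_frames(frames, prototypes)
```

Because each frame is labeled without reference to the others, a single noisy frame can produce a label that is inconsistent with its neighbors. That is the limitation of purely local matching that sequence recognition is meant to address.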
