CHAPTER 25


STATISTICAL SEQUENCE RECOGNITION

25.1 INTRODUCTION

In Chapter 24 we showed how temporal integration of local distances between acoustic frames could be accomplished efficiently by dynamic programming. This approach not only integrates the matches between incoming speech and representations of the speech used in training, but it also normalizes time variations for speech sounds. In the case of continuous speech, this approach also effectively segments the speech as part of the recognition search, without the need for any explicit segmentation stage. Distances can also be modified to reflect the relative significance of different signal properties for classification.
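To make the dynamic programming idea concrete, the following is a minimal sketch (not taken from the text) of a dynamic time warping alignment, assuming frame-level feature vectors and a Euclidean local distance; the function and variable names are illustrative only.

import numpy as np

def dtw_distance(template, test):
    """Accumulate local frame distances along the best warping path."""
    T, U = len(template), len(test)
    D = np.full((T + 1, U + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, U + 1):
            local = np.linalg.norm(template[i - 1] - test[j - 1])
            # Usual step pattern: diagonal, horizontal, and vertical moves
            D[i, j] = local + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[T, U]

# Hypothetical usage: an isolated-word recognizer in miniature, comparing a
# test utterance against stored word templates and choosing the closest one.
rng = np.random.default_rng(0)
templates = {"yes": rng.normal(size=(40, 13)), "no": rng.normal(size=(35, 13))}
test = rng.normal(size=(38, 13))
best_word = min(templates, key=lambda w: dtw_distance(templates[w], test))

The nested minimization is what performs the temporal integration and time normalization described above: differing utterance lengths are absorbed by the warping path rather than by any explicit resampling of the signal.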

However, the DTW-based sequence-recognition approaches described in the last chapter have a number of limitations. As noted previously, comparing templates requires end-point detection, which can be quite error prone under realistic acoustic conditions. Although in principle the distances can be defined to correspond to any optimization criterion, without a strong mathematical structure it is difficult to show how an arbitrary local distance criterion affects the global error. And since continuous speech is more than a simple concatenation of individual linguistic elements (e.g., words or phones), we need a mechanism to represent the dependence of each sound or category on its neighboring context. ...
