CHAPTER 23


LINGUISTIC CATEGORIES FOR SPEECH RECOGNITION

23.1 INTRODUCTION

In the past few chapters, we have introduced the fundamentals of feature extraction for ASR. The resulting features are gathered into feature vectors that are associated with linguistic categories during training, and then during recognition they are integrated over time to find the best linguistic sequence to assign to the observed sequence of feature vectors.
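The text does not name the search procedure that integrates frame scores over time. A common choice for this step is Viterbi decoding over a hidden Markov model, so the sketch below assumes that technique. Each feature vector is scored against a small set of linguistic categories (`log_emit`), transitions between categories are scored by `log_trans`, and dynamic programming recovers the single best category sequence. The function name `viterbi`, its parameters, the two toy categories, and all probabilities are hypothetical illustrations, not values from this chapter.

```python
import math
from typing import List, Sequence


def viterbi(log_init: Sequence[float],
            log_trans: Sequence[Sequence[float]],
            log_emit: Sequence[Sequence[float]]) -> List[int]:
    """Return the most likely category index for each frame (hypothetical helper).

    log_init[j]     : log P(first category = j)
    log_trans[i][j] : log P(category j at frame t | category i at frame t-1)
    log_emit[t][j]  : log score of feature vector t under category j
    """
    n_frames = len(log_emit)
    n_states = len(log_init)
    if n_frames == 0:
        return []
    # delta[j]: best log score of any path that ends in category j at the current frame
    delta = [log_init[j] + log_emit[0][j] for j in range(n_states)]
    backptr: List[List[int]] = []
    for t in range(1, n_frames):
        new_delta, ptrs = [], []
        for j in range(n_states):
            # Pick the predecessor category that maximizes the path score into j
            best_i = max(range(n_states), key=lambda i: delta[i] + log_trans[i][j])
            new_delta.append(delta[best_i] + log_trans[best_i][j] + log_emit[t][j])
            ptrs.append(best_i)
        delta = new_delta
        backptr.append(ptrs)
    # Start from the best final category and follow back-pointers to frame 0
    state = max(range(n_states), key=lambda j: delta[j])
    path = [state]
    for ptrs in reversed(backptr):
        state = ptrs[state]
        path.append(state)
    path.reverse()
    return path


# Toy example (hypothetical numbers): category 0 = "silence", category 1 = "speech"
log = math.log
labels = ["silence", "speech"]
log_init = [log(0.6), log(0.4)]
log_trans = [[log(0.7), log(0.3)], [log(0.4), log(0.6)]]
log_emit = [[log(0.9), log(0.1)], [log(0.2), log(0.8)], [log(0.1), log(0.9)]]
best = viterbi(log_init, log_trans, log_emit)
print([labels[s] for s in best])
```

Working in log probabilities turns products of scores into sums and avoids numerical underflow on long utterances. For the toy numbers above, the best sequence is silence, speech, speech; in a real recognizer the categories would be the phone, word, or other units discussed in this chapter.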

Here¹ we discuss the linguistic categories that have been used or proposed for use in ASR. Many of these have been alluded to previously (articulatory features, phones, phonemes, syllables, words, phrases, sentences, etc.); here we attempt to define terms more rigorously, particularly with regard to their application in speech engineering. We also highlight some of the research areas relating to the representation of these linguistic categories in ASR, particularly in the context of fluent speech.

23.2 PHONES AND PHONEMES

23.2.1 Overview

Words are a natural unit for modeling in ASR, particularly since there are many applications for which isolated words are an adequate form of input. Even for continuous speech, using complete words as the fundamental linguistic unit permits acoustic modeling of the word-specific context of the sounds used. Consider, for example, a digit recognition task (e.g. for credit card numbers): the system must recognize strings ...
