was shown to improve the performance, as did the combination of the
articulatory-feature system and the baseline phone-based system.
In summary, Table 8.3 lists selected LID implementations of
phonotactic-only components and their performances spanning the last
decade, in chronological order. Note that despite the common use of the
OGI database, the exact subset used to obtain the error rates may vary in
each study. Furthermore, the baseline systems varied in complexity, so the
listed error rates should not be read as a measure of absolute progress in
phonotactic modeling.
8.7 Prosodic LID
Prosodic information, such as tone, intonation, and prominence, is primarily
encoded in two signal components: fundamental frequency (F0) and
amplitude. Thus, properties of F0 and amplitude contours can intuitively be
expected to be useful in automatic LID. In Eady (1982), for instance, two
languages with different prosodic properties were studied: English, which
belongs to the category of languages marking different levels of prominence
by means of F0 and amplitude, and Chinese, in which tonal patterns
are lexically distinctive (see Chapter 2). The author compared the time
contours of the fundamental frequency and the energy extracted from sentences
of the same type (e.g., all declarative sentences) and found unique
language-specific differences—such as a higher rate of change within
each contour—and a higher fluctuation within individual syllables for
However, the use of prosody for the purpose of identifying languages is
not without problems. The difficulty of obtaining a clear assessment of the
usefulness of a prosodic component for LID derives from the large number
of additional factors influencing F0 and amplitude. These include:
• speaker-specific characteristics (voice type, characteristic speaking
  rate, emotional state, health, etc.)
• lexical choice (word characteristics, such as word stress and lexical
• syntactic content of the utterance (statement, question)
• pragmatic content/function in discourse (contrastive emphasis, given
  versus new distinction).
Table 8.3 Examples of phonotactic LID systems and their recognition rates.

System                  Task              Test Signal Duration  ID Rate      Ref.
Interpolated Trigram    OGI-11L           10 s/45 s             62.7%/77.5%  Hazen and Zue, 1997
Multilingual Tokenizer
Bigrams (1 tokenizer)   OGI-10L           10 s/45 s             54%/72%      Zissman, 1996
PPRLM (3 tokenizers)    OGI-10L           10 s/45 s             63%/79%      Zissman, 1996
PPRLM (6 tokenizers)    OGI-6L            10 s/45 s             74.0%/84.8%  Yan and Barnard, 1995
Extended N-grams        OGI-6L            10 s/45 s             86.4%/97.5%  Navrátil, 2001
(PPRLM with 6 streams)
                        OGI-10L           mixed                 65%          Parandekar and Kirchhoff, 2003
GMM-Tokenizer           CallFriend (12L)  30 s                  63.7%        Torres-Carrasquillo et al., 2002a
GMM-Tok.+PPRLM          CallFriend (12L)  30 s                  83.0%        Torres-Carrasquillo et al., 2002a

In many cases, particularly in languages from the same family (or the same
prosodic type; see Section 8.2), the language-specific prosodic variation is
overridden by other, more dominant factors, rendering its extraction and
recognition extremely difficult. In order to successfully exploit prosodic
information, the question of how to best separate language-dependent
characteristics from speaker-dependent or other irrelevant characteristics
needs to be addressed and indeed remains one of the open challenges in
current LID technology. A potentially useful application of prosodic features
might be the preclassification of language samples into broad, prosodically
distinct categories, followed by more fine-grained subclassification using
other methods.
LID systems based solely on prosody are relatively rare. One such
approach was described in Itahashi et al. (1994) and Itahashi and Liang
(1995). In this system, fundamental frequency and energy contours were
extracted from the speech signal and approximated via a piecewise-
linear function. In order to preserve prosodic language characteristics,
a heuristically selected set of variables derived from the line-approximated
representation was calculated (mean slope of the line segments, mean relative
start frequency of a line segment with positive and negative slope, and the
correlation coefficient between the fundamental frequency and the energy
contour). Vectors of these features were then processed using principal
components analysis (PCA) to perform a dimensionality reduction. The
PCA-reduced samples were stored and compared to test samples via the
Mahalanobis distance. The system was evaluated on speech material from
six languages (Chinese, English, French, German, Japanese, and Korean);
LID rates ranged between 70 and 100%. The number of speakers used in
the experiments, however, was rather small (5 per language). An identical
approach was tested with data in six languages taken from the OGI-11L
corpus, producing accuracy rates of about 30% (Dung, 1997). Similar
prosodic results were reported by Hazen and Zue (1994). A recent study
used similar modeling of F0-based and rhythmic features from
pseudo-syllabic segments of read speech in ten languages, utilizing a
Gaussian Mixture Model (GMM)-based classifier (Farinas et al., 2002).
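
As a rough illustration of the pipeline described above (piecewise-linear
contour approximation, a few slope-based features, PCA, and matching by
Mahalanobis distance), the following Python sketch may help. It is not the
implementation of Itahashi et al.: all function names, the equal-length
segmentation, the default of eight segments, and the summary of each language
by a mean and covariance in PCA space are assumptions made here for brevity.
The contours are assumed to contain voiced frames only, with at least two
frames per segment.

```python
import numpy as np

def piecewise_linear_fit(contour, n_segments):
    """Fit one least-squares line to each of n_segments roughly equal chunks.

    Returns a list of (slope, start_value) pairs. Equal-length chunks are an
    assumption of this sketch, not a detail taken from the cited papers.
    """
    segments = []
    for chunk in np.array_split(np.asarray(contour, dtype=float), n_segments):
        t = np.arange(len(chunk))
        slope, intercept = np.polyfit(t, chunk, 1)
        segments.append((slope, intercept))  # intercept = fitted value at segment start
    return segments

def prosodic_features(f0, energy, n_segments=8):
    """Four features: mean slope, mean relative start frequency of rising and
    of falling segments, and the F0/energy correlation coefficient."""
    f0 = np.asarray(f0, dtype=float)
    energy = np.asarray(energy, dtype=float)
    segs = piecewise_linear_fit(f0, n_segments)
    slopes = np.array([s for s, _ in segs])
    starts = np.array([b for _, b in segs]) / f0.mean()  # relative to mean F0
    rising, falling = slopes > 0, slopes <= 0
    return np.array([
        slopes.mean(),
        starts[rising].mean() if rising.any() else 0.0,
        starts[falling].mean() if falling.any() else 0.0,
        np.corrcoef(f0, energy)[0, 1],
    ])

def fit_pca(X, n_components):
    """Return the feature mean and the top principal directions (rows)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(X, mean, components):
    """Project one feature vector or a matrix of them into PCA space."""
    return (X - mean) @ components.T

def fit_language_models(reduced_by_lang):
    """Store a mean and an inverse covariance per language (an assumption)."""
    models = {}
    for lang, Z in reduced_by_lang.items():
        cov = np.atleast_2d(np.cov(Z, rowvar=False))
        models[lang] = (Z.mean(axis=0), np.linalg.pinv(cov))
    return models

def identify(z, models):
    """Pick the language whose model is closest in squared Mahalanobis distance."""
    def dist2(lang):
        mu, inv_cov = models[lang]
        d = z - mu
        return float(d @ inv_cov @ d)
    return min(models, key=dist2)
```

In use, one would compute `prosodic_features` for every training utterance,
fit the PCA on the pooled feature matrix, build the per-language models from
the projected training vectors, and call `identify` on each projected test
vector. With only a few speakers per language, as in the experiments cited
above, the covariance estimates would be unreliable, which is one reason the
pseudo-inverse is used here.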
Probably the most thorough study of prosodic features for LID was
presented in Thyme-Gobbel and Hutchins (1996). The authors compared
220 prosodically motivated features that were calculated from
syllable segments represented as histograms, and used for classification
with a likelihood ratio detector. The experiments involved a series
of pair-wise language classification tasks on four languages: English,
