8.9 Trends and Open Problems in LID
Over the past 10–15 years, language identification has become a well-
established branch of speech research, with regular conference sessions
dedicated to its problems, developments, and applications. Recent LID
evaluations organized by the National Institute of Standards and Tech-
nology (NIST) in 2003 and 2005, as a continuation of previous such
evaluations from 1995–96, contributed to continued attention to this
problem and spurred new activity in this area.
We have discussed how the main approaches can be described in
terms of the signal features they seek to exploit as well as in terms of
their modeling techniques. However, the various degrees of difficulty
in the LID experiments, such as test duration, amount of training data
and data annotation, channel quality, and test-training mismatch, render
a consistent comparison of the individual algorithms difficult. In a
rough comparison, a performance-oriented ranking places the
LVCSR-based systems among the most powerful methods, followed by
the phonotactic, the acoustic, and finally the prosodic approaches.
A major trend in current LID methodology is system and component
fusion. While system fusion aims at combining entire LID systems,
component fusion exploits the fact that different sources of discriminative
information within one LID system provide partially decorrelated output
and therefore may lead to improved performance when integrated into one
decision process. Most common fusion implementations combine outputs
of the individual components, such as phonotactic, acoustic, and prosodic
modeling, into a final hypothesis by means of a higher-level classifier. This
classifier can be a simple linear function (Hazen and Zue, 1997) or a
nonlinear classifier, such as a multilayer perceptron (Navrátil, 2001; Yan
et al.).
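As an illustration, the simplest fusion back end reduces to a weighted combination of per-language component scores. The following sketch assumes hypothetical scores, languages, and weights; in practice the weights would be trained on held-out data (e.g., by logistic regression) rather than fixed by hand:

```python
# Hypothetical per-language log-likelihood scores from three LID components,
# for the language set ["English", "Mandarin", "Spanish"].
scores = {
    "phonotactic": [-12.1, -15.4, -13.0],
    "acoustic":    [-40.2, -44.8, -41.5],
    "prosodic":    [-5.3,  -4.9,  -5.6],
}

# Fusion weights (illustrative values only; normally trained on held-out data).
weights = {"phonotactic": 0.5, "acoustic": 0.3, "prosodic": 0.2}

def fuse(scores, weights):
    """Linear score fusion: weighted sum of component scores per language."""
    n_langs = len(next(iter(scores.values())))
    return [sum(weights[c] * scores[c][i] for c in scores)
            for i in range(n_langs)]

languages = ["English", "Mandarin", "Spanish"]
fused = fuse(scores, weights)
best = languages[fused.index(max(fused))]
```

A nonlinear classifier such as a multilayer perceptron would replace the weighted sum with a trained network taking the same component scores as input.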
An interesting and equally important aspect of comparing LID
approaches is the intended application. Different applications pose different
demands due to their choice of languages, typical test duration, availability
of training material, and so on. Table 8.6 summarizes some of the
advantages distinguishing individual methods from several
application-oriented viewpoints.
Table 8.6 Comparison of basic LID approaches from an application
development aspect.

Basic Approach   Strength               Constraint             Example Application
Prosodic         Robust in channel      Mostly suitable        Preclassifier of
                 mismatch               for distinguishing     tonal versus stress
                                        language groups        languages
Acoustic         Low cost in training   Usable in              Language and accent
                 and testing (data      combination with       ID in a multiapproach
                 and computation)       other components       system
Phonotactic      Good performance-      Useful for tests       LID system with large
                 to-cost ratio; no      with duration          language population,
                 linguistic knowledge   >5 seconds             including rare languages
                 required to train                             without linguistically
                                                               labeled training data
LVCSR/Keyword    High accuracy,         Significant            Multilingual dialog
                 short tests            training effort;       systems with LVCSR
                                        linguistic input       components; audio
                                        required               mining systems

While LVCSR-based LID systems can be considered the most accurate
solution to LID today (Hieronymus and Kadambe, 1997), their high
computational cost, as well as the requirement for sufficient
word-transcribed training data, limits their area of use. Potential
application scenarios are multilingual dialog or translation systems, in which
a multilingual LVCSR infrastructure is already in place, independently
of the LID task. In order to reduce computational demands, speech sig-
nals can first be decoded by all LVCSR systems in parallel, followed by
a process of hypothesis pruning and eventual convergence on a single
language hypothesis. In the extreme case of systems with limited vocab-
ularies and finite-state grammars, the LID task can be implemented as an
implicit by-product of word recognition by unifying all words in a common
(multilingual) dictionary.
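The pruning idea can be sketched as follows, assuming each language's recognizer exposes a running per-frame log-likelihood; the function name, beam width, and scores are hypothetical:

```python
def prune_languages(frame_scores, beam=10.0):
    """Run all language decoders in parallel (conceptually) and prune any
    language whose accumulated log-likelihood falls more than `beam` below
    the current best. Return the surviving language hypothesis.

    frame_scores: dict mapping language -> list of per-frame log-likelihoods.
    """
    active = {lang: 0.0 for lang in frame_scores}
    n_frames = len(next(iter(frame_scores.values())))
    for t in range(n_frames):
        for lang in active:
            active[lang] += frame_scores[lang][t]
        best = max(active.values())
        # Languages outside the beam are dropped; their decoders can be
        # stopped early, saving computation.
        active = {l: s for l, s in active.items() if s >= best - beam}
        if len(active) == 1:
            break
    return max(active, key=active.get)

# Illustrative per-frame scores: the "de" decoder steadily outperforms
# the others, so "en" and "fr" are pruned before the utterance ends.
frames = {
    "en": [-3.0, -3.5, -4.0, -4.5],
    "de": [-1.0, -1.2, -1.1, -1.3],
    "fr": [-2.5, -6.0, -6.5, -7.0],
}
winner = prune_languages(frames, beam=5.0)
```

A narrower beam converges on a single hypothesis sooner at the risk of pruning the correct language on noisy early frames.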
In terms of performance-to-cost ratio, the phonotactic approaches
seem to fare best. Their main advantage is that they are capable of
modeling arbitrary languages in terms of sound units of other lan-
guages (Zissman, 1996) and hence do not require manual labeling and
segmentation. They therefore represent an acceptable solution for many
practical applications—especially those involving large numbers of lan-
guages or rare and resource-poor languages. The many improvements in
the area of N-gram model smoothing (Section 8.6.2), context clustering

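At its core, phonotactic scoring of this kind amounts to accumulating language-specific N-gram log-probabilities over a decoded phone string. The sketch below uses hypothetical bigram tables and a crude probability floor in place of the smoothing methods of Section 8.6.2:

```python
import math

# Hypothetical bigram probabilities P(next | prev) per language, estimated
# from phone-decoded training speech. Unseen bigrams receive a small floor
# probability (a stand-in for proper N-gram smoothing).
FLOOR = 1e-4
bigrams = {
    "lang_A": {("a", "b"): 0.6, ("b", "a"): 0.5, ("b", "c"): 0.3},
    "lang_B": {("a", "b"): 0.1, ("b", "c"): 0.7, ("c", "b"): 0.6},
}

def phonotactic_score(phones, model):
    """Sum of bigram log-probabilities over a decoded phone sequence."""
    return sum(math.log(model.get((p, q), FLOOR))
               for p, q in zip(phones, phones[1:]))

decoded = ["a", "b", "a", "b", "c"]  # output of a single phone recognizer
best = max(bigrams, key=lambda lang: phonotactic_score(decoded, bigrams[lang]))
```

Because the phone recognizer and the language-specific N-gram models are trained separately, new languages can be added by estimating only a new N-gram table, with no labeled data in the target language.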