State-of-the-art systems remain very sensitive, however, to both intrinsic (for instance, non-native speech) and extrinsic (for instance, background noise) sources of variation. Word error rates are still as high as 10% for digit string recognition in a car driven on a highway with a microphone mounted on the dashboard, while they fall below 1% when the noise level is very low. Current research is nevertheless attempting to tackle these vulnerabilities, as well as the assumptions and approximations inherent in current technology, with the long-term goal of reaching human-like capabilities. Progress will hopefully be evidenced by an increase in the range of applications supported by these technologies in the near future.
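The error rates quoted here are word error rates (WER): the minimum number of word substitutions, insertions and deletions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. As a minimal sketch of the standard edit-distance computation (the digit transcripts below are hypothetical, chosen only to illustrate the metric):

```python
# Word error rate (WER): Levenshtein distance between the reference and
# hypothesis word sequences, divided by the reference length.
# Assumes a non-empty reference.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# Hypothetical digit-string example: one substituted word out of ten.
print(wer("one two three four five six seven eight nine zero",
          "one two three four nine six seven eight nine zero"))  # 0.1
```

Note how a single substituted digit in a ten-word string already yields the 10% figure mentioned above.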
For speech synthesis, today's winning paradigm in the industry⁵ is still unit selection, while more rudimentary diphone-based systems remain in use for low-cost consumer products. Statistical parametric synthesis offers very promising results and allows a degree of flexibility in speech control that is not possible with unit selection. It should quickly supersede diphone-based synthesis in industrial applications for which a small footprint is a requirement.
5. Some speech synthesisers are also available for free for research purposes. See for instance the MBROLA (http://tcts.fpms.ac.be/synthesis/mbrola.html), FESTIVAL (http://www.cstr.ed.ac.uk/projects/festival/), FestVox (http://festvox.org/) or FreeTTS (http://freetts.sourceforge.net/docs/) projects. See also TTSBOX, the tutorial Matlab™ toolbox for TTS synthesis (http://tcts.fpms.ac.be/projects/ttsbox/).
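To make the unit-selection paradigm concrete, here is a toy dynamic-programming sketch in Python. A real system selects database units (e.g., diphones in context) by minimizing a sum of target costs (mismatch between a candidate unit and the desired specification) and join costs (acoustic discontinuity between consecutive units); the units and cost functions below are illustrative placeholders, not any particular system's implementation.

```python
import math

def select_units(targets, candidates, target_cost, join_cost):
    # best[i][k]: lowest total cost of any unit sequence for targets[0..i]
    # that ends with candidates[i][k]; back[i][k] remembers the predecessor.
    n = len(targets)
    best = [[math.inf] * len(candidates[i]) for i in range(n)]
    back = [[0] * len(candidates[i]) for i in range(n)]
    for k, unit in enumerate(candidates[0]):
        best[0][k] = target_cost(targets[0], unit)
    for i in range(1, n):
        for k, unit in enumerate(candidates[i]):
            tc = target_cost(targets[i], unit)
            for j, prev in enumerate(candidates[i - 1]):
                cost = best[i - 1][j] + join_cost(prev, unit) + tc
                if cost < best[i][k]:
                    best[i][k], back[i][k] = cost, j
    # Trace back the cheapest path through the candidate lattice.
    k = min(range(len(candidates[-1])), key=lambda k: best[-1][k])
    path = [k]
    for i in range(n - 1, 0, -1):
        k = back[i][k]
        path.append(k)
    path.reverse()
    return [candidates[i][k] for i, k in enumerate(path)]

# Hypothetical usage: each unit is a (diphone, pitch-in-Hz) pair; the costs
# penalise pitch mismatch at the target and pitch jumps at the joins.
targets = [("a-b", 120.0), ("b-a", 110.0)]
candidates = [[("a-b", 118.0), ("a-b", 150.0)],
              [("b-a", 112.0), ("b-a", 90.0)]]
print(select_units(targets, candidates,
                   target_cost=lambda t, u: abs(t[1] - u[1]),
                   join_cost=lambda u, v: abs(u[1] - v[1])))
# -> [('a-b', 118.0), ('b-a', 112.0)]
```

Statistical parametric synthesis replaces this search through a large recorded-unit database with generation from statistical models of the speech parameters, which is what gives it the control flexibility and small footprint mentioned above.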