Once your brain has decided to classify a sound as speech, it brings online a raft of tricks to extract from it the maximum amount of information.
Speech isn’t just another set of noises; the brain treats it very differently from ordinary sounds. Speech is predominantly processed on the left side of the brain, while nonspeech sounds are mostly processed on the right.
Note
This division is less pronounced in women, which is why they tend to recover better from strokes affecting their left-sided language areas.
Knowing you’re about to hear language prepares your brain to make lots of assumptions specially tailored to extract useful information from the sound. It’s this special way of processing language-classified sounds that allows our brains to make sense of speech that is coming at us at a rate of up to 50 phonemes a second—a rate that can actually be produced only using an artificially sped-up recording.
To hear just how much the expectation of speech influences the sounds you hear, listen to the degraded sound demos created by Bob Shannon et al. at the House Ear Institute ( http://www.hei.org/research/aip/audiodemos.htm ).
In particular, listen to the MP3 demo that starts with a voice that has been degraded beyond recognition and then repeated six times, each time increasing the quality ( http://www.hei.org/research/aip/increase_channels.mp3 ).
You won’t be able to tell what the voice is saying until the third or fourth repetition. Listen to the MP3 again. This time your brain knows what to hear, so the words are clearer much earlier. However hard you try, you can’t go back to hearing static.
Sentences are broken into words, which carry meaning and are organized by grammar, the system by which we can build up an infinite number of complex sentences and subtle meanings from only a finite pool of words.
Words can be broken down too, into morphemes, the smallest units of meaning. “-ing” is a morpheme: it makes the word “run” become “running,” and it imparts meaning of its own. There are further rules at this level, about how to combine morphemes into larger words.
Morphemes, too, can be broken down, into phonemes. Phonemes are the basic sounds a language uses, so the word “run” has three: /r u n/. They don’t map cleanly onto the letters of the alphabet; think of the phoneme at the beginning of “shine.” Phonemes are different from syllables. So the word “running” is made up of two morphemes and has five phonemes, but just two syllables (and seven letters of course).
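To make that hierarchy concrete, here is a minimal sketch in Python (not part of the original hack; the phoneme symbols are a rough IPA-style transcription chosen purely for illustration) that tallies the units in “running”:

# A toy illustration of the hierarchy described above: the word "running"
# split into morphemes (units of meaning) and phonemes (units of sound).

word = "running"

# Two morphemes: the stem "run" plus the suffix "-ing".
morphemes = ["run", "-ing"]

# Five phonemes, but only two syllables.
phonemes = ["r", "ʌ", "n", "ɪ", "ŋ"]
syllables = ["run", "ning"]

print(f"{word}: {len(morphemes)} morphemes, {len(phonemes)} phonemes, "
      f"{len(syllables)} syllables, {len(word)} letters")
# -> running: 2 morphemes, 5 phonemes, 2 syllables, 7 letters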
Languages have different sets of phonemes; English has about 40–45. There are more than 100 phonemes that the human mouth is capable of making, but as babies, when we start learning language, we tune into the ones that we encounter and learn to ignore the rest.
People speak at about 10–15 phonemes per second, or 20–30 if they’re speaking fast, and that rate is easily understood by native speakers of the same language (with fast-forwarded recorded speech, listeners can understand up to 50 phonemes per second). Speech this fast can’t deliver each sound sequentially and independently. Instead, the sounds end up on top of one another: as you’re speaking one phoneme, your tongue and lips are already halfway to the position required for the next one, anticipating it, so words sound different depending on the words before and after them. That’s one of the reasons making good speech recognition software is so hard.
The other reason software to turn sounds into words is so hard to build is that the layers of phonemes, morphemes, and words are messy and influence one another. Listeners know to expect certain sounds, certain sound patterns (morphemes), and even which word is coming next. The stream of auditory input is matched against all of that, and we’re able to understand speech even when phonemes (such as /ba/ and /pa/, which can also be identified by watching lip movements [[Hack #59]]) are very similar and easily confused. The lack of clean abstraction layers—and the need to understand the grammar and meaning of the sentence just to figure out what the phonemes are—is what makes the job so hard for software.
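Here is a toy sketch in Python of that top-down matching, purely illustrative (the mini-lexicon and the recognise function are invented for this example, and no real recognizer works this simply): a first segment that is acoustically ambiguous between /b/ and /p/ gets settled by the listener’s expectations about which words are likely.

# A toy sketch of top-down disambiguation: an ambiguous first segment that
# could be /b/ or /p/ is resolved by checking candidates against a small
# lexicon of words the listener expects to hear.

# Hypothetical mini-lexicon of expected words, keyed by phoneme sequence.
expected_words = {
    ("b", "a", "t"): "bat",
    ("p", "i", "n"): "pin",
}

def recognise(segments, lexicon):
    """Each segment is a set of phonemes the acoustics are consistent with.
    Return every word in the lexicon that fits one phoneme per segment."""
    matches = []
    for phonemes, word in lexicon.items():
        if len(phonemes) == len(segments) and all(
            p in options for p, options in zip(phonemes, segments)
        ):
            matches.append(word)
    return matches

# The first sound is acoustically ambiguous between /b/ and /p/,
# but only "bat" is among the expected words, so it wins.
ambiguous_input = [{"b", "p"}, {"a"}, {"t"}]
print(recognise(ambiguous_input, expected_words))  # -> ['bat']

The point of the sketch is the direction of the influence: knowledge of the words you expect reaches down and decides what the phonemes were, rather than the phonemes being identified first and the words assembled afterward.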
It’s yet another example of how expectations influence perception in a very fundamental way. In the case of auditory information, knowing that a sound is actually speech causes the brain to route it to a completely different region from the one in which general sound processing takes place. Once the sound reaches the speech-processing region, you’re able to hear words you literally couldn’t have heard when you thought you were listening to mere noise, even though the sound itself is identical.
To try this, play for a friend synthesized voices made out of overlapping sine-wave sounds ( http://www.biols.susx.ac.uk/home/Chris_Darwin/SWS ). This site has a number of recorded sentences and, for each one, a generated, artificial version of that sound pattern. It’s recognizable as a voice if you know what it is, but not otherwise.
When you play the sine-wave speech MP3 (called SWS on the site) to your friend, don’t tell her it’s a voice. She’ll just hear a beeping sound. Then let her hear the original voice saying the same sentence, and play the SWS again. With her new knowledge, the sound is routed to speech recognition and will sound quite different. Knowing that the sound is actually made of words and is English (so it’s built from guessable phonemes and morphemes) allows the whole recognition process to take place, which couldn’t happen before.
Mondegreens occur when our phoneme recognition gets it oh-so-wrong, which happens a lot with song lyrics. They’re so called from the mishearing of “and laid him on the green” as “and Lady Mondegreen” ( http://www.sfgate.com/cgi-bin/article.cgi?file=/chronicle/archive/1995/02/16/DD31497.DTL ). SF Gate keeps an archive of misheard lyrics ( http://www.sfgate.com/columnists/carroll/mondegreens.shtml ).