Chapter 1. Why Voice First
Notwithstanding the massive adoption of laptops and smartphones, and the ubiquity of screens—whether gargantuan billboards or tiny ones on smartwatches—voice remains by far the medium that humans use the most to communicate with one another. We speak far more often than we type or touch. This even applies to a typical member of the Digital Generation. Watch gamers and note how much they verbalize their thoughts and feelings as they play; and note that those who watch them attend the sessions to listen to them talk as well as watch them play.
There are several reasons why voice is the predominant mode of communication. Here’s a list of the key ones.
Eyes-Free
Unlike reading, listening does not require your eyes to be focused on anything to receive the information. We can have a conversation with our eyes closed. This opens up a world of possibilities. Communication can take place while you are doing any of the following: sitting in a dark room; driving; watching TV; reading; taking a walk with someone else; potting a plant; admiring a landscape; typing; lying on the grass with hands crossed behind your head, side by side with a dear friend, staring at the sky; and so on.
Hands-Free
Similarly, unlike writing or typing, speaking does not require us to use our hands. We can hold a conversation with our hands occupied doing something: holding a book, typing on a laptop, preparing food, folding laundry, potting flowers, putting on our shoes, combing our hair, cutting coupons, putting on mittens, washing our hands, taking a bath or a shower, and so on.
Ephemerality
Unlike things that we type or images that we look at, a pure voice communication comes and goes, leaving no trace and nothing to clean up after the interaction is done. There are times when that is a limitation (think about the process of booking a flight or renting a car using a voicebot), but at other times, such ephemerality is a good thing: I get a piece of information, I respond, the thing gets it done, and I am back to my life stream, with no text boxes or notifications to clear out and no browser tabs to kill.
Wealth
It is often said that a picture is worth a thousand words. But how many words—or even pictures—are worth a voiced, spoken word?
Take this sentence: “That’s great—that’s all we need!”
Let’s say that that sentence was my friend’s email response to a longer email I had sent them. Would you be able to tell me the meaning of that utterance just by looking at the text? Probably not. It could mean, “That’s great! This is good news. Now, let’s make the most of it,” in response to my note: “We just got $500k in Angel funding!” Or it could be ironic: “That’s great!—that’s all we needed, wasting another daylong meeting talking nonsense” in response to my email, “Looks like George wants to do another daylong offsite.”
Now imagine hearing the response: “That’s great—that’s all we need!” Chances are that you would probably be able to tell me, with minimal prior insight into the original email that I had sent, what the meaning of the response is: was it an expression of delight and an enthusiastic call to action, or bitter sarcasm?
In addition, you would probably be able to tell me whether the speaker is a man or a woman and, if the speaker is a mutual friend, you would be able to immediately identify them.
In general, a piece of audio is far more than simply spoken text. It can also communicate:
- Gender
- Identity
- Age
- Personality
- Mood/emotion
- Emphasis
- Ethnicity/region (through accent)
Minimal Effort
In normal circumstances, uttering a few words to a smart speaker or to your AirPods takes far less effort than typing text or navigating (tapping or swiping) on a small screen. There is no laptop or smartphone to find, power up, and sign into. If you can speak, you just speak.
Broadcasting
Unless I am writing on a whiteboard for an audience that is looking at my whiteboard, my writing is usually private or directed at a specific set of recipients. For instance, I send a written email to a specific set of coworkers, or I text with some specifically selected persons. In the case of speech, unless I consciously take precautions to limit my audience (close the door, speak softly), my spoken words are broadcast through physical space to whoever is within earshot. Often, this presents an issue: privacy. But at other times, the broadcast character of voice is an asset. For instance, let’s say I have a smartphone application that tells jokes. In a setting where I have friends around me, sharing that joke by passing my smartphone around is not as compelling as having that joke voiced to everyone at the same time through my smart speaker.
Nonliteracy
With the spoken word, you don’t need to know how to read or write in order to communicate. Think of the toddler being able to express their feelings, needs, and wants. Think also of the adult who is not literate enough to comfortably read or write. Unlike written forms of communication, voice does not expect you to have been trained in anything other than your mother tongue.