Introduction
Voice is our most primordial means of communication. When we make it out of our mother’s womb, the first thing that we do is to blare out our arrival at the top of our lungs. Our loud cry—our very first public statement—is rich in meaning: it tells the world that we are out, that we have arrived, that we have made it and are healthy; and it reminds everyone within earshot that even though we have many years to go to learn how to speak properly, we will express ourselves forcefully and we will communicate our needs without hesitation. And we will be doing all of this with our voice.
From that day on, we will indeed rely on our voice to communicate not only our needs but also, later, our pleasure, discomfort, boredom, and delight. By the age of 18 months, we are learning at the astonishing rate of 10 words/day, a rate that we maintain well into adolescence.1 And once we have learned how to string sentences together, we turn into veritable torrent-producing verbal machines. The average number of words spoken by an adult per day is 16,000 (and by the way, the average is gender neutral: women and men, it turns out, speak roughly the same number of words).2 Compare these 16,000 words/day to the average number of words typed by an adult per day. The average person types around 30 words per minute.3 If you type continuously at that pace, the most you will produce is around 1,800 words/hour. If you type nonstop for 8 hours, you will produce 14,400 words, or close to 15,000. Unless you are a full-time, very focused, and highly dedicated data entry professional, the number of words you actually type per day is much lower than that: between 3,000 and 4,000 words.
So, in essence, we produce roughly four to five times more meaning by speaking than we do by typing or writing. (Typing and writing, by the way, include texting and tweeting.)
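For readers who want to check the arithmetic behind that comparison, here is a minimal sketch. It uses only the figures quoted in the text (16,000 spoken words/day, 30 words per minute, 3,000 to 4,000 typed words/day); the variable names are ours and purely illustrative.

```python
# Figures quoted in the text; the names are illustrative assumptions.
SPOKEN_WORDS_PER_DAY = 16_000
TYPING_WORDS_PER_MINUTE = 30
TYPED_WORDS_PER_DAY_LOW = 3_000
TYPED_WORDS_PER_DAY_HIGH = 4_000

# Upper bound on typing output if one typed without pause.
words_per_hour = TYPING_WORDS_PER_MINUTE * 60
words_per_8_hours = words_per_hour * 8

# How many times more words we speak than we type in a day.
ratio_min = SPOKEN_WORDS_PER_DAY / TYPED_WORDS_PER_DAY_HIGH
ratio_max = SPOKEN_WORDS_PER_DAY / TYPED_WORDS_PER_DAY_LOW

print(f"Nonstop typing: {words_per_hour} words/hour, {words_per_8_hours} words in 8 hours")
print(f"Spoken-to-typed ratio: {ratio_min:.1f}x to {ratio_max:.1f}x")
```

The ratio runs from 4.0 to about 5.3, which is where the "four to five times" estimate comes from. It counts words, which we treat here as a rough proxy for meaning.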
How about reading text versus listening to spoken words?
No matter how strong a reader you may be, you will consume far more meaning by listening than by reading. Think about all the meetings you participate in on a daily basis: the teleconference calls; the podcasts you listen to; the social audio sessions you attend; the casual chats you have, whether face-to-face or over the phone, with your family, colleagues, and friends; the lectures you attend; the radio you listen to; the videos you watch; and the TV shows you binge on. Compare all that to how much text you read daily: SMS texts, tweets, emails, documents, articles, books. Unless you are a graduate student cooped up in the library all day, the text you read doesn’t even come close to the audio you passively and effortlessly hear.
So, in terms of pure volume, the bottom line is that we deliver and process far more meaning through audio than we do through all of the other media—probably all the other media combined.
But as we hope the rest of this book will show, voice and audio are becoming our most compelling means of communicating with other humans, not only because they are the most natural means (the ones we started using from day one) but also because they are by far the best suited to a way of life that is becoming increasingly action-heavy. We are constantly doing, and we are doing it while on the move, and, crucially, we are doing it not only in collaboration with other people but also in collaboration with machines that are finally in a position to help us. These machines are helping us not just with things that are hard to do physically but also with things that are hard to do intellectually and cognitively. We need information quickly, and we have machines that can help us obtain that information. What is the fastest way of getting hold of that information? In many cases, it is by speaking. Not by typing, swiping, tapping, or pinching, but just by speaking. And so, just as we invented chainsaws and drills to get done what used to take a great deal of effort, skill, and toil, we have invented information technology that enables us to create, store, and retrieve information without that effort, skill, and toil.
We have lived long enough, and have witnessed firsthand the rise of the internet and all that its emergence has created, to state the following with little hesitation: just as we had no idea in 1982 what the world would look like in 2002, and just as we had no idea in 2002 what the world would look like in 2022, we believe that it is impossible to have any accurate idea of what the world will look like in 2042. What we do know is this: the best way to navigate the coming decades is to stick to some basic principles. We share five of them here.
First, we need, from the outset, to avoid the sin of establishing taboos and erecting dogmas. Yes, we need to establish rules, best practices, standards, and guidelines, and this book, for instance, is an exercise in exactly that. But whatever we propose, invent, agree on, and adopt must always be open to challenge. This ethos of constant revision is crucial because innovation is accelerating: if we wish to take full advantage of the innovations we are creating, we must be able to adapt to change quickly.
Second, now that we can, we need to dive deep into whatever we are doing. Excellence is rare because delivering excellence is very hard. But excellence can become less rare if we adopt the ethos of diving deep into whatever we are doing. With tools, ecosystems, open source code, and communities growing and thriving, we find ourselves increasingly in the exciting position of being able to focus on delivering on our ideas without having to waste enormous, precious resources on the means to enable those ideas. I don’t have to buy an expensive server, install expensive software, or hire expensive people to launch a solid piece of technology. Cloud services with the software I need are available and affordable. So is talent: the gig economy is here to enable me to engage with software developers from around the world. The result: My team and I can focus our time, energy, and money on diving deep into use cases and on delivering true value that is easy for my customers to consume.
Third, we need to realize that with voice first, we are experiencing a major technological disruption of the same magnitude as the ones we saw with the introduction of the personal computer in the 1980s, the internet in the 1990s, the rise of the smartphone in the 2000s, and the use of social media in the 2010s. The 2020s are going to be the decade of voice first (among other things) and, in general, the decade in which we take for granted the ability to engage our world (physical and virtual) with our eyes and hands no longer tethered to screens, big and small. How is this realization useful? At the very least, it should prime us to think in deep and broad outlines and to avoid comfortable parochial stances. For instance, when we start innovating in voice first, let’s not spend our precious time and money on “voice enabling” what we can already do well with screens. Let’s dig deep, understand what makes voice so different from the visual/tactile interface, and then build tools (in our case, voicebots) to deliver experiences that simply cannot be delivered with the screen-based interface. This book is all about getting the reader to be ambitious in that way: what voicebot can I build, and how well can I build it, so that my voicebot enables humans to do things they couldn’t do before, or to do them less clumsily, with less toil, less discomfort, and much greater ease than with anything other than my voicebot?
Fourth, within this ethos of diving deep, we need to take seriously the one thing at the heart of what will enable us to deliver excellence: context. And when we say context, we don’t mean only the context of the person using our voicebot, but all contexts at all levels in the process of ideating, researching, building, and pushing voicebots to the real world, and then, crucially, of keeping these voicebots alive and working hard so that they remain as useful as they can be for real people confronting real problems. Yes, we need to do our homework to understand the context of the use of the voicebot, but we must also understand that a successful voicebot will not survive in the real world unless we, its creators, take seriously the context of its existence. We have built a robust voicebot. Good. But have we made our staff aware of its existence? Does the customer care team know what the voicebot does and how it can help customers? Has this team invested any resources in communicating with the customers who can benefit from it? In the crushing majority of voicebot deployments today, which are very expensive deployments, the astonishing answer is no. Context at every single step is either ignored, touched on faintly, or touched on in a sloppy way. We have a long way to go on this score.
Last, let us always do what we can to build the right thing. In our context of voicebot building, let us make sure that we are not embracing foundationally dubious propositions, such as building voicebots that emulate human behavior. As we have already mentioned, and will repeatedly state in this book, a human being does not interact with voicebots the way they interact with other human beings. This may seem anodyne enough—and in fact it should be—but we have seen many instances where the designer is trying earnestly to make the voicebot act “naturally”—that is, sound and behave like a human being. But that is not the task of a voicebot designer. Their task is much simpler than that, and much more likely to deliver value, and even delight. A voicebot is a tool—a mere tool—that makes use of voice and sound to help a human being do something. The voicebot designer should always approach their work with an open, creative mind rather than artificially hem themselves into the corner of human emulation. We hope this book will help the reader take one step toward embracing that professional disposition.
1 Clifford Nass, Wired for Speech: How Voice Activates and Advances the Human-Computer Relationship (MIT Press, 2007), 1.
2 Matthias R. Mehl et al., “Are Women Really More Talkative Than Men?” Science 317, no. 5834 (July 2007): 82.
3 C. Marlin Brown, Human-Computer Interface Design Guidelines (Ablex, 1988).