Designing Voice User Interfaces by Cathy Pearl



WE LIVE IN A MAGICAL TIME. While lounging on my living room sofa, using only my voice I can order a pound of gummy bears to be delivered to my door within two hours. (Whether or not it’s a good thing that I can do this is a discussion for another book.)

The technology of speech recognition—having a computer understand what you say to it—has grown in leaps and bounds in the past few years. In 1999, when I began my career in voice user interface (VUI) design at Nuance Communications, I was amazed that a computer could understand the difference between me saying “checking” versus “savings.” Today, you can pick up your mobile phone—another magical device—and say, “Show me coffee shops within two miles that have WiFi and are open on Sundays,” and get directions to all of them.

In the 1950s, when computers were beginning to spark people’s imaginations, the spoken word was considered to be a relatively easy problem. “After all,” it was thought, “even a two-year-old can understand language!”

As it turns out, comprehending language is quite complex. It’s filled with subtleties and idiosyncrasies that take humans years to master. Decades were spent trying to program computers to understand the simplest of commands. Some believed that only an entity that lived in the physical world could ever truly understand language, because without context it is impossible to understand the meaning behind the words.

Speech recognition was around in science fiction long before it came to exist in real life. In the 1968 film 2001: A Space Odyssey, the HAL 9000 unit is an intelligent computer that responds to voice commands (although it didn’t always do what was asked). The movie, and HAL 9000, made a strong impression on moviegoers. Even now, people like to test VUIs and chatbots with the famous line, “Open the pod bay doors, HAL.”

In the movie Star Trek IV: The Voyage Home (1986), the crew of the Enterprise travels back in time to 1986, and when Chief Engineer Scotty is given a computer to work with, he addresses it by voice, saying “Computer!” When the computer doesn’t respond, Doctor McCoy hands him the mouse, which Scotty attempts to use as a microphone. Finally, when told to use the keyboard, he comments, “How quaint.” No doubt someday keyboards really will seem quaint, but we’re not there yet. However, we’re as close to the science fiction of voice recognition as we’ve ever been. In 2017, online retailer ThinkGeek will release a Star Trek “ComBadge”: just like in the TV series from the 1980s, it allows users to tap the badge and speak voice commands, which are sent via Bluetooth to your smartphone.

I find the existence of this product quite significant. Although telephone-based speech systems have been around for 20 years and mobile phone VUIs for almost 10, this badge signifies coming full circle to the original vision of what voice technology could truly offer. It’s life imitating imagination.

Why Write This Book?

So, if we’re already there—if we’re already at Star Trek levels of human–computer voice interactions—why do we need this book?

If you have ever had difficulty with a poorly designed thermostat, or turned on the wrong burner on a stove (I personally still do this with my own stove after 13 years of use), or tried to pull on a door when it should have been pushed,[1] you know that without good design, technology is difficult or even impossible to use.

Having speech recognition with high accuracy only solves part of the problem. What do you do with this information? How do you go from recognizing the words to doing what someone actually wants?

The ability of today’s smartphones to understand what you say and then act on it is a combination of two important technologies: automated speech recognition (ASR) and natural-language understanding (NLU). If someone spoke to you in a language you didn’t understand, you could probably write down, phonetically, what they said. That’s the ASR piece. But you would have no idea what it meant. Working out the meaning, so that it can be acted on, is the NLU piece.
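
The split between the two stages can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not how any production system works: `fake_asr` stands in for a real speech recognizer by returning a canned transcript, `simple_nlu` is a hypothetical keyword matcher, and the clip IDs, intent names, and slot names are all invented for this sketch.

```python
import re


def fake_asr(audio_id: str) -> str:
    """Stand-in for ASR: turn an audio clip into the words that were spoken.

    Hypothetical: a real recognizer decodes audio. Here we look up a canned
    transcript so the example is self-contained.
    """
    canned = {
        "clip-1": "show me coffee shops within two miles that have wifi",
        "clip-2": "transfer fifty dollars to savings",
    }
    return canned.get(audio_id, "")


def simple_nlu(transcript: str) -> dict:
    """Stand-in for NLU: turn the words into something an app can act on."""
    text = transcript.lower()
    if "coffee" in text:
        return {
            "intent": "find_place",
            "category": "coffee shop",
            "wifi": "wifi" in text,
        }
    match = re.search(r"\b(checking|savings)\b", text)
    if match:
        return {"intent": "bank_transfer", "account": match.group(1)}
    # The words were recognized (or not), but no meaning could be assigned.
    return {"intent": "unknown"}


if __name__ == "__main__":
    words = fake_asr("clip-1")   # ASR: audio -> words
    meaning = simple_nlu(words)  # NLU: words -> intent
    print(words)
    print(meaning)
```

Even in this toy form, the two failure modes are separate: the transcript can be perfect while the intent is still `unknown`, which is why high recognition accuracy solves only part of the problem.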

One of the most important aspects of good VUI design is to take advantage of known conversational principles. Your users have been speaking out loud and engaging in conversations with others since they were toddlers. You can ask a young child, “Please get the green ball out of the red box and bring it to me,” and she knows you mean the ball, not the box (this is called coreference and is something that’s difficult for computers).

The cooperative principle refers to the fact that listeners and speakers, in order to have a successful conversation, must act cooperatively. Paul Grice introduced this idea and divided it into four maxims:[2]


  • Say what you believe to be true.

  • Say as much information as is needed, but not more.

  • Talk about what is relevant to the conversation at hand.

  • Try to be clear and explain in a way that makes sense to others.

Many of us have had conversations with others in which these maxims are not followed, and we ended up experiencing confusion or frustration. VUIs that don’t follow these maxims will cause similar issues. Here are some ways that VUIs break these maxims and hurt the user’s experience:


  • Advertising things you can’t live up to, such as saying, “How can I help you?” when really all the VUI can do is take hotel reservations.

  • Extra verbiage, such as “Please listen carefully, as our options may have changed.” (Who ever thought, “Oh, good! Thanks for letting me know”?)

  • Giving instructions for things that are not currently useful, such as explaining a return policy before someone has even placed an order.

  • Using technical jargon that confuses the user.

People are accustomed to a variety of conversational and social practices, such as greeting people with “Hello, how are you?” even when engaging in a business transaction, and making sure to end the conversation before hanging up or walking away. VUIs are not humans, but they still benefit from following basic social conventions.

Even if your VUI follows these principles, will it truly understand your user? And does it matter?

The Chinese Room and the Turing Test

In 1980, philosopher John Searle proposed “the Chinese room argument,” in which a person sits in a room and is handed pages of Chinese symbols. The person, who does not read or understand Chinese, looks up the symbols in a rule book (which provides appropriate characters in response), copies the responses, and then hands them back.

To someone outside the room, it appears as if the person responding understands Chinese perfectly. Searle argued that if a computer did the same thing, we might consider it intelligent—when in fact, no thinking is involved at all. After all, the person in the room does not understand Chinese.
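
Searle’s setup can be mimicked by a tiny lookup program. The sketch below is only an illustration: the two-entry `RULE_BOOK` and the function name `room_occupant` are invented here, and nothing in it is meant to represent how a real VUI or chatbot is built.

```python
# Hypothetical rule book: symbols handed in -> symbols to hand back.
RULE_BOOK = {
    "你好": "你好！",            # "Hello" -> "Hello!"
    "你会说中文吗？": "会。",     # "Do you speak Chinese?" -> "Yes."
}


def room_occupant(symbols: str) -> str:
    """Copy out whatever response the rule book lists for these symbols.

    The function only matches and copies strings. At no point does it
    translate, parse, or otherwise interpret them.
    """
    return RULE_BOOK.get(symbols, "？")


if __name__ == "__main__":
    print(room_occupant("你会说中文吗？"))
```

From outside, the replies look fluent, yet the program is doing nothing more than string matching, which is the point of the argument.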

In 1950, Alan Turing introduced a test to answer the question “Can machines think?” Every year since 1991, the Loebner Prize has been awarded to the creator of the computer program that is best at fooling human judges into thinking it is human. Judges chat (by typing) with the program as well as with humans, and try to discern which is human and which is computer. Over the years, the programs have become more sophisticated, but no contender has yet claimed the gold medal for fooling all judges into thinking the computer is human. Amazon recently created its own competition, the “Alexa Prize.” The grand challenge for the 2017 Alexa Prize is to create a socialbot that converses coherently and engagingly with humans on popular topics for 20 minutes.

This book is not a philosophical one. Whether a computer “thinks” is not a question for these pages. Instead, this book takes a more practical approach. Fooling people into thinking a VUI or bot is human is not necessary for success. Although replicating many of the aspects of human conversation is crucial for a good VUI, in many ways, it’s better to be up front that the user is speaking to a computer. People are more forgiving if they know they’re speaking to a bot. The goal of your VUI shouldn’t be to fool people into thinking it’s a human: it should be to solve the user’s problem in an efficient, easy-to-use way.

Who Should Read This Book

The main audience for this book comprises people who are designing VUIs, whether for a mobile phone, a toy, or a device such as a home assistant. Although many general user interface design principles apply to VUIs, there are important differences between designing for VUIs and designing for websites or GUI-only mobile apps. With GUIs, the number of things your users can do is constrained, and it’s clear when someone has pressed a button or chosen a menu item. When someone speaks, we have a good theory about what that person said, but there are many additional design pieces necessary to ensure a good user experience.

Developers who are creating their own VUIs (or other types of conversational user interfaces such as chatbots) will also benefit from understanding the basic design principles, so that even prototypes are more likely to be successful.

Managers and business developers can learn about the challenges of designing VUIs and whether VUIs are right for the problem they are trying to solve. In some cases, a GUI app will do the job just fine, and a VUI is not needed.

How This Book Is Organized

Chapter 1: Introduction

This introductory chapter covers a brief history of VUIs and whether a VUI is right for you and your app. It also outlines what “conversational” means, and provides an overview of chatbots.

Chapter 2: Basic Voice User Interface Design Principles

This chapter lays the groundwork for what you need to know to create a VUI. It covers essential design principles on topics such as design tools, confirmations, error behavior, and novice versus expert users.

Chapter 3: Personas, Avatars, Actors, and Video Games

Chapter 3 is useful for designers who would like to add an avatar or character to their VUI. It’s also useful if you’re not sure if your VUI should have an avatar. In addition, it discusses persona design, which is essential for all VUIs.

Chapter 4: Speech Recognition Technology

This chapter is essential for VUI designers. It’s a primer on the pieces of the technology that have a big impact on design.

Chapter 5: Advanced Voice User Interface Design

Chapter 5 goes beyond what’s covered in Chapter 2. It includes more complex strategies for natural-language understanding, sentiment analysis, data collection, and text-to-speech.

Chapter 6: User Testing for Voice User Interfaces

This chapter details how user testing for VUIs differs from user testing for mobile apps and websites. It covers low-fidelity testing methods as well as testing remotely and in the lab. There is also a section on testing VUIs in cars and other types of devices.

Chapter 7: Your Voice User Interface Is Finished! Now What?

This chapter outlines the methodologies you need once your VUI is “in the wild.” It covers how and what information you can analyze to understand and improve performance. Don’t wait until you launch to read this chapter, however, because it’s essential to know what to log while the system is still being developed.

Chapter 8: Voice-Enabled Devices and Cars

The final chapter focuses on VUIs that are not covered in earlier chapters. The “Devices” section covers home assistant devices and wearables. The section “Cars and Autonomous Vehicles” reviews the challenges and best practices of designing for automobiles. Much of this chapter relies on contributions from other experts in the field.

Some designers will be creating a VUI from end to end, as a standalone system, while others will use an existing platform, such as a single skill for the Amazon Echo. For those readers focused on building on top of an existing platform, Chapter 2, Chapter 4, and Chapter 5 will be especially relevant.

O’Reilly Safari

Safari (formerly Safari Books Online) is a membership-based training and reference platform for enterprise, government, educators, and individuals.

Members have access to thousands of books, training videos, Learning Paths, interactive tutorials, and curated playlists from over 250 publishers, including O’Reilly Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others.

For more information, please visit http://oreilly.com/safari.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

  • O’Reilly Media, Inc.

  • 1005 Gravenstein Highway North

  • Sebastopol, CA 95472

  • 800-998-9938 (in the United States or Canada)

  • 707-829-0515 (international or local)

  • 707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/designing-voice-user-interfaces.

To comment or ask technical questions about this book, send email to .

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly

Follow us on Twitter: http://twitter.com/oreillymedia

Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments


This book could not have been written without the help of so many others.

I must begin by recognizing Karen Kaushansky, who originally got me in touch with O’Reilly Media when they had the foresight to commission a book on the topic of VUIs. Next, to Nick Lombardi at O’Reilly, who talked me through the process and made me believe it was doable, even if I did have a full-time job! Angela Rufino, my editor at O’Reilly, was instrumental in shaping the book by providing encouragement and useful editing suggestions.

My thanks to my technical reviewers for their time, opinions, and insightful suggestions on the whole kit and caboodle: Rebecca Nowlin Green, Abi Jones, Tanya Kraljic, and Chris Maury.

Thanks to Ann Thyme-Gobbel, who generously offered to review many chapters, and who I can always count on to share the good and the bad of VUIs.

Thanks to my other reviewers, Vitaly Yurchenko and Jennifer Balogh, for being so generous with your time and providing thoughtful editing suggestions.

To my contributors, I extend my deepest appreciation: Margaret Urban, Lisa Falkson, Karen Kaushansky, Jennifer Balogh, Ann Thyme-Gobbel, Shamitha Somashekar, Ian Menzies, Jared Strawderman, Mark Stephen Meadows, Chris Maury, Sara Basson, Nandini Stocker, Ellen Francik, and Deborah Harrison.

I also would like to recognize my coworkers at Nuance Communications, where I spent eight years learning what the heck this speech recognition stuff was all about, and the day-to-day practicalities of creating interactive voice response systems: it was a wonderful time in my life.

To Ron Croen and the rest of my team at Volio, thank you for convincing me to give VUIs another try, after I’d sworn them off forever.

To my team at Sensely and our virtual nurse Molly, for pushing the envelope with VUIs in order to help people lead healthier lives. Thank you so much.

And finally, my greatest appreciation goes to my family. To my son, Jack, who has helped me see what VUIs mean for the next generation. He immediately welcomed Amazon Echo’s Alexa as a new member of our household with requests for jokes, homework help, and to play “The Final Countdown.” Just one. More. Time.

And to my husband, Chris Leggetter, thank you so much for your infinite support during this entire book-writing roller coaster, from the highs (“I think this book thing is going to happen!”) to the lows (“Oh no, what have I done!”). Thank you for your patience. Now we can finally watch Season 4 of House of Cards.

[1] For more on this, see “Norman Doors: Don’t Know Whether to Push or Pull? Blame Design” (http://99percentinvisible.org/article/norman-doors-dont-know-whether-push-pull-blame-design/).

[2] Grice 1975.
