Basic principles for designing voice user interfaces
Challenges and opportunities, how VUIs differ from IVRs, and tools for creating great conversational designs.
In the early 2000s, interactive voice response (IVR) systems were becoming more common. Initially a touch-tone/voice hybrid (“Please press or say one”) and very primitive, they became an expected way to communicate with many companies. IVRs could help callers get stock quotes, book flights, transfer money, and get traffic information. Many of them were designed poorly, and websites popped up listing back doors for getting transferred immediately to an operator (something many companies actively tried to hide). IVRs got a bad reputation, even ending up the subject of satire on Saturday Night Live.
IVRs were created to automate tasks so customers would not always have to speak to a live person to get things done. They predate both the widespread use of the internet and the invention of smartphones.
Nowadays, many IVRs are used as the “first response” part of a phone call, so that even if the caller ends up speaking with an agent, basic information has been collected (such as a credit card number). For many tasks, even complex ones such as booking a flight, an IVR can do the job. In addition, IVRs are great at routing customers to a variety of different agent pools, so that one phone number can serve many needs. Finally, some users actually prefer using an IVR over speaking with an agent, because they can take their time and ask for information over and over (such as the early Charles Schwab stock quote IVR) without feeling they’re “bothering” a human agent.
Although some of the design strategies from the IVR world also apply to mobile voice user interface (VUI) design (as well as VUI for devices), mobile VUIs also present a unique set of challenges (and opportunities). This chapter outlines design principles for the more varied and complex world of designing VUI systems.
One of the challenges of designing a mobile VUI is deciding whether it will have a visual representation, such as an avatar. In addition, when will your VUI allow the user to speak? Will users be able to interrupt? Will it use push-to-talk? These challenges are discussed later in the book.
One of the opportunities that mobile devices have that IVRs do not is that mobile devices can have a visual component. This can be a big advantage in many ways, from communicating information to the user, to confirming it, even to help the user know when it’s their turn to speak. Allowing users to interact both via voice and using a screen is an example of a “multimodal” interface. Many of the examples in this book are for multimodal designs. In some cases, the modes live together in one place, such as a virtual assistant on a mobile phone. In others, the main interaction is voice-only, but there is also a companion app available on the user’s smartphone.
For example, let’s say you ask Google, “Who are the 10 richest people in the world?” Google could certainly read off a list of people (and their current worth), but that is a heavy cognitive load. It’s much better to display them, as shown below.
Taking advantage of the visual capabilities of mobile is essential to creating a rich VUI experience. In addition, this visual component can allow the user to continue at a more leisurely pace. In an IVR, it is rare to be able to pause the system—instead, the user must continually interact.
If your VUI will have a visual component, such as a mobile app, video game, or smart watch, it’s important to design the visual and the voice in tandem. If the visual designer and the VUI designer don’t work together until the end, the joining of the two mediums can be awkward and haphazard. VUI and visual are two components of the same conversation the user is having with the system. It’s essential to design together from the beginning.
Another common difference today is that VUIs on mobile apps and devices are often used for one-turn tasks. For example, I’ll ask Cortana to set an alarm (Figure 2), ask Google what the fastest land animal is, or ask Amazon Echo’s Alexa to start playing my favorite radio station. These interactions are quite contained and do not require the system to maintain a lot of information.
Although this is quite common now, do not confine your VUI experience to this model. To start thinking more specifically about how to best design VUI for mobile, let’s dive first into the topic of conversational design.
Imagine you’re having a conversation with a friend. You’re sitting in a coffee shop, catching up after a long absence. Your friend says, “Did you see the new Star Wars movie?” “Yes,” you reply. “Did you like it?” she asks next. You say “I’m sorry, I don’t understand.” No matter how many times she repeats herself, you never answer her question.
That level of frustration is about where we are with many VUI systems today. Despite the many recent advances in speech recognition technology, we’re still a long way from simulating human conversation. Here’s a real-world example from OK Google, illustrating two conversational “turns” (a turn is one interaction between the user and the system):
USER: Ok Google. When’s my next appointment?
GOOGLE: You have a calendar entry tomorrow. The title is ‘Chinatown field trip’.
USER: Ok Google. Can you please repeat that?
Google has let down its end of the conversation. It’s as if the first part never happened. “Conversational design” is becoming a common term, but it is often misused. Many people use it to mean any interaction with a system in which you speak or text. But many of these “conversations” have only one turn; for example, asking Hound where the nearest coffee shop is located.
In this book, I define conversational design to mean thinking about an interaction with a VUI system beyond one interaction. Humans rarely have conversations that only last one turn. Design beyond that one turn; imagine what users might want to do next. Don’t force them to take another turn, but anticipate and allow it. In addition, it is vital to keep a recent history of what the user has just told you. Having a conversation with a system that can’t remember anything beyond the last interaction makes for a dumb and not very useful experience.
When designing a VUI, many people only consider one-off tasks, such as answering a search query, setting up a calendar appointment, placing a phone call, playing a song, etc. Sometimes these tasks can be accomplished in one fell swoop. But the best VUI designs also consider what happens next.
Here’s an example in which Google does a good job of remembering what occurred in previous conversational turns:
USER: Ok Google. Who was the 16th President of the United States?
GOOGLE: Abraham Lincoln was the 16th President of the United States.
USER: How old was he when he died?
GOOGLE: Abraham Lincoln died at the age of 56.
USER: Where was he born?
GOOGLE: Abraham Lincoln was born in Hodgenville, Kentucky.
USER: What is the best restaurant there?
GOOGLE: Here is Paula’s Hot Biscuit:
It’s not quite the same as talking to a human, but Google successfully carried on the conversation for four turns, knowing the references for “he” and “there.” In addition, Google switched to a more visual mode at the appropriate time: to show the map and reviews for the restaurant.
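One way to picture this kind of context carrying is to keep track of the most recently mentioned entities and resolve references like “he” and “there” against them. The following is a deliberately minimal sketch under that assumption; the class name and resolution rules are made up for illustration and are not how Google actually implements it.

```python
import re

class DialogContext:
    """Remembers the most recently mentioned person and place so that
    follow-up questions using "he" or "there" can be resolved."""

    def __init__(self):
        self.last_person = None  # most recently mentioned person
        self.last_place = None   # most recently mentioned place

    def remember(self, person=None, place=None):
        """Update the context after each system answer."""
        if person:
            self.last_person = person
        if place:
            self.last_place = place

    def resolve(self, utterance):
        """Substitute remembered entities for the first "he"/"there"."""
        resolved = utterance
        if self.last_person:
            resolved = re.sub(r"\bhe\b", self.last_person, resolved, count=1)
        if self.last_place:
            resolved = re.sub(r"\bthere\b", "in " + self.last_place,
                              resolved, count=1)
        return resolved

ctx = DialogContext()
ctx.remember(person="Abraham Lincoln")  # after answering the first question
ctx.resolve("How old was he when he died?")
# resolves to "How old was Abraham Lincoln when he died?"
```

A real system would of course need far more sophisticated anaphora resolution, but even this toy version shows why the system must retain state between turns: without `last_person`, the second question is unanswerable.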
A good rule of thumb is to let the user decide how long the conversation will be.
Setting user expectations
Good conversational design is not just about crafting nice prompts. As Margaret Urban, interaction designer at Google, suggests: don’t ask a question if you won’t be able to understand the answer. She gives the example of a prompt that occurs after the user has finished writing an email: “Do you want to send it or change it?” One response you may not initially have planned for is “yes”—so build a response into your system to handle it. Although good prosody (where you place the emphasis) can help with this issue, it is often not enough. If you’re seeing a lot of “yes” responses, consider rewording the prompt to something clearer, such as “What would you like to do—send it, or change it?”

Urban also emphasizes that it’s important to set user expectations early on. How does your app introduce voice? You can offer a “tour” to first-time users, with educational points along the way. As Urban says:
When someone has successfully completed a VUI interaction, it’s a bit of an endorphin boost—the user has a glow of completion and satisfaction. It’s a nice time to educate people—“Since you were great at that, how about trying this?”
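Returning to Urban’s “send it or change it” example, a hypothetical sketch of handling the unplanned “yes” might look like the following. The intent names and reprompt wording are illustrative assumptions, not taken from any particular product.

```python
def interpret_send_or_change(utterance):
    """Map a user response to an action, handling the unplanned "yes"."""
    text = utterance.strip().lower()
    if "send" in text:
        return ("SEND", None)
    if "change" in text or "edit" in text:
        return ("CHANGE", None)
    if text in ("yes", "yeah", "yep", "sure"):
        # "Yes" answers the literal question but not the choice, so
        # reprompt with clearer wording instead of treating it as an error.
        return ("REPROMPT", "What would you like to do: send it, or change it?")
    return ("REPROMPT", "Sorry, you can say 'send it' or 'change it'.")
```

The key design choice is that “yes” gets its own graceful reprompt rather than falling through to a generic error, since it is a perfectly reasonable answer to the question as worded.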
Be careful about telling users that tasks were successful. Urban notes that “Setting the alarm,” for example, already implies to the user that the alarm has been set, whereas an engineer may argue that the task hasn’t necessarily been completed yet and should be followed by an additional prompt: “Alarm set successfully.”
The Amazon Echo has the following dialog when setting a timer:
USER: Alexa, set a timer for 10 minutes.
ALEXA: Setting a timer for 10 minutes.
Imagine the conversation with an additional confirmation:
USER: Alexa, set a timer for 10 minutes.
ALEXA: Setting a timer for 10 minutes.
ALEXA: Okay, timer successfully set.
It’s unnecessary verbiage. If the timer did in fact fail to be set, then it would be good to alert the user—but that’s the exception.
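This principle can be sketched in a few lines: confirm implicitly on success, and reserve an explicit message for the failure case. The function and prompt wording below are hypothetical, for illustration only.

```python
def set_timer(minutes, timers):
    """Set a timer with implicit confirmation; no extra "success" turn."""
    try:
        timers.append(minutes)  # stand-in for the real timer-setting work
    except Exception:
        # The exceptional case is the one worth an explicit message.
        return "Sorry, I wasn't able to set that timer."
    return f"Setting a timer for {minutes} minutes."

timers = []
set_timer(10, timers)  # returns "Setting a timer for 10 minutes."
```

Because the success prompt doubles as the confirmation, the happy path costs the user no extra turn, while a genuine failure still gets surfaced.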
Urban offers a good analogy about designing with breadth. Perhaps you’ve designed a system that allows people to set an alarm—but not given them a way to cancel it. She likens this to giving someone a towel for a shower, but no soap. If you set an expectation you can accomplish a task, think about the corresponding (symmetrical) task that goes with it.
Discoverability is another important element of design. How do your users know what they can say, and when they can speak? I discovered my Android camera app was voice-enabled purely by accident: while taking a picture one day, I naturally said “smile!” and the camera snapped the photo. I quickly discovered I could also say “1.. 2.. 3!” or “say cheese!” to take a photo. This is a great example of piggybacking off users’ natural speech.
Another example of a command I discovered by accident occurred after I had to reboot my Amazon Echo. When it came back to life, without thinking I said “Alexa, are you working?” and she replied that everything was in working order. I never stopped to think, “What can I ask Alexa to see if everything’s working again?” Yet my spontaneous request was handled. A far better way to check internet connectivity than digging through Network Settings on a computer!
When asking the user for information, it’s often better to give examples than instructions. If you’re asking for date of birth, for example, rather than say “Please tell me your date of birth, with the month, day, and year,” use an example: “Please tell me your date of birth, such as July 22, 1972.” It’s much easier for users to copy an example with their own information than translate the more generic instruction.
To assist you in creating great conversational designs, let’s talk about tools.
One of the best (and cheapest!) ways to start your design process is something called a sample dialog. A sample dialog is a snapshot of a possible interaction between your VUI and your user. It looks like a movie script: dialog back and forth between the two main characters. (The Google examples above are in the form of sample dialogs.)
Sample dialogs are not just a way to design what the system will say to the user (or display to the user). They are a key way to design an entire conversation. Designing prompts one at a time often leads to stilted, repetitive, and unnatural-sounding conversations.
Pick five of the most common use cases for your VUI, and write out some “blue sky” (best path) sample dialogs for each of the cases. In addition, write a few sample dialogs for when things go wrong: the system did not hear the user, or misunderstood them. When you’ve written a few, or even as you write, read them out loud: often, something that looks great written down sounds awkward or overly formal when you say it.
Sample dialogs are very low tech, but they are a surprisingly powerful way to determine what the user experience will be like, whether it’s for an IVR, a mobile app, or inside the car. In addition, it’s a great way to get buy-in and understanding from various stakeholders. Sample dialogs are something anyone can grasp, and quickly.
A great tool for this is the screenwriting software Celtx, but any place you can write text will do.
Once you’ve written some sample dialogs, a very useful design exercise is to do a “table read”: read them out loud with another person. Another great use of sample dialogs is to record them, using either voice talent or text-to-speech (whichever your system will use). This costs slightly more than simply writing them, but it is an even more powerful way to know whether the design sounds good before investing in more expensive design and development time.
When designing a mobile app, wireframes/mocks are of course also an important piece of your early design process for a VUI app; they go hand in hand with sample dialogs to help visualize the user experience. Your sample dialogs plus wireframes/mocks are your storyboard, so it’s crucial to put them together. If the VUI team is separated from the visual team, make sure you come together for this piece. To the user, it’s one experience: VUI designers and visual designers must work together closely, even in the early phases.
As this book is VUI-focused, we do not go into detail about best practices for visual design tools.
Once a variety of sample dialogs have been written and reviewed, the next step is to sketch the VUI’s flow. “Flows” (referred to as “callflows” in the IVR world) are diagrams that illustrate all the paths that can be taken through your VUI system. The level of detail depends on the type of system you are designing. For an IVR, or any closed conversation, the flow should include all possible branches the user can go down (Figure 5). This means that for each turn in the conversation, the flow lists all the different ways the user can branch to the next state. This applies both to simple states that allow only “yes” and “no” type responses and to more complex ones that might accept 1,000 possible song titles. The diagram needn’t list every phrase someone can say, but it should group them appropriately.
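In code, a closed flow of this kind can be sketched as a simple state machine in which user responses are grouped rather than enumerated. The state names and phrase groups below are illustrative assumptions, standing in for the many phrasings a real grammar would cover.

```python
# Each state maps grouped user responses to the next state. A group is a
# tuple of phrases that all branch the same way, mirroring how a flow
# diagram groups utterances instead of listing every one.
FLOW = {
    "CONFIRM_FLIGHT": {
        ("yes", "yeah", "sure", "that's right"): "BOOK_FLIGHT",
        ("no", "nope", "that's wrong"): "ASK_DESTINATION",
    },
}

def next_state(state, utterance):
    """Return the next state for an utterance, or stay put to reprompt."""
    for phrases, target in FLOW[state].items():
        if utterance.strip().lower() in phrases:
            return target
    return state  # unrecognized input: remain in the state and reprompt
```

Even this toy version captures the two jobs of a flow diagram: showing every branch out of each state, and grouping the utterances that take each branch.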
In the case of something more open-ended, such as a virtual assistant, the flow can be grouped into types of interactions: for example, calendar functions, search, calling/texting, and so on. In these cases, not all possible interactions can be spelled out, but it helps to group the various intents (Figure 6):
As of the writing of this book, some VUI and NLU (natural language understanding) specific tools were just starting to emerge. These include Tincan.AI from Conversant Labs, Pullstring’s authoring tool, Wit.ai, Api.ai (now owned by Google), Nuance Mix, and others.