Chapter 4. The Three Core Characteristics of the VUI

Although automatic speech recognition has made significant strides in the last few years, it remains an imperfect technology that has earned the skepticism of human users. But what is truly at fault is not the technology itself but the misuse of that technology. The voice user interface (VUI) is a powerful interface in well-defined circumstances and use cases. Deploy a VUI in the right conditions and you will have a truly delighted user. Attempt to use voice in the wrong conditions (e.g., the task is complex and requires referring to multiple pieces of information; the environment is noisy; the user is hard of hearing; the interface asks the user confusing, long-winded questions) and the result is, at the very least, an irritated user.

A common misconception among novice VUI designers is that designing a VUI consists of taking a graphical user interface (GUI), “simplifying” it, and then giving it a voice. After all, while only a very small minority of people can claim some visual talent (e.g., drawing), the vast majority of us can safely claim to be expert conversationalists—or at least competent enough to design a simple interaction between a human being and a computer. So, often, especially where a successful visual/tactile interface has already been deployed, a big mistake is made: the people who designed the GUI are tasked with delivering a “voice version” of that experience.

As we will argue in later chapters, the key to delivering a successful experience is the extent to which the use case and the interface fit with each other (i.e., the interface delivers an experience that solves the use case). If you have a fit, the user will be delighted. If you have a misfit, the user will abandon the interface or will use it unhappily.

The designer of a VUI must keep the following three core aspects of conversational voice in mind.

Time Linearity

Unlike graphical interfaces, voice interfaces are linearly coupled with time. When you are reading text on a web page, for instance, you can easily skip ahead with your eyes to the section you are interested in. Not so with a voice interface, where you must patiently listen to one word before you can hear the one that follows it. Here are examples of concrete guidelines that flow directly from this basic fact:

Avoid long prompts
Obviously, unnecessarily long prompts will quickly tax the user’s patience. Long prompts explaining how the voicebot works, for instance, may be necessary for a novice user, but they should not be forced upon an expert. So, try to differentiate at the outset between novice and expert users, and use short, to-the-point prompts with experts and longer ones with novices.
Use short menus
The length of an alphabetically sorted drop-down menu on a web page is a nonissue. The length of a menu in a voice interface, on the other hand, should not exceed five or six options.
Put important information first
Don’t annoy users by making them listen to unnecessary verbiage before they get the information they need. Give them what they want up front.
Allow interruptions
The ability to interrupt is usually a must-have when dealing with nonnovice users. People who know what they want to do, what to say, and how to say it don’t want to wait for the voicebot to finish talking before they can give their response.
Offer shortcuts for the user who knows what to do
Another must-have for nonnovice users is shortcuts that cut through menus and take the user to what they want to do or where they want to be in a dialog.
Allow pauses
An enormous advantage that a graphical interface has over a voice interface is the ability of the user to easily pause and pick up where they left off. We do this without even thinking when we read a piece of text. During voice interactions where the user may need to pause and do something, make sure you offer that option. For instance, if the user needs to take down a long series of numbers (say a confirmation code), ask them to go ahead and get paper and pencil and to say, “Continue,” when they are ready.
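The novice/expert differentiation recommended above can be sketched as a simple prompt selector. This is a minimal illustration; the call-count threshold, prompt wording, and menu options are hypothetical assumptions, not prescriptions from this chapter:

```python
# Sketch: taper prompt length as the user gains experience with the voicebot.

NOVICE_THRESHOLD = 3  # hypothetical: calls completed before a user counts as expert


def greeting_prompt(completed_calls: int) -> str:
    """Return a long, explanatory prompt for novices and a terse one for experts."""
    if completed_calls < NOVICE_THRESHOLD:
        # Novice: spell out what the voicebot can do.
        return ("Welcome! You can check your balance, review recent "
                "transactions, or pay a bill. Which would you like?")
    # Expert: short, to-the-point menu.
    return "Balance, transactions, or bill pay?"
```

A real deployment would also pair this with barge-in support, so that the expert user can interrupt even the short prompt.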

Unidirectionality

Compounding the linearity of speech is its unidirectional character. Just as time is a one-way street, speech is a one-way medium. When you hear something, you can’t easily go back and listen to it again. Contrast this to reading a piece of text where you can readily scan a couple of paragraphs, or even pages, then go back and reread the text. There are ways to handle this issue:

Offer to repeat
One obvious way to alleviate this limitation is to give the user the ability to have information repeated. Make sure the user is aware of this option by mentioning it at the beginning of the interaction and at any point where important information is given out.
Offer help
Crucial information, such as the instructions given at the start of the interaction, should be available for the user to tap into at any point in the exchange. Offer instructions on how to access help at the beginning of the interaction and at moments where the user is at a loss over what to do (e.g., after a no-input or a no-match).
Offer summaries
In interactions where information is being gathered from the user or given out to them in a stepwise fashion, a powerful technique to overcome the unidirectionality of voice interfaces is to offer users the ability to ask for a summary of information collected so far.
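The summary technique can be sketched as a small collector that accumulates the information gathered so far and verbalizes it on request. The slot names and phrasing below are illustrative assumptions:

```python
# Sketch: running summary for stepwise information gathering.

class SlotCollector:
    """Accumulates information gathered from the user and summarizes it on demand."""

    def __init__(self) -> None:
        self.slots: dict[str, str] = {}

    def collect(self, name: str, value: str) -> None:
        # Record one piece of information gathered from the user.
        self.slots[name] = value

    def summary(self) -> str:
        # Verbalize everything collected so far, in the order it was gathered.
        if not self.slots:
            return "I haven't collected any information yet."
        parts = [f"{name}: {value}" for name, value in self.slots.items()]
        return "So far I have " + "; ".join(parts) + "."
```

For example, after `collect("cuisine", "Chinese")` and `collect("zip code", "94110")`, calling `summary()` produces “So far I have cuisine: Chinese; zip code: 94110.”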

Invisibility

Perhaps the most frustrating thing about using a voice interface is the feeling of not knowing precisely where you are in the interaction and what exactly the voicebot expects you to do next. A well-designed web site will show visitors where in the menu tree they are, but even without a menu-path indicator, a web page usually has enough visual cues to tip the user off about where they are in the site (the URL being one simple indicator). Not so with a voice interface, where the user can quickly feel lost for lack of mental markers pinpointing precisely where they are in the exchange with the voicebot. There are ways to correct this:

Mark the exchange
Just as a well-designed web page will indicate where in the web site a user is, a good voice interface will tell the user where in the conversation they are positioned. Usually, a few words will suffice: “Looking up transactions” before engaging in an exchange where the user wishes to find out their latest bank transactions, or “Quizzing” before beginning or resuming a quiz sequence.
Trace the path
In interactions where the conversation structure is deep and wide, users can very easily become confused about where they are, even when you mark the individual levels. In such situations, you can associate with each dialog state that handles an interaction a “position marker” that traces, starting from the main menu, the position of the user within the menu tree. Something as succinct as the voicebot saying “Restaurants, Chinese, Zip code,” for instance, could help the user understand that they chose “Restaurants,” then “Chinese,” and are now being asked, or were asked, for a zip code to locate Chinese restaurants within that zip code. The designer can be less succinct with alternative phrasing such as, “We are looking for Chinese restaurants, and the next thing I need from you is your zip code.” The main idea is to give the user a way to situate themselves in a complex exchange with the voicebot.
Use earcons
An earcon, or auditory icon, is the voice equivalent of a graphical interface’s icon. An icon is a small graphic that means something specific in the context of an interaction—for instance, an arrow pointing to the right may mean go to the next page, and one pointing to the left may mean go back to the previous page. Earcons can be very useful in positioning the user within a conversation or in announcing the type of action that is about to be undertaken. For example, the sound of a keyboard clicking could indicate that the voicebot is busy doing something, while dead silence may be taken by the user to mean that the voicebot has crashed or the call has ended.
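The “position marker” technique described under “Trace the path” can be sketched as a breadcrumb trail attached to the dialog states. The state labels here are the hypothetical ones from the restaurant example:

```python
# Sketch: a breadcrumb trail that dialog states push onto and pop from,
# so the voicebot can speak a succinct position marker at any point.

class DialogPath:
    """Tracks the user's position in the menu tree, starting from the main menu."""

    def __init__(self) -> None:
        self.trail: list[str] = []

    def enter(self, label: str) -> None:
        # Called when the dialog descends into a state (e.g., "Restaurants").
        self.trail.append(label)

    def leave(self) -> None:
        # Called when the dialog backs out of the current state.
        if self.trail:
            self.trail.pop()

    def marker(self) -> str:
        # The succinct marker the voicebot can speak before prompting.
        return ", ".join(self.trail) + "." if self.trail else ""
```

Entering “Restaurants,” then “Chinese,” then “Zip code” yields the marker “Restaurants, Chinese, Zip code.”; a less succinct phrasing could be generated from the same trail.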

Perhaps the one fundamental advantage that GUIs have over VUIs is the feeling of control the user has over both the medium and the interaction. A bad GUI can certainly frustrate the user, but it takes a very bad GUI to throw the user into a state of utter confusion. A VUI, on the other hand, because it is time-linear, unidirectional, and invisible, has to stumble only a couple of times for the user to be thrown into confusion. Keeping in mind that there are key differences between designing a GUI and a VUI should help the alert VUI designer avoid the costly mistake of smuggling GUI assumptions into VUI design.
