Designing for voice and audio technology

A look at the underlying technology and considerations for VUI design decisions.

By Laura Klein
October 21, 2015
A "radiophone dance" held by an Atlanta social club in May 1920. (source: Wikimedia Commons)

Editor’s note: this is an excerpt from the free report “Design for Voice Interfaces,” by Laura Klein.

Before we can understand how to design for voice, it’s useful to learn a little bit about the underlying technology and how it has evolved. Design is constrained by the limits of the technology, and the technology here has a few fairly significant limits.

First, when we design for voice, we’re often designing for two very different things: voice inputs and audio outputs. It’s helpful to think of voice interfaces as a conversation, and, as the designer, you’re responsible for ensuring that both sides of that conversation work well.

Voice input technology is also divided into two separate technical challenges: recognition and understanding. It’s not surprising that some of the very earliest voice technology was used only for taking dictation, given that it’s far easier to recognize words than it is to understand the meaning.

All of these things (recognition, understanding, and audio output) have progressed significantly over the past 20 years, and they’re still improving. In the ’90s, engineers and speech scientists spent thousands of hours training systems to recognize a few specific words.

Systems like these rely on what are known as “finite state grammars,” so called because the system is only capable of recognizing a finite set of words or phrases. You can still see a lot of these in interactive voice response systems (IVRs), which are sometimes known as “those annoying computers you have to talk to when you call to change your flight or check your bank balance.”
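To make the idea concrete, here is a minimal sketch of a fixed-phrase recognizer. It assumes the audio has already been turned into text, and the phrases and action names are hypothetical, chosen only for illustration. Anything outside the fixed set is simply not recognized.

```python
# Hypothetical fixed grammar: the only phrases this toy system accepts.
GRAMMAR = {
    "check balance": "CHECK_BALANCE",
    "change flight": "CHANGE_FLIGHT",
    "yes": "CONFIRM",
    "no": "DENY",
}


def recognize(utterance):
    """Return the action for an utterance, or None if it is outside the grammar."""
    # Lowercase and collapse whitespace so trivial variations still match.
    normalized = " ".join(utterance.lower().split())
    return GRAMMAR.get(normalized)


if __name__ == "__main__":
    print(recognize("Check balance"))      # a phrase inside the grammar
    print(recognize("what's my balance"))  # a phrase outside the grammar
```

The second call returns nothing useful even though a human would understand it immediately, which is the limitation that makes these systems feel rigid.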

As the technology improves, we’re building more products with “statistical language models.” Instead of a finite set of specific words or phrases, the system must make decisions about how likely it is that a particular set of phonemes resolves to a particular text string. In other words, nobody has to teach Siri the exact phrase “What’s the weather going to be like in San Diego tomorrow?” Siri can probabilistically determine how likely it is that the sounds coming out of your mouth translate into this particular set of words and then map those words to meanings.
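As a rough illustration of the statistical approach, the sketch below scores two candidate transcriptions of similar-sounding audio with a toy bigram model and keeps the more probable one. The probabilities are invented for the example, and the function names are hypothetical. A production recognizer would also weigh acoustic evidence, which this sketch leaves out.

```python
import math

# Invented bigram probabilities P(word | previous word), for illustration only.
BIGRAM_PROB = {
    ("<s>", "recognize"): 0.4,
    ("recognize", "speech"): 0.5,
    ("<s>", "wreck"): 0.05,
    ("wreck", "a"): 0.3,
    ("a", "nice"): 0.2,
    ("nice", "beach"): 0.1,
}
UNSEEN = 1e-4  # small fallback probability for word pairs not in the table


def log_score(words):
    """Sum the log-probabilities of each word given the word before it."""
    total = 0.0
    previous = "<s>"  # start-of-sentence marker
    for word in words:
        total += math.log(BIGRAM_PROB.get((previous, word), UNSEEN))
        previous = word
    return total


def best_transcription(candidates):
    """Return the candidate string that the toy model finds most probable."""
    return max(candidates, key=lambda text: log_score(text.split()))


if __name__ == "__main__":
    print(best_transcription(["wreck a nice beach", "recognize speech"]))
```

Nobody listed either sentence in advance. The system ranks whatever candidates it is given, which is what lets it handle phrases it was never explicitly taught.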

This sort of recognition, along with a host of other machine-learning advances, has made Natural-Language Processing (NLP) possible, although not yet perfect. As NLP improves, we get machines that not only understand the sounds we’re making but also “understand” the meaning of the words and respond appropriately. It’s the kind of thing that humans do naturally, but that seems borderline magical when you get a computer to do it.

VUI versus GUI: What’s new and what’s not

These recent technological advances are incredibly important for voice user interface (VUI) designers simply because they are making it possible for us to interact with devices in ways that 10 or 20 years ago would have been the stuff of science fiction. However, to take full advantage of this amazing new technology, we’re going to have to learn the best way to design for it. Luckily, a lot of the things that are core to user experience (UX) design are also necessary for VUI design. We don’t need to start from scratch, but we do need to learn a few new patterns.

The most important part of UX design is the user — you know, that human being who should be at the center of all of our processes — and luckily that’s no different when designing for voice and audio. Thomas Hebner, senior director of UX design practice and professional services product management at Nuance Communications, has been designing for voice interfaces for 16 years. He thinks that the worst mistakes in voice design happen when user goals and business goals don’t line up.

Great products, regardless of the interaction model, are built to solve real user needs quickly, and they always fit well into the context in which they’re being used. Hebner says, “We need to practice contextually aware design. If I say, ‘make it warmer’ in my house, something should know if I mean the toast or the temperature. That has nothing to do with speech recognition or voice design. It’s just good design where the input is voice.”
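One way to picture Hebner’s example is as a small decision over context signals rather than over the words themselves. The sketch below is purely illustrative: the room names, the toaster_running flag, and the function name are all assumptions, not part of any real product or API.

```python
def resolve_make_it_warmer(room, toaster_running):
    """Pick a target for "make it warmer" from hypothetical context signals."""
    # In the kitchen with the toaster on, the toast is the likelier target.
    if room == "kitchen" and toaster_running:
        return "toaster"
    # In rooms where people sit or sleep, assume the request is about temperature.
    if room in {"living room", "bedroom"}:
        return "thermostat"
    # Otherwise the request is ambiguous, so the caller should ask a follow-up.
    return None


if __name__ == "__main__":
    print(resolve_make_it_warmer("kitchen", True))
    print(resolve_make_it_warmer("bedroom", False))
```

Notice that nothing here involves speech recognition. The same words map to different actions depending on where the user is and what is already running.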

This is important. Many aspects of designing for voice (understanding the user, knowing the context of use, and ensuring that products are both useful and usable) are exactly the same as in designing for screens, services, or anything else. That’s good news for designers who are used to building graphical user interfaces (GUIs) or systems, because it means that the usual research and logic skills transfer well when incorporating speech into designs. If you understand the basic user-centered design process and have applied it to apps, websites, systems, or physical products, many of your skills are completely transferable.

Yet, there are several VUI-specific things that you won’t have run into when designing for other sorts of interactions, and they’re important to take into consideration.

Conversational skills

Content and tone are important in all design, but when designing for speech output, they take on an entirely new meaning. The best voice interface designs make the user feel like she’s having a perfectly normal dialog, but doing that can be harder than it sounds. Products that talk don’t just need to have good copy; they must have good conversations. And it’s harder for a computer to have a good conversation than it is for a human.

Tony Sheeder, senior manager of user experience design at Nuance Communications, has been with the company for more than 14 years and has been working in voice design for longer than that. As he explains it:

Each voice interaction is a little narrative experience, with a beginning, middle and an end. Humans just get this and understand the rules naturally — some more than others. When you go to a party, you can tell within a very short time whether another person is easy to talk to. Until recently, speech systems were that guy at the party doing everything wrong, and nobody wanted to talk to them.

While many early voice designers had a background in linguistics, Sheeder’s background was in writing scripts for interactive games, which helped him write more natural conversations. But designing for voice communication wasn’t always successful. Early voice interfaces often made people uncomfortable because the designers assumed that people would need explicit instructions. The interfaces would say things like, “Do you want to hear your bank balance? Please, say yes or no.” This violates basic rules of conversation. Sheeder felt that these interfaces made people feel strange because “the IVR would talk to you like it was human, but would instruct you to talk to it like a dog. It was like talking to a really smart dog.”
