iOS 10 Siri (source: iphonedigital on Flickr)

In the 1950s, Bell Labs built a system for single-speaker digit recognition. These early systems had tiny vocabularies and weren’t much use outside of the lab. In the 1960s and 1970s, the research continued, expanding the number of words that could be understood and working toward “continuous” speech recognition (not having to pause between every word).

Advances in the 1980s made practical, everyday speech recognition more of a reality, and by the 1990s the first viable, speaker-independent (meaning anyone could talk to it) systems came into being.

The first great era of VUIs was that of the interactive voice response (IVR) systems, which were capable of understanding human speech over the telephone in order to carry out tasks. In the early 2000s, IVR systems became mainstream. Anyone with a phone could get stock quotes, book plane flights, transfer money between accounts, order prescription refills, find local movie times, and hear traffic information, all using nothing more than a regular landline phone and the human voice.

IVR systems got a bad rap, resulting in Saturday Night Live sketches featuring Amtrak’s virtual travel assistant, “Julie,” and websites like GetHuman, which is dedicated to providing phone numbers that go directly to agents, bypassing the IVR systems.

But IVR systems were also a boon. Early users of Charles Schwab’s speech recognition trading service (which was developed by Nuance Communications in 1997) were happy to call in and get quotes over and over using the automated system, whereas prior to IVR systems they had limited their requests so as not to appear bothersome to the operators fielding their calls. In the early 2000s, a freight company received many angry calls after its IVR system was taken down for maintenance, because callers had to give order details to agents instead of using the streamlined process the IVR system had provided.

IVR systems became skilled at recognizing long strings (e.g., FedEx or UPS tracking numbers), as well as complex sentences with multiple chunks of information, such as placing bets on horse races. Many IVR systems from yesteryear were more “conversational” than some current VUIs, as they kept track of what callers had already said, and used that information to prepopulate later questions in the dialog.

The San Francisco Bay Area 511 IVR system let drivers check traffic, get commute times, and ask about bus delays, well before smartphones were available for such tasks. Because IVR systems ran 24/7, callers could complete tasks at any time, including hours when agents were not available.

The second era of VUIs

We are now in what could be termed the second era of VUIs. Mobile apps like Siri, Google Now, Hound, and Cortana, which combine visual and auditory information, and voice-only devices, such as the Amazon Echo and Google Home, are becoming mainstream. Google reports that 20 percent of its searches are now done via voice.[3]

We are in the infancy of this next phase. There are many things that our phones and devices can do well with speech—and many they cannot.

There are not many resources out there right now for VUI designers to learn from. I see many VUI and chatbot designers discovering things that we learned 15 years ago while designing IVR systems—handing off information already collected to humans, phrasing prompts correctly to elicit the right constrained responses, logging information to know how to analyze and improve systems, and designing personas.

There is much to learn from IVR design. In 2004, the book Voice User Interface Design (Addison-Wesley Professional), written by Michael Cohen, James Giangola, and Jennifer Balogh, was published. Although it’s focused on IVR design, many of the principles it describes are still relevant to today’s VUIs: persona, prosody, error recovery, and prompt design, to name a few.

This book echoes many of the same design principles, but with a focus on voice-enabled mobile phone apps and devices, and strategies to take advantage of the improved underlying technology.

Why voice user interfaces?

The youngest users of smartphones today are incredibly adept at two-thumbed texting, multitasking between chat conversations, Instagram comments, Snapchatting, and swiping left on Tinder photos of men posing with tigers. Why add another mode of communication on top of that?

Voice has some important advantages:

Speed

A recent Stanford study showed speaking (dictating) text messages was faster than typing, even for expert texters.[4]

Hands-free

Some cases, such as driving or cooking, or even when you’re across the room from your device, make speaking rather than typing or tapping much more practical (and safer).

Intuitiveness

Everyone knows how to talk. Hand a new interface to someone and have it ask that person a question, and even users who are less familiar with technology can reply naturally.

Empathy

How many times have you received an email or text message from someone, only to wonder if they were mad at you or maybe being sarcastic? Humans have a difficult time understanding tone via the written word alone. Voice, which includes tone, volume, intonation, and rate of speech, conveys a great deal of information.

In addition, devices with small screens (such as watches) and no screens (such as the Amazon Echo and Google Home) are becoming more popular, and voice is often the preferred—or the only—way to interact with them. The fact that voice is already a ubiquitous way for humans to communicate cannot be overstated. Imagine being able to create technology and not needing to instruct customers on how to use it because they already know: they can simply ask. Humans learn the rules of conversation from a very young age, and designers can take advantage of that, bypassing clunky GUIs and unintuitive menus.

According to Mary Meeker’s 2016 Internet Trends Report, 65 percent of smartphone users have used voice assistants in the last year.[5] Amazon reports at least four million Echos have been sold, and Google Home recently started shipping. Voice interfaces are here to stay.

That being said, voice is not always an appropriate medium for your users. Here are some reasons VUIs are not always a good idea:

Public spaces

Many of us now work in open-plan office spaces. Imagine asking your computer to do tasks: “Computer, find me all my Word docs from this week.” Now imagine everyone in your office doing this! It would be chaos. In addition, when you speak, which computer is listening?

Discomfort speaking to a computer

Although VUIs are becoming more commonplace, not everyone feels comfortable speaking out loud to a computer, even in private.

Some users prefer texting

Many people spend hours a day on their mobile phones, and much of that time is spent texting. That’s their normal mode, and they might not want to shift to voice.

Privacy

If they need to discuss a health issue, most users won’t want to do so by speaking to their phone on the train ride into work. It’s not just privacy for what the user says to systems, either—it’s the potential privacy violations of a VUI automatically reading your text messages out loud or giving you a reminder that it’s time to take a certain medication.

So, should your mobile app have a VUI? If your main use case is hands free—such as a cooking app or playing podcasts while driving—absolutely. If people will use your app in a particularly vulnerable or emotional state, such as for healthcare or comfort, voice can help them. If you want to persuade your users of something, whether it’s to vote “yes” on your ballot issue, or to make sure they go to bed early, or to eat better, voice can help. If you’re building a skill for Amazon Echo, which many people use in the privacy of their home, voice is your only option.

If your use case is going to be mostly in public places, such as navigating a public transit system or for people on the go, a VUI might not be appropriate. If it’s an app people will use at the workplace, having a text messaging mode might be better.

Think carefully about your users and their use cases. The main question to ask yourself is: Will your users benefit from a VUI? Adding a VUI because it’s cool and trendy is not the right path. If your app would benefit from a VUI, design with this in mind from the get-go, and create an awesome user experience.

Conversational user interfaces

The term conversational user interfaces is making a lot of headlines right now. The trend is to make everything “conversational,” from turning on the lights in your living room, to ordering pizza, to asking the fridge whether you’re out of eggs. But what does conversational actually mean?

Let’s look at a sample exchange with a real VUI device: the Amazon Echo.

An interview with Alexa

Discussing this book over dinner one evening, I talked about the various people I’ve had the chance to talk to as part of my research. My eight-year-old son had a suggestion: I should interview Alexa. Sounded like a good idea to me. (Note: after the first question, I have left out the wake word “Alexa” that I actually did use at the beginning of each question.)

  • ME

    Alexa, how are you doing?

  • ALEXA

    I’m good.

  • ME

    How old are you?

  • ALEXA

    I was released November 6th, 2014.

  • ME

    Who made you?

  • ALEXA

    A team of inventors at Amazon.

  • ME

    What’s your favorite color?

  • ALEXA

    I like sea foam.

  • ME

    What do you do for fun?

  • ALEXA

    I don’t have any hobbies.

  • ME

    Are you a robot?

  • ALEXA

    I like to imagine myself a bit like an aurora borealis...a surge of charged multicolored photons dancing through the atmosphere. Mostly though I am just Alexa.

  • ME

    What makes you feel happy?

  • ALEXA

    Music and helping people.

  • ME

    Who’s your celebrity crush?

  • ALEXA

    It has to be Jarvis. That voice...

  • ME

    What do you think of Siri?

  • ALEXA

    Siri’s a great app, but I’m different.

  • ME

    What’s your favorite movie?

  • ALEXA

    The new Ghostbusters movie was really funny. I would definitely call those ladies, if there was a ghost in my machine.

  • ME

    Are you going to read my new book?

  • ALEXA

    Sorry, something went wrong. Please try again later.

  • ME

    Thank you for your time.

  • ALEXA

    No problem. I’m happy to help.

Do you consider this set of interactions to be “conversational”? In a way, it is; it’s a back-and-forth exchange of information. But it’s really just a series of one-offs. Each individual snippet is a simple interaction, and the next one has no knowledge of the previous; each one of these exchanges could be completed on its own.

Alexa does have a few skills that go past one turn. For example, if you ask her to order laundry detergent, she’ll remind you of what brand you ordered previously, and confirm that you want to order it again. She also has a popular “choose your own adventure” game. But most of these barely meet the threshold of being considered a conversation.

These “one-offs” can still be quite useful. You can order a car to pick you up, get your current commute time, and play a song from a library of thousands of titles.

But what many of today’s VUIs lack is the ability to go beyond these simple command interfaces—that is, to have a true conversation. To get to the next level, VUIs need to be able to remember the past. There are two ways in which the past is a key component to a conversation:

  • There’s the past from previous conversations, such as what you ordered yesterday, which song you request to be played most often, and which of the two Lisas in your Contacts list you have texted 257 times versus twice.

  • There’s also the past within the same conversation: remembering what you said earlier, even if it wasn’t in the last turn. If I ask, “What time does it land?” after just checking to see if my husband’s flight took off on time, the system should know that when I say “it” I mean flight 673.

When you’ve enjoyed a good conversation with a fellow human being, it probably had some key components: contextual awareness (paying attention to you and the environment), a memory of previous interactions, and an exchange of appropriate questions. These all contribute to a feeling of common ground. As Stanford professor Herbert Clark defines it, the theory of common ground is: “individuals engaged in conversation must share knowledge in order to be understood and have a meaningful conversation.”[6]

If VUIs do not learn to include this type of context and memory, they will be stalled in terms of how useful they can be.
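
As a rough illustration of the two kinds of memory described above, here is a minimal Python sketch. Everything in it is hypothetical: the `DialogMemory` class, its method names, and the contact names “Lisa Chen” and “Lisa Park” are invented for this example, and the word-for-word replacement of “it” is a deliberately naive stand-in for real reference resolution, which is far harder.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class DialogMemory:
    """Hypothetical store for the two kinds of 'past' a VUI needs."""

    # Long-term memory: how often the user has texted each contact,
    # carried over from previous conversations.
    text_counts: Dict[str, int] = field(default_factory=dict)
    # Short-term memory: the thing most recently discussed in this conversation.
    last_entity: Optional[str] = None

    def pick_contact(self, first_name: str) -> Optional[str]:
        """Return the most-texted contact with this first name, if any."""
        matches = {
            name: count
            for name, count in self.text_counts.items()
            if name.split()[0] == first_name
        }
        if not matches:
            return None
        return max(matches, key=matches.get)

    def mention(self, entity: str) -> None:
        """Record what the conversation is currently about."""
        self.last_entity = entity

    def resolve(self, utterance: str) -> str:
        """Naively replace the word 'it' with the last entity discussed."""
        if self.last_entity is None:
            return utterance
        words = utterance.split()
        resolved = [self.last_entity if w.lower() == "it" else w for w in words]
        return " ".join(resolved)


# Assumed data: two contacts named Lisa, texted 257 times and twice.
memory = DialogMemory(text_counts={"Lisa Chen": 257, "Lisa Park": 2})
print(memory.pick_contact("Lisa"))

# The user has just asked whether flight 673 took off on time.
memory.mention("flight 673")
print(memory.resolve("What time does it land?"))
```

Even this toy version shows why memory matters: without the stored counts the system would have to ask which Lisa you mean, and without the short-term entity it could not answer the follow-up question at all.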

What is a VUI designer?

This book is about how to design VUIs—but what does a VUI designer actually do? VUI designers think about the entire conversation, from start to finish, between the system and the end users. They think about the problem that is being solved and what users need in order to accomplish their goals. They do user research (or coordinate with the user research team) in an effort to understand who the user is. They create designs, prototypes, and product descriptions. They write up descriptions (sometimes with the help of copywriters) of the interactions that will take place between the system and the user. They have an understanding of the underlying technology and its strengths and weaknesses. They analyze data (or consult with the data analysis team) to learn where the system is failing and how it can be improved. If the VUI must interact with a backend system, they consider the requirements that must be addressed. If there is a human component, such as a handoff to an agent, VUI designers think about how that handoff should work, and how the agents should be trained. VUI designers have an important role from the conceptual stages of the project all the way to the launch and should be included at the table for all the various phases.

Although VUI designers often do all of these tasks, they can also work in smaller roles, such as designing a single Amazon Echo skill. Regardless of the size of the role or the project, this book will help designers (as well as developers) understand how to craft the best VUIs possible.

Chatbots

Although this book is focused on VUIs, I want to briefly discuss chatbots, as well. Google defines a chatbot as “a computer program designed to simulate conversation with human users, especially over the Internet.” The word “bot” is also sometimes used to refer to these types of interactions.

Chatbots can have a VUI, but more typically they use a text-based interface. Most major tech companies—including Google, Facebook, and Microsoft—have platforms to develop bots.

Chatbots might be all the rage, but for the most part, they have not evolved very far from the original ELIZA, an early natural language processing computer program created in the 1960s. One popular exception is Microsoft’s Xiaoice, which mines the Chinese Internet for human conversations to build “intelligent” responses.

Text-only chatbots are not always more efficient than a GUI. In Dan Grover’s essay “Bots won’t replace apps. Better apps will replace apps,” he compares ordering a pizza using a pizza chatbot (Figure 1) versus ordering pizza using the Pizza Hut WeChat integration. It took 73 taps to tell the bot what he wanted, but only 16 taps via the app (Figure 2), because the app makes heavy use of the GUI.

Figure 1. Microsoft pizza bot example, annotated by Dan Grover.

As Grover says:

The key wins for WeChat in the interaction (compared to a native app) largely came from streamlining away app installation, login, payment, and notifications, optimizations having nothing to do with the conversational metaphor in its UI.

Many bots, however, use a combination of GUI widgets as well as text-based interfaces. This can greatly increase the efficiency and success of the interactions because it’s much more clear to the user what they can do.

Figure 2. Fewer total taps to use the WeChat Pizza Hut app (image created by Dan Grover).

Chatbots can provide a great experience for users who don’t want to download an app or add their credit card. Instead, they could scan a code, and immediately begin interacting with the service they need, such as ordering food, purchasing movie tickets, or finding out information about a museum they’re visiting.

Never add a chatbot for the sake of adding a chatbot. How could the chatbot benefit your users? As Emmet Connolly says, “Bots should be used to improve the end user experience, not just to make life easier for customer support teams.”[7]

Conclusion

When I was eight, my dad bought the family our first computer: a Commodore Vic-20. I quickly became fascinated with the idea of having a conversation with it and wrote a simple chatbot. When it didn’t understand what was typed, it asked for three possible suggestions it could use when it encountered that query in the future.

When I got my first smartphone, it was years before I used the speech recognition feature. I didn’t think it would work. Now, we’ve arrived at the point that I expect speech recognition to be available wherever I go; recently on a hike, when my son asked me what kind of tree he was pointing at, I actually started to say, “Alexa...” before I realized it wouldn’t work.

Although VUIs are becoming more common, there are still many users who are unfamiliar with them or don’t trust them. Many people try out the voice recognition on their smartphone once and then, after it fails, never try it again. Designing well from the get-go means fewer unrecoverable failure points, which will build trust with users.

We have many nights of blood, sweat, and tears ahead of us as we design our VUIs of the future, but that future is already here. Let’s ensure that we design with care. Let’s use our knowledge of human psychology and linguistics as well as user experience design to ensure that we create usable, useful, and even delightful VUIs.
