Cathy Pearl on designing conversational interfaces

The O'Reilly Design Podcast: The VUI tools ecosystems, and voice gender and accent selections.

By Mary Treseler
November 10, 2016
Conversation. (source: Valery Kenski on Flickr)

In this week’s Design Podcast, I sit down with Cathy Pearl, director of user experience at Sensely and author of Designing Voice User Interfaces. We talk about defining conversations, the growing tools ecosystems, and how voice has lessened our screen obsession.

Here are a few highlights from our conversation:

What constitutes a conversation?

To me, I do have a definition of ‘conversational.’ I was talking about this at O’Reilly Bot Day last week. For example: my Amazon Echo. I don’t view the Amazon Echo generally as conversational because most of the things I do are one-offs. I’ll say, ‘What time is it?’ or ‘Turn on the lights’ or ‘Set a timer,’ and she’ll give me one response, and we’re done. If I go up to you and say, ‘How are you doing today?’ and you say, ‘Fine,’ and then we turn and walk away, I don’t really see that as having a conversation. That would not be a very good conversation. One of my definitions for ‘conversational’ is that it has to have more than one turn.

A lot of times, with a lot of these voice assistants, you can do multiple turns, but they don’t remember what you said before. Each turn is like a brand-new conversation, which would be really annoying if you were talking to somebody and every time you said something, they didn’t remember anything you told them before.

In relation to that, they really need to understand pronouns. This is something that humans, even toddlers, can understand. I can tell a toddler, ‘Go get the red ball out of the green box,’ and the kid knows that I want the red ball. Computers have a really hard time with that. It’s starting to improve. Google, especially, I think, is working hard on this task. I’ve heard that with Google Home, they’re going to be better about that kind of thing, but those are some of the things I think systems need in order to be conversational, and that could be either through voice or through text.
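
As a rough illustration of the two requirements described above (remembering earlier turns and resolving pronouns), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the DialogueContext class, the interpret function, and the tiny entity and pronoun lists are hypothetical and do not reflect how the Echo, Google Home, or any other assistant works internally. Real systems use far more sophisticated reference resolution.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DialogueContext:
    """Hypothetical per-conversation memory (not any vendor's API)."""
    last_entity: Optional[str] = None
    history: List[str] = field(default_factory=list)


# Toy vocabulary, assumed for this example only.
KNOWN_ENTITIES = ("kitchen lights", "timer", "thermostat")
PRONOUNS = {"it", "them", "that"}


def interpret(utterance: str, ctx: DialogueContext) -> str:
    """Return the utterance with a pronoun replaced by the remembered entity."""
    text = utterance.lower().strip(" .?!")
    ctx.history.append(text)

    # If the user names an entity explicitly, remember it for later turns.
    for entity in KNOWN_ENTITIES:
        if entity in text:
            ctx.last_entity = entity
            return text

    # Otherwise, try to resolve a pronoun against what was said before.
    words = text.split()
    if ctx.last_entity and any(w in PRONOUNS for w in words):
        return " ".join(ctx.last_entity if w in PRONOUNS else w for w in words)

    # Nothing to resolve; a real system would ask a clarifying question here.
    return text


ctx = DialogueContext()
print(interpret("Turn on the kitchen lights", ctx))
print(interpret("Now dim them.", ctx))
```

Because the context object survives between calls, the second turn can be understood in light of the first. With a fresh context for every turn, "dim them" has nothing to refer to, which is the "brand-new conversation" problem.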

Designing for how people talk not how you want them to talk

My biggest principle and advice is to design for how people actually talk and not how you want them to talk. I think as designers and developers, we get very focused on whatever we’re building, and we think it’s very obvious: ‘Yes, the user will know what they can say here.’ It’s really not true. Especially if you’re designing something like a virtual assistant, like Siri. She says, ‘How can I help you?’ That really sets up the user for failure in a lot of cases because you really can’t just say anything. There’s a limited number of things you can say. We need to spend a lot of time thinking about how we will communicate to the user what they can actually say.

There’s different ways to do that. One thing that’s really important is when you’re first designing your system, spend a lot of time writing what we call sample dialogues, which are essentially back-and-forth conversations, like a film script between the voice user interface and the user. You write these down. Then, you read them out loud with somebody. You learn very quickly—if I wrote the system and I am reading my voice user interface prompts, and then I have someone else responding, I learn very quickly, ‘Really, someone would say that? I didn’t expect that.’ You can really build your system well from the beginning by doing some simple design exercises like that.

Another thing that’s really important to understand about voice is that speech recognition is not perfect. Yes, it’s way, way, way, way better than it used to be, but it still makes a lot of mistakes. You have to build graceful error recovery into every voice system no matter what. I don’t think, personally, that it will ever be 100% accurate. Human speech recognition is certainly not 100% accurate. You have to spend a lot of time thinking about your error recovery.
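
One common way to approach error recovery is to combine confidence thresholds with escalating reprompts. The Python sketch below is only one possible shape for it: the threshold values, the prompt wording, and the next_prompt function are all assumptions made for illustration, and the transcript and confidence score are presumed to come from whatever speech recognizer the system uses.

```python
from typing import Optional, Tuple

# Assumed cutoffs; a real system would tune these against recorded data.
ACCEPT_THRESHOLD = 0.8
CONFIRM_THRESHOLD = 0.5

# Each reprompt gives more help than the last instead of repeating itself.
REPROMPTS = [
    "Sorry, I didn't catch that. What time would you like?",
    "You can say a time, like 'three p.m.' What time would you like?",
]
HANDOFF = "I'm still having trouble. Let's try another way."


def next_prompt(transcript: Optional[str], confidence: float,
                failures: int) -> Tuple[str, int]:
    """Pick the next system prompt and return it with the updated failure count."""
    if transcript and confidence >= ACCEPT_THRESHOLD:
        # Confident enough to proceed; reset the failure counter.
        return f"Got it: {transcript}", 0
    if transcript and confidence >= CONFIRM_THRESHOLD:
        # Plausible but uncertain: confirm rather than guess.
        return f"Did you say {transcript}?", failures

    # Nothing usable was recognized: escalate the help, then hand off.
    failures += 1
    if failures <= len(REPROMPTS):
        return REPROMPTS[failures - 1], failures
    return HANDOFF, failures


print(next_prompt("three p.m.", 0.62, 0)[0])
print(next_prompt(None, 0.0, 0)[0])
```

The design choice here is that the system never dead-ends. A low-confidence result is confirmed, a failed turn gets a more helpful prompt, and repeated failures lead to a different path instead of an endless loop.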

The tools ecosystem

I’m actually very excited right now because I think we’re starting to see a lot of tools actually come out, and I’m looking forward to learning a lot of them. For example, there’s a company called PullString. They used to be called ToyTalk. They made the Hello Barbie and some kids’ apps like the Winston Show. They just put out an authoring tool. I downloaded it. I’m really looking forward to trying that for creating new sample dialogues, new stories. Then, there are tools out of Conversant Labs, which I think will be really great for doing prototyping, which is something we’re sorely lacking in the real world, the ability to do quick prototyping.

Then, you’ve got a mixture of other tools from places like API.AI, which was bought by Google; Nuance’s Mix; and one that is Facebook’s. These allow you to build models by giving them a lot of sample sentences and having the system learn from them. For example, if you’re trying to build a calendar VUI, you might put a bunch of sample sentences in about how I want to schedule an appointment. It can learn from those examples so that when somebody says something new that you didn’t already write down, it can still understand. I’m just very excited that these tools are finally coming out. It’s always been the Holy Grail of the voice user interface, where we were always trying to build tools at Nuance. It’s very difficult to do. Hopefully, we’re really getting to the point where they’re workable.
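
To make the idea of learning from sample sentences concrete, here is a deliberately simple Python sketch for the calendar example. It is not how API.AI, Nuance Mix, or any other tool mentioned here works; those train statistical models. This toy version swaps in plain word overlap (Jaccard similarity) so that it fits in a few lines. The intent names, sample sentences, classify function, and the 0.2 cutoff are all invented for illustration.

```python
import re
from typing import Dict, List, Optional, Set

# Hypothetical training data: a few sample sentences per intent.
SAMPLES: Dict[str, List[str]] = {
    "schedule_appointment": [
        "I want to schedule an appointment",
        "book a meeting for tomorrow",
        "set up an appointment with my doctor",
    ],
    "check_calendar": [
        "what is on my calendar today",
        "do I have any meetings tomorrow",
    ],
}


def tokens(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Shared words divided by total distinct words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(utterance: str, min_score: float = 0.2) -> Optional[str]:
    """Return the intent whose samples best match, or None if nothing is close."""
    words = tokens(utterance)
    best_intent, best_score = None, 0.0
    for intent, sentences in SAMPLES.items():
        for sentence in sentences:
            score = jaccard(words, tokens(sentence))
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent if best_score >= min_score else None


# A sentence that was never written down still maps to the right intent.
print(classify("can you book an appointment for tomorrow"))
# An unrelated request matches nothing and should trigger error recovery.
print(classify("play some jazz"))
```

Even this crude matcher shows the appeal of the approach: the designer supplies examples rather than hand-writing every phrasing, and the system generalizes to nearby wordings.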

Post topics: O'Reilly Design Podcast