What should Alexa do?

The success of the Amazon Echo’s speech interface shows there's an opportunity for someone to build a completely new mobile operating system.

By Tim O’Reilly
September 29, 2016
Tim O'Reilly's Alexa, his Nexus 6P, and his kettle. It's his kitchen's attack on the cell phone OS. (source: Tim O'Reilly)

Much as existing phones seemed curiously inert after your first touch of an iPhone screen in 2007, once you’ve used an Amazon Echo with Alexa, every device that isn’t always listening and ready to respond to your commands seems somehow lacking. Alexa, not Siri, not Google Now, and not Cortana, has ushered in the age of the true intelligent agent, and we are approaching a tipping point where speech user interfaces are going to change the entire balance of power in the technology industry.

Last month, I wrote a piece outlining why Apple, Google, Microsoft, automakers, home electronics manufacturers, and appliance makers (virtually every consumer device developer) should be rethinking their user interfaces in light of the Echo’s success, asking themselves “What would Alexa do?” I’ve continued to think about the impact of speech user interfaces, and it’s become clear to me that Alexa challenges the very foundations of today’s mobile operating systems.

As I illustrated in that previous piece (do read it if you haven’t already) by comparing conversations with Alexa on the Echo and with Google’s voice recognition on my Nexus 6P Android phone, the fundamental interaction paradigm of the phone isn’t well suited to the conversational era. Like the touchscreen, the voice agent simply serves as a launcher. Control is passed to whichever app you launch, and once that app is up and running, the voice agent is out of the picture. I’m back in the touchscreen-oriented paradigm of last generation’s apps. By contrast, with the Amazon Echo, I can “stack” multiple apps (music, weather, timers, calls out to independent apps like Uber) while Alexa remains on call, dealing with ongoing questions or commands and passing them along to whichever app seems most appropriate.
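To make the contrast concrete, here is a minimal sketch of the two models. Everything in it is an illustrative assumption rather than how Android, iOS, or Alexa actually work: the `App`, `LauncherAgent`, and `PersistentAgent` classes, the keyword matching, and the sample apps are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class App:
    """A hypothetical voice-capable app that declares the keywords it handles."""
    name: str
    keywords: set

    def can_handle(self, utterance: str) -> bool:
        # Naive keyword overlap stands in for real intent recognition.
        return bool(set(utterance.lower().split()) & self.keywords)

    def handle(self, utterance: str) -> str:
        return f"{self.name} handled: {utterance}"


@dataclass
class LauncherAgent:
    """Launcher model: the agent picks an app once, then steps aside."""
    apps: list
    foreground: Optional[App] = None

    def hear(self, utterance: str) -> str:
        if self.foreground is not None:
            # The agent is out of the picture; the foreground app gets everything.
            if self.foreground.can_handle(utterance):
                return self.foreground.handle(utterance)
            return f"{self.foreground.name} ignored: {utterance}"
        for app in self.apps:
            if app.can_handle(utterance):
                self.foreground = app
                return app.handle(utterance)
        return "no app found"


@dataclass
class PersistentAgent:
    """Always-present model: every utterance passes through the agent."""
    apps: list
    active: list = field(default_factory=list)

    def hear(self, utterance: str) -> str:
        for app in self.apps:
            if app.can_handle(utterance):
                if app not in self.active:
                    self.active.append(app)  # apps stack up; none takes over
                return app.handle(utterance)
        return "no app found"


music = App("Music", {"play", "pause", "volume"})
timer = App("Timer", {"timer", "minutes"})

launcher = LauncherAgent([music, timer])
print(launcher.hear("play some jazz"))
print(launcher.hear("set a timer for five minutes"))  # Music has the foreground

agent = PersistentAgent([music, timer])
print(agent.hear("play some jazz"))
print(agent.hear("set a timer for five minutes"))  # routed to Timer
```

In the launcher version, the second request goes to the music app and is dropped. In the persistent version, the agent stays in the loop and hands it to the timer while the music keeps playing.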

The more I thought about it, the more I realized that Alexa on the Echo seems so surprising not because its speech recognition is better (it isn’t), nor because it lets you ask for things that neither Siri nor Google can do (it doesn’t), but because its fundamental human interface is superior. The agent remains continuously, courteously present, doing its best to help.

On the phone, the easiest thing for developers to do is to simply use voice to summon the app, and then let the app’s old touchscreen metaphor take over. That’s what Apple and Google do, and that’s why the interactions seem so flawed whenever they involve a function that Siri or Google can’t complete on their own.

In short, Apple and Google will need to completely rethink the iOS and Android operating systems for the age of voice. Not only that, every app will have to be refactored to accept the new interaction paradigm.

I’d already been thinking about the further implications of Alexa. In my first piece, I made the case that every device maker would need to redesign for voice control, but I hadn’t taken the thought to its logical conclusion: that there’s an opportunity for a completely new mobile OS.

The question is whether Apple, Google, Amazon, or some as-yet unknown player will seize this advantage. Given Jeff Bezos’ penchant for bold bets, I wouldn’t put it past Amazon to be the first to create a phone OS for the conversational era. I doubt that the first Alexa-enabled phone will do this, but the limitations of the handoff to Android or iOS will make clear the opportunity.

P.S. A few nights ago, when I was on stage with Mike George and Toni Reid of the Amazon Alexa team for an interview with them at the Churchill Awards, Mike said something really important when I asked him for the secret of the Echo’s success. “We didn’t have a screen,” he said. And that’s exactly it. In the age of voice, you have to design as if there is no screen, even for devices that have one. When you use an app that relies on the screen, you still have to provide affordances for the controlling voice agent to interrupt the app or modify its operation.
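One way to picture those affordances is as a small contract that every screen-based app exposes to the voice agent. The sketch below is purely hypothetical: the `VoiceControllable` protocol, its `interrupt` and `modify` methods, the `RecipeApp` example, and the `agent_relay` function are names invented for illustration, not part of any real platform API.

```python
from typing import Protocol


class VoiceControllable(Protocol):
    """Hypothetical contract a screen-based app implements for the voice agent."""

    def interrupt(self) -> None: ...

    def modify(self, command: str) -> bool: ...


class RecipeApp:
    """Example screen-oriented app that still accepts voice control."""

    def __init__(self) -> None:
        self.step = 1
        self.paused = False

    def interrupt(self) -> None:
        # The agent can pause the app at any time, regardless of screen state.
        self.paused = True

    def modify(self, command: str) -> bool:
        if command == "next step":
            self.step += 1
            self.paused = False
            return True
        return False  # unrecognized; the agent can try another app


def agent_relay(app: VoiceControllable, utterance: str) -> str:
    """The agent stays in control and relays spoken commands to the app."""
    if utterance == "stop":
        app.interrupt()
        return "paused"
    return "done" if app.modify(utterance) else "not understood"


app = RecipeApp()
print(agent_relay(app, "next step"))
print(agent_relay(app, "stop"))
```

The point of the design is that the app never owns the microphone. It only declares what the agent is allowed to ask of it.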

There’s a whole new world of user interface experimentation ahead of us, that’s for sure. But a fundamental redesign of the underlying device operating systems may also be required.

It’s easy to imagine that Amazon is already working on a successor to the ill-fated Fire phone. An Alexa phone and an Alexa tablet must surely be in the works. But I hope that Amazon doesn’t take the presence of a screen as license to free itself from the constraint that has made the Echo such a work of genius. Even if you have a screen, you have to act as though you don’t, and use that constraint to think through what the most human-friendly interaction pattern would be.

I should note that the constraint alone isn’t what made the Echo so good. Mike and Toni explained eloquently how Amazon’s focus on the customer was the real secret sauce. I don’t have exact quotes, since the video isn’t available yet, and obviously, I wasn’t able to take notes, but Mike and Toni outlined how it all started with Amazon’s mission statement, “To be Earth’s most customer-centric company,” and how this manifests itself in its famous “working backwards” process. As explained by Amazon CTO Werner Vogels, this process consists of four steps:

  1. Start by writing the Press Release. Nail it. The press release describes in a simple way what the product does and why it exists—what are the features and benefits. It needs to be very clear and to the point. Writing a press release up front clarifies how the world will see the product—not just how we think about it internally.
  2. Write a Frequently Asked Questions document. Here’s where we add meat to the skeleton provided by the press release. It includes questions that came up when we wrote the press release. You would include questions that other folks asked when you shared the press release and you include questions that define what the product is good for. You put yourself in the shoes of someone using the product and consider all the questions you would have.
  3. Define the customer experience. Describe in precise detail the customer experience for the different things a customer might do with the product. For products with a user interface, we would build mock ups of each screen that the customer uses. For web services, we write use cases, including code snippets, which describe ways you can imagine people using the product. The goal here is to tell stories of how a customer is solving their problems using the product.
  4. Write the User Manual. The user manual is what a customer will use to really find out about what the product is and how they will use it. The user manual typically has three sections, concepts, how-to, and reference, which between them tell the customer everything they need to know to use the product. For products with more than one kind of user, we write more than one user manual.

Mike and Toni also talked about how the Alexa team tried to guide how Alexa should act by imagining how a human would act in the same situation. That seems like a wonderful way to think about human-computer interaction in the AI era.

