People are really good at talking to each other. That shouldn’t be too surprising. Conversation among human beings has evolved over a very long period of time — and now we’re starting to talk to our stuff, and in some cases, it’s talking back.
Asking Siri (or Cortana or Google Now) some simple questions is just the beginning of what’s coming. In fact, we’re in the midst of a significant shift in voice and conversation technology. Companies like Amazon, Facebook, and Google are falling over each other to hire researchers and acquire related companies, and they are starting to use this talent in new and interesting ways.
This is the first post in a series of articles I’ll use to explore speech and conversational interfaces. The subject will be dialog systems in general, with a focus on the intelligent interfaces we can expect to see more of in the future. Other topics could include:
- Design considerations for spoken language systems
- Emerging research in the area
- Changes to how we interact with technology, and the social impact those changes might have
If you’re someone with a finely tuned hype radar, some skepticism about just how good these technologies might be is understandable. Most of the speech-to-text and automated telephone interactions available up to this point have been frustrating to use. People regularly share tips for short-circuiting interactive voice response (IVR) trees (I hear swearing helps!). And even Siri can seem clueless a lot of the time.
But there are also indications that things are improving. Bigger data sets for training, faster processors, and innovations in algorithms are advancing deep learning so it can now be applied to some of the technical problems that still plague voice recognition, like operating in noisy environments or otherwise dealing with imperfect inputs. We’re still some ways off from automated agents with complete human-like understanding of the world, but by keeping the focus on narrower domains, researchers have been able to create effective experiences for people when talking to computers.
You can talk to your house?
If you doubt the trajectory of rapid improvement in conversational interactions, get ready to be caught flat-footed. And despite the short-term frustrations, it could be a really good thing. Being able to talk to systems like our cars or our houses opens up all kinds of possibilities. Notice that I’m not talking about speech recognition itself (although that’s an important enabler), but rather I’m referring to conversation — rich interactions between you and your technology. Some researchers and designers are now imagining the Conversational User Interface (CUI) as a new paradigm in technology that could be as transformative as Graphical User Interfaces were at one time.
What does it mean to have a conversational interaction? It’s not the same problem as understanding natural language queries for simple question answering. Several search interfaces are getting quite good at that, but the interaction is generally a one-shot deal: you ask a question and get an answer, or, in the too-frequent fallback case, a list of search results.
So what counts as a conversation?
A conversation is a sequence of turns in which each utterance follows from what’s already been said and is relevant to the overall interaction. Dialog systems must maintain context over several turns. Current personal assistants can maintain limited context for a few turns, but you still wouldn’t call the interaction a rich experience. Those experiences are coming, though. I’m not suggesting the kind of Hollywood-invented hyperintelligence that far exceeds human capability. I’m talking about intelligent technology that can understand intentions both stated and implied, with knowledge that goes deep in a particular area.
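To make "maintaining context over several turns" concrete, here is a minimal sketch of a dialog state tracker. All names (`DialogContext`, the slot keys, the parsed inputs) are hypothetical illustrations, not any real assistant's API; real systems parse the utterances with a spoken language understanding component rather than receiving slots directly.

```python
# Minimal sketch of multi-turn context tracking (all names hypothetical).
# Each turn may fill or override "slots"; later turns inherit earlier values,
# which is what lets an elliptical follow-up like "What about Paris?" work.

class DialogContext:
    def __init__(self):
        self.slots = {}      # accumulated state, e.g. {"intent": ..., "city": ...}
        self.history = []    # raw utterances, oldest first

    def update(self, utterance, parsed_slots):
        """Merge this turn's parsed slots into the running context."""
        self.history.append(utterance)
        self.slots.update(parsed_slots)  # new values override, old ones persist
        return dict(self.slots)

ctx = DialogContext()
# Turn 1: a full query establishes the intent and a slot.
ctx.update("Weather in London today?", {"intent": "weather", "city": "London"})
# Turn 2: the follow-up only changes one slot; the intent carries over.
state = ctx.update("What about Paris?", {"city": "Paris"})
print(state)  # {'intent': 'weather', 'city': 'Paris'}
```

The carried-over `intent` is the whole point: a one-shot question-answering system would have nothing to attach "What about Paris?" to.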
Imagine a kitchen assistant that knows a lot about cooking. The assistant may not know anything about Julia Child’s biography and won’t help your children with homework, but it could be extremely useful for task-based, collaborative problem solving when you’re cooking. If you’re wondering why your chocolate sauce is grainy and separated, your cooking agent could help you figure out that it has curdled and what you did wrong. It could then walk you through the steps to try to save it. Also, consider the possibilities if such an assistant could maintain a longer-term context. You might ask a question like, “What was that rice dish I made a few weeks ago?” and get back the response, “Do you mean the one with the Mediterranean spices or the chicken pilaf?”
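The longer-term context lookup in that rice-dish exchange can be sketched as a search over a dated history of past dishes, followed by a clarifying question when more than one candidate matches. Everything here (the `recall` function, the tags, the time window) is an invented illustration of the idea, not a real system:

```python
# Hedged sketch: resolving "that rice dish I made a few weeks ago".
# A hypothetical assistant keeps a dated log of past dishes and, given a
# vague query, returns matching candidates so it can ask which one you meant.

from datetime import date, timedelta

history = [
    {"dish": "Mediterranean spiced rice", "tags": {"rice"},
     "made": date.today() - timedelta(days=18)},
    {"dish": "chicken pilaf", "tags": {"rice", "chicken"},
     "made": date.today() - timedelta(days=20)},
    {"dish": "chocolate sauce", "tags": {"dessert"},
     "made": date.today() - timedelta(days=3)},
]

def recall(tag, min_days=7, max_days=42):
    """Find past dishes with a tag in a rough 'a few weeks ago' window."""
    newest = date.today() - timedelta(days=min_days)
    oldest = date.today() - timedelta(days=max_days)
    return [h["dish"] for h in history
            if tag in h["tags"] and oldest <= h["made"] <= newest]

candidates = recall("rice")
# Two matches, so the assistant would respond with a disambiguation question:
# "Do you mean the one with the Mediterranean spices or the chicken pilaf?"
print(candidates)
```

Mapping the fuzzy phrase "a few weeks ago" onto a date window, and asking a follow-up question instead of guessing, are exactly the kinds of behavior that separate a conversational agent from one-shot search.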
One of the biggest enablers of an increase in spoken language systems is actually the confluence of several technologies that are all currently enjoying their own rapid advances. Besides speech recognition and speech generation, conversational technologies are also benefiting from advances in knowledge modeling, spoken language understanding, natural language generation, and decision making — machine learning, and more recently deep learning, are accelerating progress in all of these areas.
Computing resources are contributing, too: faster and better parallel processing, and access to data at very large scale. Individual algorithms specific to speech and dialog are getting better. Both Google and Microsoft have announced significant progress in their speech processing technologies using deep learning, which lets them take advantage of simpler algorithms that require less human tuning.
Demand for dialog-based systems will also drive adoption and push further development, both because of the number of devices and because of the increasing complexity of the things we want to do with them. There seems to be no end to the number of new and smaller devices on the market. Mobile software designers have done a great job adapting GUIs to cell phones and other devices, but even these adaptations can come up short as the things we want to do with ever-shrinking devices get more complicated. Some smart watches, for example, have small touchscreens, but otherwise they’re usually linked to a companion phone for the user interface.
A sophisticated enough conversational interface could obviate the need to carry a smartphone altogether. Although Google Glass has been pulled off the market for the time being, it’s still an example of a device whose existence depends on voice control. We can expect to see more new devices that wouldn’t be possible without a voice interface. The Internet of Things (IoT) is generally associated with the small scale. People don’t think of user interfaces for IoT objects, since the objects are usually too small to accommodate UI controls, but microphones can be tiny and easily fit into such devices. In other areas, conversational companion robots are being considered to help with things like medication compliance.
The context in which we interact with technology, the when and how, is also steering user interactions toward conversation. Voice interfaces provide hands-free and, just as significantly, eyes-free control. Driving is an obvious example of a situation where you don’t want to interact with a keyboard and screen because you’re doing something else. Car manufacturers are thinking beyond simple voice control for accessories. They are imagining intelligent driving companions that talk to you about what music you want to hear, discuss which nearby restaurant you’re in the mood for, or try to keep you alert when you start to feel drowsy. All of this can happen with both hands on the wheel and your eyes on the road. Researchers and developers are also working on contextually aware systems that can anticipate your needs, but in many cases there’s an easier way for a system to know what you need: you tell it.
Should everything be voice controlled?
Still, conversational interfaces are not for everything. In circumstances where bandwidth and processing power are constrained, voice processing might not be possible. There are situations where talking out loud isn’t reasonable, such as in some public places. Speaking may not always be the best way to interact with a device; for some tasks and goals, the best interface might be to press a button without having to say anything. In fact, if you’re designing a voice-enabled system, you should compare your solution to one with a GUI or physical controls to make sure speech really enhances the user experience.
In the future, we can expect to see much more natural interactions with computers. GUIs helped propel personal computers into widespread use; CUIs could be the enabler of all manner of devices and capabilities, even those we haven’t conceived of yet. Couple conversational interaction with better context awareness, like knowing about your environment and nearby resources, and there are bound to be some very interesting applications. I don’t know what they’ll be, but I’m guessing some will be tremendously useful, some frivolous, and some probably creepy (we don’t like things to be too smart).
I’ve always been interested in the combination of humans and technology to address complex and serious issues in the world. Being able to easily interact with very sophisticated tools can only strengthen our ability to tackle fundamental problems. It’s hard to know exactly what to expect, but I expect it to be very interesting.