Machine learning in the wild

A bridge between robust control and reinforcement learning.

By David Beyer
November 24, 2015
Owachomo bridge in Natural Bridges National Monument. (Source: Pretzelpaws on Wikimedia Commons)

Download our free report “Future of Machine Intelligence: Perspectives from Leading Practitioners,” now available. The following interview is one of many included in the report.

Benjamin Recht is an associate professor in the electrical engineering and computer sciences department as well as the statistics department at the University of California at Berkeley. His research focuses on scalable computational tools for large-scale data analysis, statistical signal processing, and machine learning — exploring the intersections of convex optimization, mathematical statistics, and randomized algorithms.

Key takeaways:

  1. Machine learning can be effectively related to control theory, a field with roots in the 1950s.
  2. In general, machine learning trains on vast amounts of data to predict the average case. Control theory, by contrast, builds a physical model of reality and warns of the worst case (e.g., how the plane responds to turbulence).
  3. Combining control principles with reinforcement learning will enable machine learning applications in areas where the worst case can be a question of life or death (e.g., self-driving cars).

David Beyer: You’re known for thinking about computational issues in machine learning, but you’ve recently begun to relate it to control theory. Can you talk about some of that work?

Benjamin Recht: I’ve written a paper with Andy Packard and Laurent Lessard, two control theorists. Control theory is most commonly associated with aviation or manufacturing. So you might think, what exactly does autopilot have to do with machine learning? We’re making great progress in machine learning systems, and we’re trying to push their tenets into many different kinds of production systems. But we’re doing so with limited knowledge about how well these things are going to perform in the wild.

This isn’t such a big deal with most machine learning algorithms that are currently very successful. If image search returns an outlier, it’s often funny or cute. But when you put a machine learning system in a self-driving car, one bad decision can lead to serious human injury. Such risks raise the stakes for the safe deployment of learning systems.

DB: Can you explain how terms like robustness and error are defined in control system theory?

BR: In engineering design problems, robustness and performance are competing objectives. Robustness means having repeatable behavior no matter what the environment is doing. On the other hand, you want this behavior to be as good as possible. There are always some performance goals you want the system to achieve. Performance is a little bit easier to understand — faster, more scalable, higher accuracy, etc. Performance and robustness trade off with each other: the most robust system is the one that does nothing, but the highest performing systems typically require sacrificing some degree of safety.

DB: Can you share some examples and some of the theoretical underpinnings of the work and your recent paper?

BR: The paper with Laurent and Andy noted that all of the algorithms we popularly deploy in machine learning look like classic dynamical systems that control theorists have studied since the 1950s. Once we drew the connection, we realized we could lean on 70 years of experience analyzing these systems. Now we can examine how these machine learning algorithms perform as you add different kinds of noise and interference to their execution.

For one very popular algorithm, called the Heavy Ball method, we discovered that if you use off-the-shelf settings, there are cases when it never converges. No one had yet produced a formal proof that the algorithm converged, but everybody assumed it worked in practice. Moreover, we were able to modify the parameters to find a regime where it always converged. What makes this analysis toolkit so useful is that we can not only certify whether a method will work, but we can interactively manipulate a specified algorithm to make it more robust.
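For readers who have not met the Heavy Ball method, the following is a minimal sketch of its update rule, gradient descent plus a momentum term. The quadratic test function, the constants `m` and `L`, and the helper names are assumptions made for illustration; the step size and momentum are the classical textbook settings for quadratics. On this easy problem the iteration converges, so the non-convergent case mentioned above is not reproduced here.

```python
import math

def heavy_ball(grad, x0, alpha, beta, steps):
    """Iterate x_{k+1} = x_k - alpha * grad(x_k) + beta * (x_k - x_{k-1})."""
    x_prev = list(x0)
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        x_next = [xi - alpha * gi + beta * (xi - pi)
                  for xi, gi, pi in zip(x, g, x_prev)]
        x_prev, x = x, x_next
    return x

# Illustrative (assumed) objective: f(x) = 0.5 * (m * x[0]**2 + L * x[1]**2),
# whose curvature ranges from m to L.
m, L = 1.0, 10.0

def quad_grad(x):
    return [m * x[0], L * x[1]]

# Classical step size and momentum for a quadratic with curvature in [m, L].
alpha = 4.0 / (math.sqrt(L) + math.sqrt(m)) ** 2
beta = ((math.sqrt(L) - math.sqrt(m)) / (math.sqrt(L) + math.sqrt(m))) ** 2

x_final = heavy_ball(quad_grad, [1.0, 1.0], alpha, beta, steps=200)
distance = math.hypot(x_final[0], x_final[1])  # distance from the minimizer at 0
print(distance)
```

Changing `alpha` and `beta` and rerunning is a crude, empirical version of the tuning described in the interview; the analysis in the paper certifies behavior rather than testing it on one example.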

DB: Do you mean I can take a library of linear and non-linear algorithms, supervised and unsupervised approaches, and basically score them according to how robust they are?

BR: Yes. We’ve only done this in some very simple cases so far, but we’re hoping to expand on this work. You can plug the algorithm into this framework, and we’ll give you back an analysis as to how fast it might converge or how much noise it can reject. Then you can tune this algorithm to improve some metric of interest.

DB: Control systems that might, for example, model airplane flight, don’t derive their parameters by studying millions of hours of flight in the way we understand a classical machine learning algorithm might. How do control theorists build their models in contrast to machine learning approaches?

BR: Control is very much about building reasonable models based on understanding how a system responds to different conditions. Air passes over a wing, which will create some kind of lift. They work from these physics models of aerodynamics and then they build a control system around that to make sure you actually fly in a straight line. Now, things get complicated when you add in turbulence, but rather than build a more complicated model of turbulence here, they model this as a “black-box” disturbance. Control theory aims to build policies that keep the plane up in the air as long as the black-box disturbance isn’t too extreme.

In machine learning, I would like to decide whether or not there’s a human in front of me if, for example, I’m a self-driving car. I might use a dictionary of 15 million images, some of them labeled with “human” and some of them labeled with “not human.” My model is derived from this huge data set rather than from physical principles about how humans present themselves in a scene. One of the guiding principles of machine learning is that if you give me all the data in the universe, then I can make any prediction you need. This is also one of its main conceits.

DB: Right. Turbulence is not predictable, but it is kind of predictable. It’s predictable insofar as how the plane is going to respond. So control systems are, in a way, more deterministic.

BR: Yes, exactly. Turbulence is exactly the idea of robustness. So, you can either apply a model to turbulence, or you can just look for the worst case outcome that can happen under turbulent behavior. The latter is much easier. That’s what robust control people do. You take your uncertainty, you try to put it in a box, and you say, “That’s what uncertainty looks like.”
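The idea of putting uncertainty in a box can be illustrated with a toy example. The sketch below assumes a scalar closed-loop system x_{k+1} = c * x_k + w_k, where the disturbance w_k is only known to satisfy |w_k| <= W. For |c| < 1 and a start at zero, the state can never exceed W / (1 - |c|), whatever the disturbance does. The system, the constants, and the function name are all hypothetical and chosen for illustration only.

```python
import random

def simulate(c, disturbances, x0=0.0):
    """Run x_{k+1} = c * x_k + w_k and return the largest |x_k| seen."""
    x = x0
    peak = abs(x)
    for w in disturbances:
        x = c * x + w
        peak = max(peak, abs(x))
    return peak

c = 0.5        # assumed closed-loop gain (stable because |c| < 1)
W = 1.0        # the "box": every disturbance satisfies |w| <= W
steps = 200
bound = W / (1.0 - abs(c))   # worst-case guarantee on |x_k|

rng = random.Random(0)
random_peak = simulate(c, [rng.uniform(-W, W) for _ in range(steps)])
adversarial_peak = simulate(c, [W] * steps)   # disturbance always pushes one way

print(bound, random_peak, adversarial_peak)
```

The constant disturbance drives the state up to the bound, while the random disturbance stays inside it. The guarantee needs no model of the disturbance beyond the size of the box.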

Now, you can build control systems without physical models. Look at what the guys at DeepMind are doing with video games. They are using techniques from reinforcement learning to outperform humans. In reinforcement learning, rather than building a model, you just play a lot of representative scenes to a control system, and you modify the controller after each interaction in such a way that you improve the performance. That’s how the machines learn to play Atari games. They just play it thousands and thousands and thousands of times and make a record of every possible thing you could do in this Atari game and then build a data-driven control policy from there. My colleague Pieter Abbeel and his students have recently made some remarkable progress using reinforcement learning and neural networks to learn locomotion and to enable robots to nimbly interact with real objects.
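To make "modify the controller after each interaction" concrete, here is a minimal sketch of tabular Q-learning, one standard reinforcement learning technique, on a made-up five-state corridor. It is not the deep-network approach used for Atari games; the environment, reward, and hyperparameters are assumptions picked so the example runs in a fraction of a second.

```python
import random

N_STATES = 5          # states 0..4 on a line; state 4 is the goal
ACTIONS = (-1, +1)    # index 0 = step left, index 1 = step right

def step(state, action_index):
    """Hypothetical environment: move along the line, reward 1 at the goal."""
    next_state = min(max(state + ACTIONS[action_index], 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

def q_learning(episodes=500, alpha=0.5, gamma=0.9, max_steps=200, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            # Explore uniformly at random; Q-learning can still learn the
            # greedy policy because it is an off-policy method.
            a = rng.randrange(2)
            next_state, reward, done = step(state, a)
            target = reward if done else reward + gamma * max(q[next_state])
            # The update after each interaction: nudge the estimate toward
            # the observed reward plus the discounted best future value.
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
            if done:
                break
    return q

q_table = q_learning()
greedy_policy = [max(range(2), key=lambda a: q_table[s][a])
                 for s in range(N_STATES - 1)]
print(greedy_policy)
```

No model of the corridor is ever written down. The policy that falls out of the learned table (always step right) comes purely from repeated play.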

DB: Is there a difference between how control theorists and machine learning researchers think about robustness and error?

BR: In machine learning, we almost always model our errors as being random rather than worst-case. In some sense, random errors are actually much more benign than worst-case errors. Let’s say you’re just going to add up a sequence of numbers. Each number is either one or minus one, and we’re going to sum up 20 of them. The worst-case sum, that is, the largest sum, is achieved when you set all of your choices equal to one. This gets you 20. But if you flip a coin to assign the ones and minus ones, on average the sum will be zero! And, more often than not, you’ll get something on the order of five. It will be consistently smaller. The odds of getting a 20 are about one in a million.
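The arithmetic in this example can be checked exactly. The sketch below enumerates the distribution of a sum of 20 fair coin flips valued at plus or minus one; the variable names are illustrative. The chance of hitting the worst case is 1 in 2^20 = 1,048,576, and the standard deviation is the square root of 20, roughly 4.5, which is where "on the order of five" comes from.

```python
from math import comb, sqrt

N = 20
# If k of the N flips are +1, the sum is k - (N - k) = 2*k - N,
# and k follows a Binomial(N, 1/2) distribution.
distribution = {2 * k - N: comb(N, k) / 2 ** N for k in range(N + 1)}

worst_case = N                      # all flips come up +1
p_worst = distribution[worst_case]  # probability of the worst case
mean = sum(s * p for s, p in distribution.items())
std_dev = sqrt(sum(s * s * p for s, p in distribution.items()) - mean ** 2)

print(worst_case, p_worst, mean, std_dev)
```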

In machine learning, by assuming average-case performance, rather than worst-case, we can design predictive algorithms by averaging out the errors over large data sets. We want to be robust to fluctuations in the data, but only on average. This is much less restrictive than the worst-case restrictions in controls.

DB: This ties back to your earlier point about average versus worst/best case.

BR: Exactly. It’s a huge deal. We don’t want to rely solely on worst-case analysis because that’s not going to reflect our reality. On the other hand, it would be good to have at least more robustness in our predictions and a little bit of understanding about how these are going to fare as our data varies and our data changes.

One example where my collaborators and I have been able to take advantage of randomness came in a study of Stochastic Gradient Descent (SGD). SGD is probably the most popular algorithm in machine learning, and is the foundation of how we train neural nets. Feng Niu, Chris Ré, Stephen Wright, and I were able to parallelize this algorithm by taking advantage of randomness. Feng, a grad student at the time, was experimenting with some strategies to parallelize SGD. Out of frustration, he turned off the locks in his parallel code. To our surprise, it just worked better. It worked a lot better. Basically, we started getting linear speedups.

In trying to explain that phenomenon, we formalized something called “HOGWILD!” — a lock-free approach to parallelizing stochastic gradient descent. In the worst case, the HOGWILD! approach would degrade performance. But because the errors are random, you get dramatic speedups in practice. People picked up on the idea and started implementing it. And for a lot of the state-of-the-art deep learning models, HOGWILD! became a go-to technique.
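The lock-free pattern can be sketched in a few lines. The toy below is an illustration of the idea only, not the authors' implementation: several threads apply SGD updates to one shared weight vector with no locks, and each update touches a single coordinate, so collisions are rare. The objective, the `optimum` vector, and all constants are assumptions. Because of Python's global interpreter lock, this version shows the update pattern but will not show the speedups described above.

```python
import random
import threading

DIM = 8
optimum = [float(j + 1) for j in range(DIM)]  # hypothetical minimizer
weights = [0.0] * DIM                         # shared state, never locked

def worker(seed, updates=2000, lr=0.1):
    rng = random.Random(seed)
    for _ in range(updates):
        j = rng.randrange(DIM)            # each sample touches one coordinate
        grad = weights[j] - optimum[j]    # gradient of 0.5 * (w_j - optimum_j)**2
        weights[j] -= lr * grad           # unsynchronized read-modify-write

threads = [threading.Thread(target=worker, args=(seed,)) for seed in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

max_error = max(abs(w - o) for w, o in zip(weights, optimum))
print(max_error)
```

An occasional lost or stale update behaves like a little extra noise in a process that is already noisy, which is why the result still lands on the optimum.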

DB: So, control theory is model-based and concerned with worst case. Machine learning is data based and concerned with average case. Is there a middle ground?

BR: I think there is! And I think there’s an exciting opportunity here to understand how to combine robust control and reinforcement learning. Being able to build systems from data alone simplifies the engineering process, and has produced several promising results recently. Guaranteeing that these systems won’t behave catastrophically will enable us to actually deploy machine learning systems in a variety of applications with major impacts on our lives. It might enable safe autonomous vehicles that can navigate complex terrains. Or it could assist us in diagnostics and treatments in health care. There are a lot of exciting possibilities, and that’s why I’m excited about how to find a bridge between these two viewpoints.


Editor’s note: Benjamin Recht is the co-host of Hardcore Data Science Day at the Strata + Hadoop World San Jose and New York conferences.
