Ilya Sutskever is a research scientist at Google and the author of numerous publications on neural networks and related topics. Sutskever is a co-founder of DNNresearch and was named Canada’s first Google Fellow.

Key Takeaways:

Humans solve perception problems very quickly even though our neurons are relatively slow, which suggests that moderately deep, large neural networks can let machines succeed in a similar fashion.
Unsupervised learning is still a mystery, but a full understanding of that domain has the potential to fundamentally transform the field of machine learning.
Attention models represent a promising direction for powerful learning algorithms that require ever less data to be successful on harder problems.

David Beyer: Let’s start with your background. What was the evolution of your interest in machine learning, and how did you zero in on your Ph.D. work?

Ilya Sutskever: I started my Ph.D. just before deep learning became a thing. I was working on a number of different projects, mostly centered around neural networks. My understanding of the field crystallized when collaborating with James Martens on the Hessian-free optimizer. At the time, greedy layer-wise training (training one layer at a time) was extremely popular. Working on the Hessian-free optimizer helped me understand that if you just train a very large and deep neural network on a lot of data, you will almost necessarily succeed.

Taking a step back, when solving naturally occurring machine learning problems, you use some model. The fundamental question is whether you believe that this model can solve the problem for some setting of its parameters. If the answer is no, then the model will not get great results, no matter how good its learning algorithm. If the answer is yes, then it’s only a matter of getting the data and training it. And this is, in some sense, the primary question. Can the model represent a good solution to the problem?

There is a compelling argument that large, deep neural networks should be able to represent very good solutions to perception problems. It goes like this: human neurons are slow, and yet humans can solve perception problems extremely quickly and accurately. If humans can solve useful problems in a fraction of a second, then you should only need a very small number of massively-parallel steps in order to solve problems like vision and speech recognition. This is an old argument — I’ve seen a paper on this from the early 80s.
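The argument above can be put as back-of-the-envelope arithmetic. The symbols and the numbers below are illustrative assumptions, not figures given in the interview: T_task stands for the time a person needs for a perception task, and t_neuron for the time one neuron needs to respond.

```latex
\text{sequential steps} \;\lesssim\; \frac{T_{\text{task}}}{t_{\text{neuron}}},
\qquad \text{e.g.} \quad \frac{0.1\ \text{s}}{0.01\ \text{s}} = 10
```

Under these assumed values, only about ten steps can happen one after another, so whatever else the computation involves must run in parallel within each step.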

This suggests that if you train a large, deep neural network with 10 or 15 layers on something like vision, then you could basically solve it. Motivated by this belief, I worked with Alex Krizhevsky toward demonstrating it. Alex had written an extremely fast implementation of 2D convolutions on a GPU, at a time when few people knew how to code for GPUs. We were able to train neural networks larger than ever before and achieve much better results than anyone else at the time.

Nowadays, everybody knows that if you want to solve a problem, you just need to get a lot of data and train a big neural net. You might not solve it perfectly, but you can definitely solve it better than you could have possibly solved it without deep learning.

DB: Not to trivialize what you’re saying, but you say throw a lot of data at a highly parallel system, and you’ll basically figure out what you need?

IS: Yes, but: although the system is highly parallel, it is its sequential nature that gives you the power. It’s true we use parallel systems because that’s the only way to make it fast and large. But if you think of what depth represents — depth is the sequential part.

And if you look at our networks, you will see that each year they are getting deeper. It’s amazing to me that these very vague, intuitive arguments turned out to correspond to what is actually happening. Each year the networks that do best in vision are deeper than they were before. Now we have networks with 25 layers of computation, or even more, depending on how you count.

DB: What are the open problems, theoretically, in making deep learning as successful as it can be?

IS: The huge open problem would be to figure out how you can do more with less data. How do you make this method less data-hungry? How can you input the same amount of data, but better formed?

This ties in with one of the greatest open problems in machine learning: unsupervised learning. How do you even think about unsupervised learning? How do you benefit from it? Once our understanding improves and unsupervised learning advances, this is where we will acquire new ideas, and see a completely unimaginable explosion of new applications.

DB: What’s our current understanding of unsupervised learning? And how is it limited in your view?

IS: Unsupervised learning is mysterious. Compare it to supervised learning. We know why supervised learning works. You have a big model, and you’re using a lot of data to define the cost — the training error — which you minimize. If you have a lot of data, your training error will be close to your test error. Eventually, you get to a low test error, which is what you wanted from the start.
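The supervised setup described above can be written down in standard notation. The symbols are assumptions introduced here for illustration: f_θ is the model with parameters θ, ℓ is a per-example loss, and (x_i, y_i) are the N labeled training examples.

```latex
\hat{E}_{\text{train}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(f_\theta(x_i),\, y_i\big),
\qquad
E_{\text{test}}(\theta) = \mathbb{E}_{(x,y)}\Big[\ell\big(f_\theta(x),\, y\big)\Big]
```

Training minimizes the first quantity. The claim in the answer is that, with enough data, the first quantity is a good stand-in for the second, so driving training error down also drives test error down.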

But I can’t even articulate what it is we want from unsupervised learning. You want something; you want the model to understand, whatever that means. Although we currently understand very little about unsupervised learning, I am also convinced that the explanation is right under our noses.

DB: Are you aware of any promising avenues that people are exploring toward a deeper, conceptual understanding of why unsupervised learning does what it does?

IS: There are plenty of people trying various ideas, mostly related to density modeling or generative models. If you ask any practitioner how to solve a particular problem, they will tell you to get the data and apply supervised learning. There is not yet an important application where unsupervised learning makes a profound difference.

DB: Do we have any sense of what success means? Even a rough measure of how well an unsupervised model performs?

IS: Unsupervised learning is always a means for some other end. In supervised learning, the learning itself is what you care about. You’ve got your cost function, which you want to minimize. In unsupervised learning, the goal is always to help some other task, like classification or categorization. For example, I might ask a computer system to passively watch a lot of YouTube videos (so unsupervised learning happens here), then ask it to recognize objects with great accuracy (that’s the final supervised learning task).

Successful unsupervised learning enables the subsequent supervised learning algorithm to recognize objects with accuracy that would not be possible without the use of unsupervised learning. It’s a very measurable, very visible notion of success. And we haven’t achieved it yet.

DB: What are some other areas where you see exciting progress?

IS: A general direction that I believe to be extremely important is: are learning models capable of more sequential computations? I mentioned how I think that deep learning is successful because it can do more sequential computations than previous (“shallow”) models. And so models that can do even more sequential computation should be even more successful because they are able to express more intricate algorithms. It’s like allowing your parallel computer to run for more steps. We already see the beginning of this, in the form of attention models.

DB: And how do attention models differ from the current approach?

IS: In the current approach, you take your input vector and give it to the neural network. The neural network runs it, applies several processing stages to it, and then gets an output. In an attention model, you have a neural network, but you run the neural network for much longer. There is a mechanism in the neural network, which decides which part of the input it wants to “look” at. Normally, if the input is very large, you need a large neural network to process it. But if you have an attention model, you can decide on the best size of the neural network, independent of the size of the input.

DB: So then, how do you decide where to focus this attention in the network?

IS: Say you have a sentence, a sequence of, say, 100 words. The attention model will issue a query on the input sentence and create a distribution over the input words, such that a word that is more similar to the query will have higher probability, and words that are less similar to the query will have lower probability. Then you take the weighted average of them. Since every step is differentiable, we can train the attention model where to look with backpropagation, which is the reason for its appeal and success.
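The steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in any particular system: similarity is taken to be a dot product, the distribution is a softmax, and the function names (`softmax`, `dot`, `attend`) and the toy vectors are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw similarity scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product, used here as the (assumed) similarity measure."""
    return sum(a * b for a, b in zip(u, v))

def attend(query, word_vectors):
    """Soft attention: score every word against the query,
    normalize the scores, then return the weighted average."""
    scores = [dot(query, w) for w in word_vectors]
    weights = softmax(scores)
    dim = len(word_vectors[0])
    context = [
        sum(wt * vec[i] for wt, vec in zip(weights, word_vectors))
        for i in range(dim)
    ]
    return weights, context

# Toy "sentence" of three word vectors (made-up numbers, not real embeddings).
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
weights, context = attend(query, words)
```

In this toy case the first and third words match the query equally well and receive equal weight, while the second word receives much less. Every operation is a multiplication, a sum, or an exponential, which is why the whole computation is differentiable with respect to the query and the word vectors.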

DB: What kind of changes do you need to make to the framework itself? What new code do you need to insert to support this notion of attention?

IS: Well, the great thing about attention, at least differentiable attention, is that you don’t need to insert any new code to the framework. As long as your framework supports element-wise multiplication of matrices or vectors, and exponentials, that’s all you need.

DB: So, attention models address the question you asked earlier: how do we make better use of existing power with less data?

IS: That’s basically correct. There are many reasons to be excited about attention. One of them is that attention models simply work better, allowing us to achieve better results with less data. Also, bear in mind that humans clearly have attention. It is something that enables us to get results. It’s not just an academic concept. If you imagine a really smart system, surely, it, too, will have attention.

DB: What are some of the key issues around attention?

IS: Differentiable attention is computationally expensive because it requires accessing your entire input at each step of the model’s operation. And this is fine when the input is a sentence that’s only, say, 100 words, but it’s not practical when the input is a 10,000-word document. So, one of the main issues is speed. Attention should be fast, but differentiable attention is not fast. Reinforcement learning of attention is potentially faster, but training attentional control using reinforcement learning over thousands of objects would be non-trivial.
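The scaling issue can be illustrated with a rough count of how many input items are read. This sketch assumes the model scores every input item once per step and uses an arbitrary, hypothetical step count of 50; both function names are made up for illustration.

```python
def soft_attention_reads(input_length, num_steps):
    """Differentiable attention scores every input item at every step,
    so the number of reads grows as input_length * num_steps."""
    return input_length * num_steps

def hard_attention_reads(num_steps, items_per_step=1):
    """Hypothetical hard attention that inspects only a few chosen items
    per step, so its reads do not depend on the input length."""
    return num_steps * items_per_step

# Assumed, illustrative step count; the input sizes come from the answer above.
steps = 50
sentence = soft_attention_reads(input_length=100, num_steps=steps)
document = soft_attention_reads(input_length=10_000, num_steps=steps)
sparse = hard_attention_reads(num_steps=steps)
```

Under these assumptions the 10,000-word document costs 100 times as many reads as the 100-word sentence, while the hard-attention count stays the same for both. The trade-off named in the answer is that choosing where to look in that way is no longer differentiable and has to be trained by other means, such as reinforcement learning.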

DB: Is there an analog, in the brain, as far as we know, for unsupervised learning?

IS: The brain is a great source of inspiration if looked at correctly. The question of whether the brain does unsupervised learning depends to some extent on what you consider to be unsupervised learning. In my opinion, the answer is unquestionably yes. Look at how people behave, and notice that people are not really using supervised learning at all. Humans never use any supervision of any kind. You start reading a book, and you understand it, and all of a sudden you can do new things that you couldn’t do before. Consider a child, sitting in class. It’s not like the student is given a lot of input/output examples. The supervision is extremely indirect; so, there’s necessarily a lot of unsupervised learning going on.

DB: Your work was inspired by the human brain and its power. How far does the neuroscientific understanding of the brain extend into the realm of theorizing and applying machine learning?

IS: There is a lot of value in looking at the brain, but it has to be done carefully, and at the right level of abstraction. For example, our neural networks have units that have connections between them, and the idea of using slow interconnected processors was directly inspired by the brain. But it is a faint analogy.

Neural networks are designed to be computationally efficient in software implementations rather than biologically plausible. But the overall idea was inspired by the brain, and was successful. For example, convolutional neural networks echo our understanding that neurons in the visual cortex have very localized receptive fields. This is something that was known about the brain, and this information has been successfully carried over to our models. Overall, I think there is value in studying the brain if done carefully and responsibly.

Post topics: Artificial Intelligence

Unsupervised learning, attention, and other mysteries