In this episode of the Data Show, I spoke with Soumith Chintala, AI research engineer at Facebook. Among his many research projects, Chintala was part of the team behind DCGAN (Deep Convolutional Generative Adversarial Networks), a widely cited paper that introduced a set of neural network architectures for unsupervised learning. Our conversation centered on PyTorch, the successor to the popular Torch scientific computing framework. PyTorch is a relatively new deep learning framework that is fast becoming popular among researchers. Like Chainer, PyTorch supports dynamic computation graphs, a feature that makes it attractive to researchers and engineers who work with text and time series.
Here are some highlights from our conversation:
The origins of PyTorch
TensorFlow addressed one part of the problem, which is quality control and packaging. It offered a Theano-style programming model, so it was a very low-level deep learning framework. ... There are a multitude of front ends that are trying to cope with the fact that TensorFlow is a very low-level framework: there's TF-slim, there's Keras. I think there's like 10 or 15, and just from Google there's probably like four or five of those.
On the Torch side, the philosophy has always been slightly different from Theano's. I see TensorFlow as a much better Theano-style framework, and on the Torch side we had a philosophy that we want to be imperative, which means that you run your computation immediately. Debugging should be butter smooth. The user should never have trouble debugging their programs, whether they use a Python debugger or something like GDB or something else.
... Chainer was a huge inspiration. PyTorch is inspired primarily by three frameworks. Within the Torch community, certain researchers from Twitter built an auxiliary package called Autograd, and this was actually based on a package called Autograd in the Python community. Chainer, Autograd, and Torch Autograd all used a technique called tape-based automatic differentiation: that is, you have a tape recorder that records what operations you have performed, and then it replays them backward to compute your gradients. This is a technique that is not used by any of the other major frameworks except PyTorch and Chainer. All of the other frameworks use what we call a static graph: the user builds a graph, then they give that graph to an execution engine that is provided by the framework, and the framework executes it. It can analyze it ahead of time.
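The "tape recorder" idea can be illustrated with a few lines of plain Python. This is only a minimal sketch of tape-based reverse-mode differentiation, not how PyTorch, Chainer, or Autograd are actually implemented; the names `Tape`, `Var`, and `backward` are hypothetical and exist only for this example, which supports just addition and multiplication on scalars.

```python
# Minimal tape-based reverse-mode autodiff (illustrative only;
# Tape, Var, and backward are hypothetical names, not a real library API).

class Tape:
    """Records each operation as it runs so it can be replayed backward."""
    def __init__(self):
        # Each record is (output Var, [(input Var, local gradient), ...]).
        self.records = []


class Var:
    """A scalar value that logs every operation applied to it on a tape."""
    def __init__(self, value, tape):
        self.value = value
        self.grad = 0.0
        self.tape = tape

    def __add__(self, other):
        out = Var(self.value + other.value, self.tape)
        # d(out)/d(self) = 1 and d(out)/d(other) = 1
        self.tape.records.append((out, [(self, 1.0), (other, 1.0)]))
        return out

    def __mul__(self, other):
        out = Var(self.value * other.value, self.tape)
        # d(out)/d(self) = other.value and d(out)/d(other) = self.value
        self.tape.records.append((out, [(self, other.value), (other, self.value)]))
        return out


def backward(output):
    """Replay the tape in reverse, accumulating gradients with the chain rule."""
    output.grad = 1.0
    for out, inputs in reversed(output.tape.records):
        for var, local_grad in inputs:
            var.grad += local_grad * out.grad


tape = Tape()
x = Var(3.0, tape)
y = Var(4.0, tape)

# The computation runs immediately, so z.value can be inspected
# (or stepped through in a debugger) right after this line.
z = x * y + x

backward(z)
print(z.value, x.grad, y.grad)  # 15.0 5.0 3.0
```

Because the tape is filled in while ordinary Python code executes, any loop or branch the program takes is simply whatever ends up recorded, which is what makes this style convenient for models whose structure depends on the input.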
These are two very different techniques. The tape-based differentiation gives you easier debuggability, and it gives you certain things that are more powerful (e.g., dynamic neural networks). The static graph-based approach gives you easier deployment to mobile, easier deployment to more exotic architectures, the ability to do compiler techniques ahead of time, and so on.
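For contrast, the static-graph workflow described above (build a graph, hand it to an execution engine, let the framework analyze it ahead of time) might be sketched as follows. This is a toy illustration under stated assumptions, not the design of TensorFlow or Theano; `Node`, `fold_constants`, and `execute` are hypothetical names, and constant folding stands in for the broader class of ahead-of-time compiler techniques.

```python
# Toy static-graph sketch: build first, analyze, then execute.
# All names here are hypothetical and chosen only for illustration.

class Node:
    def __init__(self, op, inputs=(), value=None, name=None):
        self.op = op          # "const", "placeholder", "add", or "mul"
        self.inputs = inputs  # child nodes, empty for leaves
        self.value = value    # used only by "const" nodes
        self.name = name      # used only by "placeholder" nodes


def const(value):
    return Node("const", value=value)

def placeholder(name):
    return Node("placeholder", name=name)

def add(a, b):
    return Node("add", (a, b))

def mul(a, b):
    return Node("mul", (a, b))


OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}


def fold_constants(node):
    """Ahead-of-time pass: precompute any subgraph whose inputs are all constants."""
    if node.op in ("const", "placeholder"):
        return node
    inputs = tuple(fold_constants(child) for child in node.inputs)
    if all(child.op == "const" for child in inputs):
        return const(OPS[node.op](*(child.value for child in inputs)))
    return Node(node.op, inputs)


def execute(node, feed):
    """The 'execution engine': evaluate the graph given placeholder values."""
    if node.op == "const":
        return node.value
    if node.op == "placeholder":
        return feed[node.name]
    return OPS[node.op](*(execute(child, feed) for child in node.inputs))


# Nothing is computed while the graph is being built.
x = placeholder("x")
graph = add(mul(const(2.0), const(3.0)), x)   # (2 * 3) + x

# The whole graph is visible before any data arrives, so it can be rewritten:
# the (2 * 3) subgraph becomes a single constant node.
optimized = fold_constants(graph)

result = execute(optimized, {"x": 4.0})
print(result)  # 10.0
```

The trade-off in the quote shows up even at this scale: the graph can be optimized or exported before it ever runs, but an error inside `execute` surfaces far from the line of user code that created the offending node.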
Deep learning frameworks within Facebook
Internally at Facebook, we have a unified strategy. We say PyTorch is used for all of research and Caffe2 is used for all of production. This makes it easier for us to separate out which team does what and which tools do what. What we are seeing is, users first create a PyTorch model. When they are ready to deploy their model into production, they just convert it into a Caffe2 model, then ship it to either mobile or another platform.
PyTorch user profiles
PyTorch has gotten its biggest adoption from researchers, and it's gotten a moderate response from data scientists. As we expected, we did not get any adoption from product builders because PyTorch models are not easy to ship to mobile, for example. We also have people who we did not expect to come on board, like folks from OpenAI and several universities.