The D.E. Shaw Supercomputer, Anton. (source: Matt Simmons on Flickr).

Specialists describe deep learning as akin to a rocket ship that needs a really big engine (a model) and a lot of fuel (the data) in order to go anywhere interesting. To get a better understanding of the issues involved in building compute systems for deep learning, I spoke with one of the foremost experts on this subject: Greg Diamos, senior researcher at Baidu. Diamos has long worked to combine advances in software and hardware to make computers run faster. In recent years, he has focused on scaling deep learning to help advance the state of the art in areas like speech recognition.

A big model, combined with big data, necessitates big compute, and at least at the bleeding edge of AI, researchers have gravitated toward high-performance computing (HPC) or supercomputer-like systems. Most practitioners use systems with multiple GPUs (or ASICs or FPGAs) and software libraries that make it easy to run fast deep learning models on top of them.

In keeping with the convenience versus performance tradeoff discussions that play out in many enterprises, there are other efforts that fall more in the big data, rather than HPC, camp. In upcoming posts, I’ll highlight groups of engineers and data scientists who are starting to use these techniques and are creating software to run them on existing software and hardware infrastructure common in the big data community.

What works as far as devising systems capable of training large deep learning models on big data? There are a few common patterns that have emerged in the deep learning research community:

  • Use dense compute hardware equipped with multiple GPUs (or other high-throughput parallel processors). While GPUs are notoriously hard to program, programming them for deep learning is somewhat easier, as the work tends to come down to big, dense linear algebra.

  • Deploy fast interconnects and combine them with software (like MPI) and algorithms that can take advantage of fast networks. One can also try to reduce the total amount of communication between nodes in a cluster by using algorithms like asynchronous SGD.

  • Take advantage of optimized libraries for algorithms and computations—such as linear algebra, FFT and convolutions—required in deep learning. Nvidia and Intel have released some open source libraries for this purpose, but many research groups have also developed their own tools.

  • Consider using specialized IO systems that can keep up with the volume of random reads required in large deep learning workloads.

  • One can also try using lower precision while training models (an active research area) or reducing the size of models (research shows compression and regularization work well after a model has already been trained).
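
To make the communication pattern in the second bullet concrete, here is a minimal sketch of synchronous data-parallel SGD, simulated in a single Python process. It is an illustration under stated assumptions, not how Diamos or Baidu implement training: the model is a toy linear regression, the function names (local_gradient, all_reduce_mean, train) are hypothetical, and all_reduce_mean merely stands in for a collective operation such as MPI's Allreduce running over a fast interconnect.

```python
# Toy, single-process simulation of synchronous data-parallel SGD.
# All names are hypothetical; on a real cluster, all_reduce_mean would be
# replaced by a collective such as MPI's Allreduce over a fast interconnect.

def local_gradient(w, b, shard):
    """Mean-squared-error gradient for y ~ w*x + b on one worker's shard."""
    n = len(shard)
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in shard) / n
    grad_b = sum(2.0 * (w * x + b - y) for x, y in shard) / n
    return grad_w, grad_b

def all_reduce_mean(grads):
    """Stand-in for an all-reduce: average each gradient component across workers."""
    k = len(grads)
    return tuple(sum(g[i] for g in grads) / k for i in range(len(grads[0])))

def train(data, num_workers=4, lr=0.05, steps=500):
    # Round-robin split; shards are equal-sized when len(data) % num_workers == 0.
    shards = [data[i::num_workers] for i in range(num_workers)]
    w, b = 0.0, 0.0
    for _ in range(steps):
        # On a real cluster, each worker would compute its gradient in parallel.
        grads = [local_gradient(w, b, s) for s in shards]
        # The only communication per step: combine the per-worker gradients.
        grad_w, grad_b = all_reduce_mean(grads)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic, noise-free data drawn from y = 3x + 1.
data = [(x / 10.0, 3.0 * (x / 10.0) + 1.0) for x in range(-20, 20)]
w, b = train(data)
print(f"w = {w:.4f}, b = {b:.4f}")
```

With equal-sized shards, the averaged gradient matches the gradient over the full data set, so the workers behave like one large machine at the cost of one gradient exchange per step. That exchange is why fast interconnects matter. Asynchronous SGD relaxes the scheme by letting workers apply updates without waiting for one another.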


In our interview, Diamos chimed in on the question of whether it’s possible to build an ASIC that's more efficient than a GPU:

It's an interesting question. People who design hardware are still wrestling with this. I don't think that there's a clear answer on it. I personally think it's possible to build something that's better than a GPU, but it requires a lot of forward-looking research technologies to materialize, many of them related to processes like manufacturing. So, I think it's just a long shot right now to build an ASIC for deep learning. It doesn't mean that people aren't trying. I really hope they are successful because it would have a very high impact.

… I think you'd have to be very creative. It's not really obvious to people what an ASIC would look like that would be better than a GPU. I think people have a few main ideas. One of the big ideas floating around is reduced precision. The problem with that is that GPUs are also adding support for that. The advantage of an ASIC over a GPU is just shrinking as time goes on.

Another piece of this is just that hardware design has gotten really complicated and really expensive. If you have a great idea for an ASIC, there's a huge capital investment. There's a huge amount of risk because there are so many different technologies that have to be successfully executed to compete with a very fast processor running in a bleeding edge process, like 14 nanometer, 10 nanometer. The combinations of all of these things make it very risky, even if you do have a good idea.

There have been a couple of proposals in the research community about how to build an ASIC. The one I like the best is based around 3D integration, in the sense of gluing memories, in a really tightly connected way, to processors. The problem with this is that it only really gives you a big advantage if you have a process that can support it, and, as far as I know, that type of technology is extremely expensive. Think of it like $1 billion in new investments and multiple years away.

Related resources: