Network structure (source: Denys Nevozhai on Unsplash)

Deep learning is a class of machine learning (ML) algorithms inspired by the human brain. Also called neural networks, these algorithms are especially good at detecting patterns across both noisy data and data that was once completely opaque to machines. While the technical details of neural nets may thrill mathematics and computer science Ph.D.s, the technology’s real significance has a much broader appeal. It represents one more step toward truly self-learning machines.

Not surprisingly, this new wave of algorithms has captured attention with applications that range from machine translation to self-driving cars. Enterprises—and not just web-scale digital giants—have begun to use it to solve a wide variety of problems. Early adopters are demonstrating high-impact business outcomes in fraud detection, manufacturing performance optimization, preventative maintenance, and recommendation engines. It’s becoming clear that these new machine-intelligence-powered initiatives have the potential to redefine industries and establish new winners and losers in the next five years.

Though a custom deep learning framework can provide significant value, building one comes with unique challenges. This article will address some of the hurdles that enterprises will face as they develop this technology, approaches to overcoming them, and other considerations for building and maintaining a deep learning program. Specifically, we will explore:

  • Deep learning’s specialized hardware and software needs (e.g., GPUs)
  • New approaches to model interpretability
  • Considerations for building a data platform that can service deep learning
  • Automation while choosing, testing, and promoting deep learning models
  • Challenges and requirements for deep learning in production
  • The need for enterprise-grade expertise

Deep learning requires powerful processing

One of the challenges of using deep learning is the fact that the models—which sometimes run on a scale of millions of nodes—are computationally intensive, and training them efficiently requires specialized hardware and software resources.

Currently, the best-in-class option for training deep learning models is the GPU, or graphics processing unit. These specialized electronic circuits were developed in the gaming industry and are particularly well suited to the parallel floating-point computations that deep learning requires.

This hardware is a significant step forward from CPUs: models that would otherwise take months to train can be trained in weeks on GPUs. However, working with GPUs can be challenging, as their architectures and compute frameworks differ substantially from those of CPU-only systems.

GPUs require significant engineering to optimize the software and ensure efficient parallelism, manageability, reliability, and portability. They must also be integrated with the rest of the analytical ecosystem, as some learning will happen within both the CPU and the GPU architectures. Scaling models over GPUs can be tricky, and doing so requires a blueprint that intelligently routes traffic so the architecture is used efficiently.
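One common way to spread training across several devices is data parallelism, a pattern this article does not name explicitly. The sketch below is a minimal illustration under stated assumptions: worker threads stand in for GPUs, the model is a one-parameter linear fit, and the function names (`mse_gradient`, `data_parallel_gradient`) are hypothetical. It shows only the core idea that each device computes a gradient on its own shard and the results are averaged.

```python
from concurrent.futures import ThreadPoolExecutor


def mse_gradient(w, shard):
    """Gradient of mean squared error for the model y_hat = w * x on one shard."""
    n = len(shard)
    return sum(2.0 * (w * x - y) * x for x, y in shard) / n


def split_evenly(batch, num_workers):
    """Split a batch into equal-size shards, one per (hypothetical) device."""
    if len(batch) % num_workers != 0:
        raise ValueError("batch size must be divisible by the number of workers")
    size = len(batch) // num_workers
    return [batch[i * size:(i + 1) * size] for i in range(num_workers)]


def data_parallel_gradient(w, batch, num_workers):
    """Compute one gradient per shard in parallel, then average them."""
    shards = split_evenly(batch, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        local_grads = list(pool.map(lambda shard: mse_gradient(w, shard), shards))
    # With equal-size shards, the mean of shard gradients equals the full-batch gradient.
    return sum(local_grads) / num_workers


# Toy data generated from y = 3x; the current weight guess is 0.5.
batch = [(float(x), 3.0 * x) for x in range(1, 9)]
w = 0.5
parallel = data_parallel_gradient(w, batch, num_workers=4)
single = mse_gradient(w, batch)
print(parallel, single)
```

In a real deployment the averaging step is where much of the engineering effort goes, since gradients have to move between devices without leaving any of them idle.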

Approaching model interpretability with LIME

Aside from their intensive computational needs, another unique challenge of using neural nets is their occasional inscrutability. Neural nets use hidden layers that squirrel away information the machine uses to make decisions. Deep learning models are functionally black boxes, as it is nearly impossible to get a feeling for their inner workings. This raises the question of trust, and in some industries interpretability is not negotiable.

For example, financial institutions in Europe must remain in compliance with the EU’s General Data Protection Regulation (GDPR), which levies heavy financial penalties for companies that cannot explain how a customer’s data was used. In this case, it is not possible—nor is it legal—to tell a customer that their financial transaction was declined simply because the model said so. Even beyond matters of regulatory compliance, stakeholders often need to be told how a decision was reached in order to support their actions.

Though the problem is far from solved, enterprises are using several approaches to address model interpretability. One is Local Interpretable Model-Agnostic Explanations (LIME), an open source technique that came out of research at the University of Washington. LIME sheds light on the specific variables that drove the model’s output for an individual decision and presents that information in a human-readable way. In the case of fraud, knowing this information can provide security from a regulatory standpoint as well as help the business understand how and why fraud is happening.
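The idea behind LIME can be sketched in a few dozen lines. What follows is a simplified illustration, not the LIME library itself: it perturbs one instance, weights the perturbed samples by their proximity to it, and fits a weighted linear surrogate whose coefficients indicate which variables mattered locally. The black-box `fraud_score` function, the feature names, the Gaussian perturbation, and the kernel settings are all assumptions made for the example.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def explain_locally(predict, instance, feature_names,
                    num_samples=500, scale=0.5, kernel_width=1.0, seed=0):
    """Fit a proximity-weighted linear surrogate around one instance."""
    rng = random.Random(seed)
    rows, targets, weights = [], [], []
    for _ in range(num_samples):
        sample = [v + rng.gauss(0.0, scale) for v in instance]
        dist_sq = sum((s - v) ** 2 for s, v in zip(sample, instance))
        weights.append(math.exp(-dist_sq / kernel_width ** 2))  # closer samples count more
        rows.append([1.0] + sample)  # leading 1.0 is the intercept column
        targets.append(predict(sample))
    k = len(instance) + 1
    # Weighted normal equations: (X^T W X) beta = X^T W y
    xtwx = [[sum(w * r[i] * r[j] for w, r in zip(weights, rows)) for j in range(k)]
            for i in range(k)]
    xtwy = [sum(w * r[i] * t for w, r, t in zip(weights, rows, targets))
            for i in range(k)]
    beta = solve_linear_system(xtwx, xtwy)
    # Drop the intercept and rank features by the size of their local effect.
    return sorted(zip(feature_names, beta[1:]), key=lambda item: -abs(item[1]))


def fraud_score(x):
    """Hypothetical black-box model: the score ignores the weekday feature."""
    amount, account_age, weekday = x
    return 1.0 / (1.0 + math.exp(-(2.0 * amount - 1.0 * account_age)))


explanation = explain_locally(fraud_score, [1.0, 0.5, 3.0],
                              ["amount", "account_age", "weekday"])
for name, coefficient in explanation:
    print(f"{name}: {coefficient:+.3f}")
```

The ranked coefficients are the human-readable part: they say which inputs pushed this particular score up or down, without claiming anything about the model’s behavior elsewhere.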

Innovation is occurring at a rapid pace as researchers work to address the interpretability issues and hardware demands of deep learning. But even with these drawbacks, the gains of using this technology in the enterprise can be significant. Before deploying models, however, your organization must have the right data platform in place.

Building a data foundation that can service deep learning

Investing in a robust data and analytics foundation is the first step of a deep learning project. In fact, the project’s success is dependent on the data, which must be clean as well as highly available and reliable. Stale, incomplete, or inaccurate data leads to incorrect model predictions, which can get expensive and could derail an entire project.

Though this work is not as exciting as other parts of deep learning, the majority of a project’s effort will be spent here: getting access to the data, making sure it’s the right kind, fixing any problems with accuracy, and developing the systems that will support the models in the live environment.

Once models are in production, you will need to solve the problem of data integration in real time. Feeds of streaming data must be highly available and reliable, and have low latency for feature calculations. At the same time, batch feeds will require massive scale support, integration with data pipelines, and storage.

The system must also be able to iterate rapidly. Feature preparation needs to be in sync with the model training, including the same logic, latency, and forward compatibility. And for all data feeds and features, you must ensure visibility and traceability, integrating data quality and governance with monitoring.
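One way to keep feature preparation in sync is to route both the training pipeline and the live scoring path through a single feature function. The sketch below assumes a hypothetical transaction record with `amount` and `timestamp` fields; the feature names and the version tag are illustrative, not taken from any particular platform.

```python
import math
from datetime import datetime

FEATURE_VERSION = 1  # hypothetical tag so models can check which logic built their inputs


def build_features(transaction):
    """Single source of truth for feature logic, shared by training and serving."""
    ts = datetime.fromisoformat(transaction["timestamp"])
    return {
        "feature_version": FEATURE_VERSION,
        "log_amount": math.log1p(transaction["amount"]),
        "hour_of_day": ts.hour,
        "is_weekend": int(ts.weekday() >= 5),
    }


def training_rows(historical_transactions):
    """Batch path: build features for every historical record."""
    return [build_features(t) for t in historical_transactions]


def serving_features(live_transaction):
    """Real-time path: build features for one incoming record."""
    return build_features(live_transaction)


tx = {"amount": 120.0, "timestamp": "2018-03-03T14:30:00"}
print(serving_features(tx))
```

Because both paths call the same function, a change to the logic reaches training and serving together, which removes one common source of silent mismatches.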

Increasingly, enterprise data is spread across a hybrid cloud environment and different storage formats. Connections must be made between data residing in public clouds, on-premises data, and data that persists in different kinds of object and file storage.

Though the challenges are many, you can manage them by developing systems to monitor the data continuously so the teams that are working on the project know where the data came from, how to recreate it, and whether or not it’s current. Once this data foundation and its monitoring systems are in place, you’ll be able to leverage it for deep learning as well as extend its use to other domains.
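As a rough illustration of what such monitoring might record, the sketch below attaches three pieces of metadata to a dataset: where it came from, how to recreate it, and when it was last refreshed. The `DatasetRecord` class, its field names, and the example values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DatasetRecord:
    name: str
    source: str              # where the data came from
    recreate_command: str    # how to rebuild it from that source
    refreshed_at: datetime   # when it was last rebuilt
    max_age: timedelta       # how old it may get before it counts as stale

    def is_current(self, now):
        """Return True while the data is within its allowed age."""
        return now - self.refreshed_at <= self.max_age


record = DatasetRecord(
    name="card_transactions_features",
    source="warehouse.transactions (hypothetical table)",
    recreate_command="build_features --date 2018-03-01 (hypothetical job)",
    refreshed_at=datetime(2018, 3, 1, 6, 0),
    max_age=timedelta(hours=24),
)

same_day = datetime(2018, 3, 1, 18, 0)
two_days_later = datetime(2018, 3, 3, 6, 0)
print(record.is_current(same_day), record.is_current(two_days_later))
```

A training or scoring job could refuse to run when `is_current` returns False, which turns stale data into a visible failure rather than a quiet source of bad predictions.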

Automation while choosing and training deep learning models

Most, if not all, of the software frameworks used in deep learning are open source projects, free for anyone to download and experiment with. Of these, TensorFlow, which Google open-sourced in 2015, is the market leader.

Many different neural net architectures can be run on top of these deep learning frameworks, such as feedforward networks, generative adversarial networks, deep belief networks, and deep convolutional networks. It’s nearly impossible to keep an updated list since new types of deep learning models continue to appear at a stunning pace. Depending on your use case, there may be best practices for choosing specific architectures. However, there is no substitute for testing. Deep learning is an experimental science, not a theoretical one.

Once trained and proven via an automatic process (analytic ops), the models should be promoted to a shadow production environment where they can be adjusted or retrained. Leveraging the analytic ops process methodology also gives stakeholders a chance to become familiar with the models before they operate autonomously in a live environment.
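A shadow environment can be pictured as a wrapper that sends every request to both the live model and the candidate, returns only the live answer, and records how often the two agree. The sketch below is one possible shape for that wrapper; the `ShadowDeployment` class and the two threshold rules standing in for models are assumptions for illustration.

```python
class ShadowDeployment:
    """Serve the live model while recording what the candidate would have said."""

    def __init__(self, live_model, candidate_model):
        self.live_model = live_model
        self.candidate_model = candidate_model
        self.comparisons = []

    def predict(self, features):
        live = self.live_model(features)
        try:
            shadow = self.candidate_model(features)
        except Exception:
            shadow = None  # a failing candidate must never affect the live response
        self.comparisons.append((features, live, shadow))
        return live  # callers only ever see the live model's answer

    def agreement_rate(self):
        """Fraction of requests on which both models gave the same answer."""
        if not self.comparisons:
            return None
        matches = sum(1 for _, live, shadow in self.comparisons if live == shadow)
        return matches / len(self.comparisons)


# Hypothetical stand-ins for models: flag a transaction above a threshold.
def live_model(features):
    return features["amount"] > 1000


def candidate_model(features):
    return features["amount"] > 800


deployment = ShadowDeployment(live_model, candidate_model)
served = [deployment.predict({"amount": a}) for a in (500, 900, 1500, 2000)]
print(served, deployment.agreement_rate())
```

Reviewing the recorded disagreements is also a practical way for stakeholders to get familiar with the candidate before it makes any real decisions.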

Considerations for deep learning models in production

As with all other types of machine learning models, the lifecycle of a deep learning model, from development to testing to pre-production and finally to production, requires monitoring and automatic retraining. In some cases, you should also be able to deploy gradually from pre-production to production (often done via A/B testing frameworks).

Some consideration should also be given to retraining strategies. In some cases, traditional machine learning models can be retrained more quickly than deep learning models, particularly when the deep learning model has been trained on vast amounts of data and new data will not add much variance (e.g., a model trained on billions of images of people and cars). It’s also important to check, via live testing, whether the model’s predictions match expectations based on human domain knowledge. When they do not, an auto-retrain job should start, again via the analytic ops process.
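A retraining trigger of this kind can be as simple as a rolling comparison between predictions and the outcomes that domain experts expected. The sketch below is a minimal version under assumed settings: the `RetrainMonitor` class, the window size, and the accuracy threshold are hypothetical, and a real system would hand the trigger to whatever job scheduler the analytic ops process uses.

```python
from collections import deque


class RetrainMonitor:
    """Track recent prediction quality and signal when retraining is due."""

    def __init__(self, window_size, min_accuracy):
        self.outcomes = deque(maxlen=window_size)  # keeps only the most recent results
        self.min_accuracy = min_accuracy

    def record(self, predicted, expected):
        self.outcomes.append(predicted == expected)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def should_retrain(self):
        # Wait for a full window so a few early misses do not trigger a job.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.accuracy() < self.min_accuracy


monitor = RetrainMonitor(window_size=5, min_accuracy=0.8)
for _ in range(5):
    monitor.record(predicted="approve", expected="approve")
healthy = monitor.should_retrain()

# Two recent misses push the rolling accuracy down to 3 out of 5.
monitor.record(predicted="approve", expected="decline")
monitor.record(predicted="approve", expected="decline")
degraded = monitor.should_retrain()
print(healthy, degraded, monitor.accuracy())
```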

For example, as a recommendation engine presents shoppers with different options, there needs to be a mechanism in place to monitor it to make sure shoppers are responding positively. At the same time, you should also be able to deploy a new engine over a percentage of the total data and compare its performance with that of the other one in real time.
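Routing a fixed percentage of traffic to a new engine is often done by hashing a stable identifier, so that each shopper consistently sees the same variant. The sketch below shows that approach using Python’s standard `hashlib`; the function name, the experiment label, and the 10% share are assumptions for illustration.

```python
import hashlib


def assign_variant(user_id, candidate_share, experiment="rec-engine-v2"):
    """Deterministically place a user in 'candidate' or 'control'."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "candidate" if bucket < candidate_share else "control"


users = [f"user-{i}" for i in range(10000)]
assignments = {u: assign_variant(u, candidate_share=0.10) for u in users}
candidate_fraction = sum(v == "candidate" for v in assignments.values()) / len(users)
print(f"share routed to the new engine: {candidate_fraction:.3f}")
```

Including the experiment label in the hash means a later experiment reshuffles the buckets, so the same shoppers are not always the ones who receive the untested engine.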

None of this is easy. In fact, production is where many deep learning projects die as data science experiments due to unforeseen complexities in scaling and management. Because there are so many pitfalls, it’s important to have a team that is familiar with the challenges of working with deep learning in production. Unfortunately, people with that kind of knowledge are (currently) hard to find outside of places like Google and Facebook.

The talent deficit

Deep learning expertise is scarce and expensive. While many smart people have been able to educate themselves on neural networks and experiment with models using cloud APIs, it is difficult to find engineers who have experience with deploying deep learning at scale in an enterprise setting. In a recent Forbes article on AI, Diego Klabjan put it well when he said, “AI development has a very small talent pool, and it would be difficult to get that kind of brain trust in one organization at an affordable, sustainable rate.”

This will change as the field advances and as deep learning proves its value across more industries and use cases. In the meantime, however, one way to overcome this knowledge gap is by working with an experienced partner who knows what kinds of errors to look for. Though it might be tempting to wait until the field matures, doing so may result in falling behind.

Leveraging deep learning for transformational change

Deploying deep learning is different from adopting other kinds of software. It potentially involves the automation of decision-making at scale, and it can be disruptive, requiring you to rethink processes that were engineered prior to its deployment.

This is as it should be, as deep learning is much more than an analytic add-on to business as usual. These data products must become integral parts of the business, allowing companies to drive organizational change by harnessing the power of their data and acting on it automatically.

Enterprises that successfully deploy deep learning will see dividends in safer products, happier customers, more efficient operations, and in dozens of other use cases as the field continues to mature. Deploying it takes thoughtful (and significant) investment, cross-functional collaboration, and plenty of testing, but the results are worth it. If the enterprise is ready, deep learning can be transformative.

This post is a collaboration between Teradata and O’Reilly. See our statement of editorial independence.
