Lessons learned turning machine learning models into real products and services

Why model development does not equal software development.

By David Talby
June 5, 2018

Artificial intelligence is still in its infancy. Today, just 15% of enterprises are using machine learning, but double that number already have it on their roadmaps for the upcoming year. With public figures like Intel’s CEO stating that every company needs a machine learning strategy or risks being left behind, it’s just a matter of time before machine learning enters your organization, too—if it hasn’t already.

However, CEOs looking to implement machine learning in their organizations tend to describe a common problem: moving machine learning from science to production. In other words, “The gap between ambition and execution is large at most companies,” as put by the authors of an MIT Sloan Management Review article. Ultimately, there’s a major difference between building a model and actually getting it ready for people to use in their products and services.

Data science bootcamps are great for learning how to build and optimize models, but they don’t teach engineers how to take them to the next step. The end result is a bottleneck where models are being built that aren’t being turned into revenue-generating products and services. So what should an organization keep in mind before implementing a machine learning solution?

Models degrade in accuracy as soon as they are put in production

The biggest mistake people make with regard to machine learning is thinking that the models are just like any other type of software. Once a model is built and goes live, people assume it will continue working as normal. However, while machine learning is designed to get smarter over time, models will actually degrade in quality, and fast, without a constant feed of new data. This phenomenon, known as concept drift, means that the predictions offered by static machine learning models become less accurate, and less useful, as time goes on. In some cases, this can even happen in a matter of days.

As such, organizations need to recognize there is never a final version of a machine learning model, and that it will need to be updated and improved over time. This requires organizations to keep engineers on projects even after the models are built to ensure not only that the models stay live, but also that they stay accurate. While big data and machine learning engineers are in high demand, and thus expensive, they are important because they are the ones responsible for regularly retraining the models to provide accurate predictions and recommendations. Some of this work can be automated, but doing so still requires expertise and custom development.

So how often should models be retrained? It depends on what they’re predicting. In areas like cybersecurity or real-time trading, for example, where change is constant, models may need to be updated continuously. On the other hand, voice recognition or other physical models can be retrained less frequently because their inputs generally don’t change over time.

However, no matter what the models are predicting, some level of retraining is necessary, as there are always unforeseen external changes that can affect the accuracy of machine learning models: shifts in people’s preferences; marketing campaigns; competitor moves; the weather; the news cycle; or the locations, time, or device models from which the models are used. Therefore, it is critical for organizations to know their levels of accuracy in production by setting up online feedback and accuracy measurements that are as important as monitoring servers and application health.
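As an illustration of the accuracy measurement described above, here is a minimal sketch in Python. It assumes that the true outcome for each prediction eventually becomes available; the class name, window size, and alert threshold are hypothetical choices for the example, not recommendations.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Tracks accuracy over the most recent labeled predictions.

    Hypothetical helper: window_size and alert_threshold are
    illustrative defaults and should be tuned per application.
    """

    def __init__(self, window_size=500, alert_threshold=0.85):
        # Each entry is 1 for a correct prediction, 0 for a wrong one.
        # deque(maxlen=...) silently drops the oldest entry when full.
        self.outcomes = deque(maxlen=window_size)
        self.alert_threshold = alert_threshold

    def record(self, predicted, actual):
        """Store whether one production prediction matched the real outcome."""
        self.outcomes.append(1 if predicted == actual else 0)

    def accuracy(self):
        """Return accuracy over the current window, or None if it is empty."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_attention(self):
        """True when windowed accuracy has fallen below the alert threshold."""
        current = self.accuracy()
        return current is not None and current < self.alert_threshold
```

A check like `needs_attention()` could feed the same alerting channel a team already uses for server and application health, so that a drop in accuracy is treated like any other production incident.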

The exact same model can rarely be deployed twice

Another major consideration before turning machine learning models into production-grade products and services is that models often need to be localized. In other words, a model that works for one geographical location might not work for another. Demographics, languages, and tastes can vary across geographies, which is something that must be carefully taken into consideration for the models to work effectively.

Sometimes, the need to localize models is obvious. A model that recommends what sports programs to watch, for example, would need to consider that the Super Bowl is huge in the U.S., the Clasico soccer match is huge in Spain, while other countries hold their collective breath during the Cricket World Cup. But the need to localize models can be less obvious, too. Models that predict a patient’s risk of returning to a hospital within 30 days of being discharged, for example, can be very different, even across hospitals within the same city, if those hospitals serve a different part of the population, accept different insurance plans, or focus on different medical specialties.

Localizing models applies to more than different geographies. Machine learning models are designed for very specific audiences, and accordingly, companies should test and measure their accuracy on different demographics to decide if and how they should be tweaked. Models designed for one group of people rarely work when applied on a larger scale. For this reason, companies need to have a deep understanding of the data and assumptions that are used to build the models, and adjust them as needed.
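One way to test and measure accuracy on different demographics is to break a single accuracy number down by segment. The following is a minimal sketch in Python; the function name and the record layout (segment, predicted value, actual value) are assumptions made for the example.

```python
from collections import defaultdict


def accuracy_by_segment(records):
    """Compute accuracy separately for each segment.

    records: iterable of (segment, predicted, actual) tuples, where
    segment is any hashable label such as a region or an age band.
    Returns a dict mapping each segment to its accuracy in [0, 1].
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for segment, predicted, actual in records:
        total[segment] += 1
        if predicted == actual:
            correct[segment] += 1
    return {segment: correct[segment] / total[segment] for segment in total}


# Made-up records, for illustration only.
example = [
    ("region_a", 1, 1),
    ("region_a", 0, 0),
    ("region_b", 1, 0),
    ("region_b", 1, 1),
]
print(accuracy_by_segment(example))
```

A model can look acceptable overall while performing much worse for one group, which is exactly the situation an aggregate metric hides.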

Failing to account for these differences can result in biased models that not only produce poor results, but may even cause a public relations disaster. Take the example of Google, whose facial recognition software confused black people with gorillas. Or personal assistants that work better with men than with women. Reusing models in health care can be a reputation hazard, too, even though human biology doesn’t change overnight. When doing anything consumer-facing, there is a need to consider these demographic differences, not just to ensure accurate results, but more importantly, to avoid creating new biases or perpetuating existing ones within society.

Measuring the online accuracy of a model—i.e., how it’s actually performing in production—is very tricky, and even the industry’s most experienced teams can get it wrong. Picking the right metric and test set to measure on requires a combination of math, business, product, technical, and ethical considerations that go beyond what each individual member of the team usually possesses. Since the issues only appear in production and only for a certain subset of users, they are “immune” to traditional forms of software testing and model validation.

Often, the real modeling work starts in production

Unlike most things, it’s easier to get started with machine learning than it is to keep going with it. In fact, building a machine learning model really isn’t too difficult—any junior data scientist or developer can do it with a good set of training data and the right tools. The hardest part of machine learning today is actually deploying and maintaining accurate models, as it requires constant access to new data to update them and improve their accuracy. In many scenarios, this data can only come once the initial models have made their way into the hands of customers.

Once customers start using a company’s machine learning models, the models are no longer using training data, but are making predictions using real data. As more and more customers begin to use a machine learning product or service, the potential to learn from customers’ feedback and their real-world data grows with them. This ultimately allows companies to continue building and improving their models after they are in the hands of customers, unlike traditional software, which mostly sees minor bug fixes or infrequent upgrades after deployment.

In many use cases, the customers or competitors who are impacted by the new model will change their behavior to circumvent its predictions. This happens in models that predict fraud, multi-party competitive scenarios like online ad bidding or algo-trading, and in cybersecurity. A more recent development is directly attacking machine learning models by distorting inputs so that the models misclassify them. This is giving rise to a growing emphasis on model robustness against adversaries. Such applications highlight another reason why machine learning models degrade over time: deploying a model in a live environment inevitably changes that environment and invalidates the assumptions of the initial model in the process.

For companies, this is an important consideration to keep in mind with regard to their cost structures. Since most of the work—and the most demanding work—is done post-deployment, there is a critical need to keep the most able data scientists on the project after the models are in production. This can result in heavy, and sometimes unplanned, expenses for companies, so it is something that needs to be considered well beforehand. With this in mind, companies should be sure to set aside sufficient budget, talent, and time—and plan on the bulk of the effort to come after their software has already been released.

Tools exist to help deploy, measure, and secure models

All of these issues stem from the fact that, while software engineers as an industry have gotten a whole lot better at operating production apps and services, there is still little experience with operating machine learning solutions. A lot of focus today is still on training people to build models, while the major challenges actually come afterward.

Once a model is built, software engineers need to make it accessible through some API for use inside actual products and services. Then, they must have a way to constantly measure the accuracy of the model, and collect and act on user feedback for improvements. There’s also the question of deploying updated versions of these models and telling users why they should use them. There are real needs as well for continuous integration, continuous deployment, change management, monitoring, and security tools and controls that are specific to machine learning systems.
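To make the first two steps concrete, here is a minimal sketch in Python of a wrapper that serves predictions and collects feedback on them. Everything in it is hypothetical: the `ModelService` class, its method names, and the in-memory log stand in for whatever API layer and storage a real system would use.

```python
import uuid


class ModelService:
    """Wraps a prediction function so every call is logged and can be
    matched later with the real outcome.

    Hypothetical sketch: a production system would put this behind a
    web API and write to durable storage instead of a dict.
    """

    def __init__(self, predict_fn, model_version):
        self.predict_fn = predict_fn        # any callable: features -> prediction
        self.model_version = model_version  # recorded with each prediction
        self.log = {}                       # prediction id -> logged entry

    def predict(self, features):
        """Return a prediction plus an id the caller can use for feedback."""
        prediction_id = str(uuid.uuid4())
        prediction = self.predict_fn(features)
        self.log[prediction_id] = {
            "features": features,
            "prediction": prediction,
            "model_version": self.model_version,
            "actual": None,  # filled in once the real outcome is known
        }
        return {
            "id": prediction_id,
            "prediction": prediction,
            "model_version": self.model_version,
        }

    def feedback(self, prediction_id, actual):
        """Attach the observed outcome to an earlier prediction."""
        self.log[prediction_id]["actual"] = actual

    def labeled_examples(self):
        """Return (features, actual) pairs that have received feedback."""
        return [
            (entry["features"], entry["actual"])
            for entry in self.log.values()
            if entry["actual"] is not None
        ]
```

Recording the model version with each prediction makes it possible to compare versions after an update, and the labeled examples can serve both accuracy measurement and later retraining.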

While deploying a machine learning model for products and services is a young, emerging field, there already are a number of tools to help. However, the tools are not the same ones companies are already using for their “traditional” software projects since they solve different problems. These tools are loosely described as data science platforms, although in 2018, there is great disparity between the functionality these tools offer.

Most of these platforms are either cloud-based or priced per user, which can make it expensive to scale or build independent in-house capability. For this reason, companies should look for a machine learning platform that includes full source code, unlimited commercial use, and turnkey implementation. It’s a worthwhile option for companies looking to build their own machine learning capability without vendor lock-in or external dependency for such critical infrastructure.

But tools are only as good as the people who use them. Therefore, companies should plan to build expertise in DataOps, the recently coined term for the discipline of applying DevOps principles to the requirements of data science. Bringing in machine learning consultants with proven hands-on experience deploying and operating machine learning products can shorten the learning curve.

As more and more businesses get their feet wet with machine learning, there’s an urgent need to understand how they can best prepare their models for use as real, reliable, scalable, and secure products and services. Far too often, companies stall when it comes to implementation because they don’t know how, or haven’t planned for all of the factors involved in actually deploying their models. However, with better know-how and the right tools, there’s nothing holding you back from success.
