Chapter 1. Introduction to MLOps and the AI Life Cycle
In the wake of the global health crisis of 2020, the question of scaling AI in the enterprise has never been more pressing. As many industries try to cope with the instability of a changing landscape, data science, machine learning (ML), and AI have moved from experimental initiatives to necessities.
Despite the growing need for AI to bring a newfound agility to a post-pandemic world, businesses still struggle to pivot their operations around these technologies precisely because it’s not simply a matter of technology; processes and people are also critically important. This report will introduce the data science, ML, and AI project life cycle so that readers can understand what (and who) drives these projects before covering MLOps (short for machine learning operations), a process that brings the required agility and allows for massive scaling of AI initiatives across the enterprise.
Why Are AI Projects So Complex to Execute?
It’s important to understand the challenges that AI projects present in order to properly address and overcome them with good MLOps practices. So, why are AI projects so complex, and why do so many organizations struggle to execute them (even those that succeed in other complex processes and in software development)?
There are two fundamental reasons for this.
Business Needs (and Data) Are Not Static
Not only is data constantly changing, but business needs shift as well. Results of ML models (i.e., mathematical models based on sample data that output predictions—Chapter 2 covers what ML is in more detail) need to be continually relayed back to the business to ensure that the reality of the model aligns with expectations and—critically—addresses the original problem or meets the original goal.
For example, take this (unfortunately) common scenario: let’s say a data team is presented with a business problem, and the team has six months to come up with a solution. The team spends months cleaning data, building models, and refining information into visualizations according to the initial project parameters.
Six months later, the data team presents their work to the business team, and the response is, “Great! Unfortunately, since the start of the project, the original data has changed and so has the behavior of our customers.” That’s six months of wasted effort and time, and it’s back to the drawing board.
Perhaps four additional months pass as the data is refined and tweaked again, only for the team to be told that the original project parameters have changed yet again. Rinse, repeat. The vicious circle has only just begun, with no particular end in sight. The expiration of data (and the changing nature of business, especially in the context of the 2020 health crisis) constantly invalidates the relevance of models. If data teams work in a bubble, their solutions won’t be relevant or provide value outside that bubble.
Not Everyone Speaks the Same Language
Even though AI projects involve people from the business, data science, and IT teams, none of these groups are using the same tools or even—in many cases—sharing the same fundamental skills to serve as a baseline of communication.
Some symptoms of serious communication barriers include:
- First communication between teams at the end of the project: Successful data teams and data projects involve experts in IT, business, and data science from the start. Pulling in expertise at the last minute, when most of the work is already done, is extremely costly and is a sign of larger organizational issues around AI projects.
- Lack of strong leadership: If team leaders don’t support horizontal collaboration (between team members with the same profile or background—for example, data scientists) as well as vertical collaboration (between different types of profiles, like between business and IT), AI projects are doomed to fail.
- Problems with tracking and versioning: It doesn’t take long for email threads to grow in length. Using email to share files is a recipe for disaster when it comes to keeping track of content and versioning data; expect lost data and the exclusion of key stakeholders.
- Lack of strong data governance policies: Organizations typically implement policies for sharing content and protecting data, but “shadow IT” (the deployment of other policies or systems outside of a central team, which can differ widely across the organization) can, again, be a sign of deeper issues with the organizational structure around the AI project life cycle.
Other Challenges
In addition to these two primary challenges, there are many other smaller inefficiencies that prevent businesses from being able to scale AI projects (and for which, as we’ll see later in this report, MLOps provides solutions). Take reproducibility: when companies do not operate with clear and reproducible workflows, it’s very common for people in different parts of the company to unknowingly be building exactly the same solution.
From a business perspective, getting to the 10th or 20th AI project or use case usually still has a positive impact on the balance sheet, but eventually, the marginal value of the next use case is lower than the marginal costs (see Figures 1-1 and 1-2).
One might see these figures and conclude that the most profitable way to approach AI projects is to only address the top 5 to 10 most valuable use cases and stop. But this does not take into account the continued cost of AI project maintenance.
Add ongoing maintenance costs on top of that marginal cost, and each additional use case eventually generates negative value and negative numbers on the balance sheet. Scaling use cases this way is therefore not economically viable, and it’s a big mistake to think that the business will be able to easily generalize Enterprise AI everywhere by simply taking on more AI projects throughout the company.
Ultimately, to continue seeing returns on investment (ROI) from AI projects at scale while taking on ever more use cases, companies must find ways to decrease both the marginal costs and the incremental maintenance costs of Enterprise AI. Robust MLOps practices, again, are one part of the solution.
On top of the challenges of scaling, a lack of transparency and of workflow reusability generally signals poor data governance practices. Imagine that no one understands or has clear access to the work of other members of the data team: in the case of an audit, figuring out how data has been treated and transformed, as well as which data is being used for which models, becomes nearly impossible. As members of the data team leave and new ones are hired, this becomes even more complicated.
For those on the business side, taking a deeper look into the AI project life cycle and understanding how—and why—it works is the starting point to addressing many of these challenges. It helps bridge the gap between the needs and goals of the business and those of the technical sides of the equation to the benefit of the Enterprise AI efforts of the entire organization.
The AI Project Life Cycle
Looking at the data science, ML, and AI project life cycle—henceforth shortened to AI project life cycle—can help contextualize these challenges. In practice, how does one go from problem to solution? From raw data to AI project?
On the surface, it seems straightforward (see Figure 1-3): start with a business goal, get the data, build a model, deploy, and iterate. However, it’s easy to see how managing multiple AI projects throughout their life cycle, especially given the aforementioned challenges, can quickly become difficult in and of itself.
Even though ML models are primarily built by data scientists, that doesn’t mean that they own the entire AI project life cycle. In fact, there are many different types of roles that are critical to building AI projects, including most notably:
- Subject matter experts on the business side: While the data-oriented profiles (data scientist, engineer, architect, etc.) have expertise across many areas, one area in which they tend to fall short is a deep understanding of the business and of the problems or questions that need to be addressed using ML.
- Data scientists: Though most see the data scientist’s role in the ML model life cycle as strictly the model-building portion, it is actually—or at least, it should be—much wider. From the very beginning, data scientists need to work with subject matter experts, understanding and helping to frame business problems in such a way that they can build a viable ML solution.
- Architects: AI projects require resources, and architects help properly allocate those resources to ensure optimal performance of ML models. Without the architect role, AI projects might not perform as expected once they are in use.
- Software engineers and traditional DevOps: Software engineers usually aren’t building ML models, but most organizations are not producing only ML models, either. When it comes to deploying AI projects into the larger business and making sure they work alongside all the other non-AI systems, these roles are critically important.
After considering all these different roles plus breaking down the steps of the AI life cycle more granularly, the picture becomes much more complex (see Figure 1-4).
Given the complexity of the nature of AI projects themselves, the AI project life cycle in the organization, and the number of people across the business that are involved, companies looking to scale AI efforts need a system that keeps track of all the intricacies. That’s where MLOps comes into play.
At its core, MLOps is the standardization and streamlining of data science, ML, and AI project life cycle management. For most traditional organizations, working with multiple ML models is still relatively new.
Until recently, the number of models may have been manageable at a small scale, or there was simply less interest in understanding these models and their dependencies at a company-wide level. Now the tables are turning, and organizations are increasingly looking for ways to formalize a multistage, multidiscipline, multiphase process across a heterogeneous environment, along with a framework of MLOps best practices, which is no small task.
The Role of MLOps in the AI Project Life Cycle
Some believe that deploying ML models in production (i.e., feeding them real data and making them a part of business operations) is the final step—or one of the final steps—of the AI project life cycle. This is far from the case; in fact, deployment is often just the beginning of the ongoing work of monitoring models’ performance and ensuring that they behave as expected.
MLOps isn’t one specific step in the life cycle or a check along the way before passing from one step to another. Rather, MLOps is an underlying process that encompasses and informs all of the steps in the AI project life cycle, helping the organization:
- Reduce risk
Using ML models to drive automatic business decisions without MLOps infrastructure is risky for many reasons, first and foremost because the performance of an ML model can often only be fully assessed in the production environment. Why? Because prediction models are only as good as the data they are trained on, which means that if—or, more likely, when—the data changes, model performance is likely to degrade rapidly. This translates to any number of undesirable business results, from bad press to poor customer experience. (A minimal sketch of this kind of performance monitoring follows this list.)
- Introduce transparency
MLOps is a critical part of transparent strategies for ML. Upper management, the C-suite, and data scientists should all be able to understand what ML models are being used by the business and what effect they’re having. Beyond that, they should arguably be able to drill down to understand the whole data pipeline behind those ML models. MLOps, as described in this report, can provide this level of transparency and accountability.
- Build Responsible AI
The reality is that introducing automation vis-à-vis ML models shifts the fundamental onus of accountability from the bottom of the hierarchy to the top. That is, decisions that were perhaps previously made by individual contributors who operated within a margin of guidelines (for example, what the price of a given product should be or whether or not a person should be accepted for a loan) are now being made by a machine. Given the potential risks of AI projects as well as their particular challenges, it’s easy to see the interplay between MLOps and Responsible AI: teams must have good MLOps principles to practice Responsible AI, and Responsible AI necessitates MLOps strategies.
- Scale
MLOps is important not only because it helps mitigate the risk, but also because it is an essential component to scaling ML efforts (and in turn benefiting from the corresponding economies of scale). To go from the business using one or a handful of models to tens, hundreds, or thousands of models that positively impact the business requires MLOps discipline.
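To make the risk-reduction point concrete, the following is a minimal sketch, in Python, of the kind of check an MLOps monitoring process automates: comparing a production model’s recent accuracy against the accuracy measured at deployment time and flagging degradation. The record structure, threshold, and toy data are hypothetical placeholders, not the API of any particular tool.

# Minimal, illustrative monitoring check: flag a deployed model whose recent
# accuracy has dropped more than a tolerance below its deployment-time baseline.
# The record structure, threshold, and sample data below are hypothetical.

from dataclasses import dataclass
from typing import List


@dataclass
class PredictionRecord:
    predicted_label: int
    actual_label: int  # ground truth, collected after the fact


def accuracy(records: List[PredictionRecord]) -> float:
    """Fraction of predictions that matched the eventual outcome."""
    if not records:
        return 0.0
    correct = sum(1 for r in records if r.predicted_label == r.actual_label)
    return correct / len(records)


def performance_degraded(
    recent_records: List[PredictionRecord],
    baseline_accuracy: float,
    tolerance: float = 0.05,
) -> bool:
    """Return True if recent accuracy fell more than `tolerance` below the baseline."""
    return accuracy(recent_records) < baseline_accuracy - tolerance


if __name__ == "__main__":
    # Toy data standing in for a day's worth of logged predictions.
    recent = [
        PredictionRecord(predicted_label=1, actual_label=1),
        PredictionRecord(predicted_label=0, actual_label=1),
        PredictionRecord(predicted_label=1, actual_label=0),
        PredictionRecord(predicted_label=0, actual_label=0),
    ]
    if performance_degraded(recent, baseline_accuracy=0.90):
        print("Alert: model performance has dropped below the agreed threshold.")

In practice, a check along these lines would typically run on a schedule against logged predictions and ground-truth outcomes, and would trigger retraining or human review when it fires; the point here is only to show how small and automatable the core logic is.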
Each of these points is an important yet challenging part of the transformation of the organization around data. The next section will go more in depth on the rise of MLOps and the role it plays in the organization’s success in AI initiatives.
MLOps: What Is It, and Why Now?
Machine learning is not new, and neither is its use in business contexts. So why is MLOps—or the systematic streamlining of AI projects—becoming a popular topic now (see Figure 1-5)? Until recently, teams have been able to get by without defined and centralized MLOps processes mostly because, at an enterprise level, they weren’t leveraging ML models on a large enough scale.
That’s not to say that MLOps is only important for organizations creating lots of AI projects. In fact, MLOps matters to any team that has even one model in production, since, depending on the model, continuous monitoring and adjustment are essential. Think of a travel site whose pricing model requires top-notch MLOps to ensure that the model continuously delivers business results and doesn’t cause the company to lose money.
However, MLOps really tips the scales as critical for risk mitigation when a centralized team (with unique reporting of its activities, meaning that there can be multiple such teams at any given enterprise) has more than a handful of operational models. At this point, it becomes difficult to have a global view of the states of these models without some standardization.
This Sounds Familiar...
If the definition (or even the name MLOps) sounds familiar, that’s because it pulls heavily from the concept of DevOps, which streamlines the practice of software changes and updates. Indeed, the two have quite a bit in common. For example, they both center around:
- Robust automation and trust between teams
- The idea of collaboration and increased communication between teams
- The end-to-end service life cycle (build-test-release)
- Prioritizing continuous delivery as well as high quality
Yet there is one critical difference between MLOps and DevOps that makes the latter not immediately transferable to data science teams, and it relates to one of the challenges presented in the beginning of this chapter: deploying software code in production is fundamentally different than deploying ML models into production.
While software code is relatively static, data is always changing, which means ML models are constantly learning and adapting—or not, as the case may be—to new inputs. The complexity of this environment, including the fact that ML models are made up of both code and data, is what makes MLOps a new and unique discipline.
Key Components of a Robust MLOps Practice
Good MLOps practices will help teams on both the business and tech sides, at a minimum, to:
- Keep track of different model versions, i.e., different variations of models with the same ultimate business goal, in order to test them and find the best one
- Understand whether new versions of models are better than previous versions, and promote the better-performing models to production (the sketch after this list illustrates this kind of bookkeeping)
- Ensure (at defined periods—daily, monthly, etc.) that model performance is not degrading
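As an illustration of the first two points, here is a minimal sketch, in Python, of the bookkeeping a model registry performs: registering several versions of a model for one business use case, comparing them on a shared validation metric, and promoting the best performer to production. The class and field names are hypothetical; real MLOps platforms provide this functionality out of the box.

# Illustrative model registry: track versions for one use case, compare them
# on a shared validation metric, and promote the best one to production.
# Names and structure are hypothetical, not tied to any specific product.

from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ModelVersion:
    version: str
    validation_accuracy: float
    in_production: bool = False


@dataclass
class ModelRegistry:
    use_case: str
    versions: Dict[str, ModelVersion] = field(default_factory=dict)

    def register(self, version: str, validation_accuracy: float) -> None:
        """Record a new model version and its metric for this use case."""
        self.versions[version] = ModelVersion(version, validation_accuracy)

    def best_version(self) -> Optional[ModelVersion]:
        """Return the version with the highest validation accuracy, if any."""
        if not self.versions:
            return None
        return max(self.versions.values(), key=lambda v: v.validation_accuracy)

    def promote_best(self) -> Optional[ModelVersion]:
        """Mark the best-performing version as the production model."""
        best = self.best_version()
        for v in self.versions.values():
            v.in_production = v is best
        return best


if __name__ == "__main__":
    registry = ModelRegistry(use_case="churn-prediction")
    registry.register("v1", validation_accuracy=0.81)
    registry.register("v2", validation_accuracy=0.86)
    promoted = registry.promote_best()
    print(f"Promoted {promoted.version} with accuracy {promoted.validation_accuracy:.2f}")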
At a more detailed level, there are five key components of MLOps: development, deployment, monitoring, iteration, and governance. The bulk of this report will cover at a high level the three components that are most important for those on the business side to understand (both conceptually and in terms of the role of the business in those components): development, monitoring, and governance.1
Closing Thoughts
MLOps is critical—and will only continue to become more so—to both scaling AI across an enterprise and ensuring it is deployed in a way that minimizes risk. Both of these are goals with which business leaders should be deeply concerned.
While certain parts of MLOps can be quite technical, it’s only by streamlining the entire AI life cycle that the business will be able to develop AI capabilities that scale its operations. That’s why business leaders should not only understand the components and complexities of MLOps but also have a seat at the table when deciding which tools and processes the organization will use to execute.
The next chapter is the first to dive into the details of MLOps, starting with the development of ML models themselves. Again, the value for business leaders of understanding MLOps systems at this level of detail is the ability to drive efficiencies from business problem to solution. This is something to keep in mind throughout Chapters 2–4.
1 This report will cover the other two components (deployment and iteration) only at a high level. Those looking for more detail on each component should read Introducing MLOps (O’Reilly).