Chapter 1. Challenges and Better Paths in Delivering ML Solutions

The most dangerous kind of waste is the waste we do not recognize.

Shigeo Shingo, leading expert on the Toyota Production System

Not everything that is faced can be changed, but nothing can be changed until it is faced.

James Baldwin, writer and playwright

Many individuals and organizations start their machine learning (ML) journey with high hopes, but the lived experiences of many ML practitioners tell us that the journey of delivering ML solutions is riddled with traps, detours, and sometimes even insurmountable barriers. When we peel back the hype and the glamorous claims of data science being the sexiest job of the 21st century, we often see ML practitioners bogged down by burdensome manual work; firefighting in production; team silos; and unwieldy, brittle, and complex solutions.

This hinders, or even prevents, teams from delivering value to customers and also frustrates an organization’s investments and ambitions in AI. As hype cycles go, many travel past the peak of inflated expectations and crash-land into the trough of disillusionment. We might see some high-performing ML teams move on to the plateau of productivity and wonder if we’ll ever get there.

Regardless of your background—be it academia, data science, ML engineering, product management, software engineering, or something else—if you are building products or systems that involve ML, you will inevitably face the challenges that we describe in this chapter. This chapter is our attempt to distill our experience—and the experience of others—in building and delivering ML-enabled products. We hope that these principles and practices will help you avoid unnecessary pitfalls and find a more reliable path for your journey.

We kick off this chapter by acknowledging the dual reality of promise and disappointment in ML in the real world. We then examine both high-level and day-to-day challenges that often cause ML projects to fail, before outlining a better path based on the principles and practices of Lean delivery, product thinking, and agile engineering. Finally, we briefly discuss why these practices are relevant to all ML teams, and especially to teams delivering Generative AI products and large language model (LLM) applications. Consider this chapter a miniature representation of the remainder of this book.

ML: Promises and Disappointments

In this section, we look at evidence of continued growth of investments and interest in ML before taking a deep dive into the engineering, product, and delivery bottlenecks that impede the returns on these investments.

Continued Optimism in ML

Putting aside the hype and our individual coordinates on the hype cycle for a moment, ML continues to be a fast-advancing field that provides many techniques for solving real-world problems. Stanford’s “AI Index Report 2022” found that in 2021, global private investment in AI totaled around $94 billion, more than double the total private investment in 2019, before the COVID-19 pandemic. McKinsey’s “State of AI in 2021” survey indicated that AI adoption was continuing its steady rise: 56% of all respondents reported AI adoption in at least one function, up from 50% in 2020.

The Stanford report also found companies are continuing to invest in applying a diverse set of ML techniques—e.g., natural language understanding, computer vision, reinforcement learning—across a wide array of sectors, such as healthcare, retail, manufacturing, and financial services. From a jobs and skills perspective, Stanford’s analysis of millions of job postings since 2010 showed that the demand for ML capabilities has been growing steadily year-on-year in the past decade, even through and after the COVID-19 pandemic.

While these trends are reassuring from an opportunities perspective, they are also highly concerning if we journey ahead without confronting and learning from the challenges that have ensnared us—both the producers and consumers of ML systems—in the past. Let’s take a look at these pitfalls in detail.

Why ML Projects Fail

Despite the plethora of chart-topping Kaggle notebooks, it’s common for ML projects to fail in the real world. Failure can come in various forms, including:

  • Inability to ship an ML-enabled product to production

  • Shipping products that customers don’t use

  • Deploying defective products that customers don’t trust

  • Inability to evolve and improve models in production quickly enough

Just to be clear—we’re not trying to avoid failure altogether. As we all know, failure is as valuable as it is inevitable, and there is a lot we can learn from it. The problem arises as the cost of failure increases—missed deadlines, unmet business outcomes, and sometimes even collateral damage: harm to humans and loss of jobs and livelihoods of many employees who aren’t even directly involved in the ML initiative.

What we want is to fail in a low-cost and safe way, and often, so that we improve our odds of success for everyone who has a stake in the undertaking. We also want to learn from failures—by documenting and socializing our experiments and lessons learned, for example—so that we don’t fail in the same way again and again. In this section, we’ll look at some common challenges—spanning product, delivery, and engineering—that reduce our chances of succeeding, and in the next section, we’ll explore ways to reduce the costs and likelihood of failure and deliver valuable outcomes more effectively.

Let’s start at a high level and then zoom in to look at day-to-day barriers to the flow of value.

High-level view: Barriers to success

Taking a high-level view—i.e., at the level of an ML project or a program of work—we’ve seen ML projects fail to achieve their desired outcomes due to the following challenges:

Failing to solve the right problem or deliver value for users

In this failure mode, even if we have all the right engineering practices and “build the thing right,” we fail to move the needle on the intended business outcomes because the team didn’t “build the right thing.” This often happens when the team lacks product management capabilities or lacks alignment with product and business. Without mature product thinking capabilities in a team, it’s common for ML teams to overlook human-centered design techniques—e.g., user testing, user journey mapping—to identify the pains, needs, and desires of users.1

Challenges in productionizing models

Many ML projects do not see the light of day in production. A 2021 Gartner poll of roughly 200 business and IT professionals found that only 53% of AI projects make it from pilot into production, and among those that succeed, it takes an average of nine months to do so.2 The challenges of productionizing ML models aren’t limited to compute issues such as model deployment; they can also relate to data (e.g., not having inference data available at suitable quality, latency, and distribution in production).

Challenges after productionizing models

Once in production, it’s common to see ML practitioners bogged down by toil and tedium that inhibits iterative experimentation and model improvements. In its “2021 Enterprise Trends in Machine Learning” report, Algorithmia reported that 64% of companies take more than one month to deploy a new model, an increase from 58% as reported in Algorithmia’s 2020 report. The report also notes 38% of organizations spend more than 50% of their data scientists’ time on deployment—and that only gets worse with scale.

Long or missing feedback loops

During model development, feedback loops are often long and tedious, and this diverts valuable time from important ML product development work. The primary way of knowing if everything works might be to manually run a training notebook or script, wait for it to complete—sometimes for hours—and manually wade through logs or printed statements to eyeball some model metrics and determine whether the model is still as good as before. This doesn’t scale well, and more often than not we are hindered by unexpected errors and quality degradations during development and even in production.

Many models aren’t deployed with mechanisms to learn from production—e.g., data collection and labeling mechanisms. Without this feedback loop, teams forgo opportunities to improve model quality through data-centric approaches.
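For illustration, here is a minimal sketch—using only the Python standard library, with hypothetical field names such as basket_size—of the kind of mechanism we mean: a prediction service logs its inputs and outputs so that they can later be joined with ground-truth labels and fed back into training. It is a sketch of the mechanism under those assumptions, not a production design.

```python
import json
import time
import uuid
from pathlib import Path

PREDICTION_LOG = Path("prediction_log.jsonl")  # hypothetical append-only log


def log_prediction(features, prediction):
    """Record each prediction with an ID so ground truth can be joined to it later."""
    record_id = str(uuid.uuid4())
    record = {
        "id": record_id,
        "timestamp": time.time(),
        "features": features,
        "prediction": prediction,
    }
    with PREDICTION_LOG.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return record_id


def build_training_examples(labels):
    """Join logged predictions with labels collected later (e.g., from users or annotators)."""
    examples = []
    with PREDICTION_LOG.open() as f:
        for line in f:
            record = json.loads(line)
            if record["id"] in labels:
                examples.append({**record["features"], "label": labels[record["id"]]})
    return examples


# Usage: log at inference time, then periodically join with labels for retraining.
record_id = log_prediction({"basket_size": 3, "is_returning_customer": True}, prediction=0.82)
print(build_training_examples({record_id: 1.0}))
```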

Brittle and convoluted codebases

ML codebases are often full of code smells—e.g., badly named variables, long and convoluted functions, tightly coupled spaghetti code—that make the code difficult to understand and therefore difficult to change. The complexity and the risk of errors and bugs grow exponentially with each feature delivered. Modifying or extending the codebase becomes a daunting task as developers need to unravel the intricacies of the convoluted codebase and related systems or pipelines.

If the ML system lacks automated tests, it becomes even more brittle. In addition, the lack of tests sows the seeds for even more complexity because nobody wants to refactor if it means they might accidentally and unknowingly introduce regressions. This all leads to longer development cycles and reduced productivity.

Data quality issues in production

We’ll illustrate this point with an example: A study in the British Medical Journal found that none of the hundreds of predictive tools that were developed to help hospitals detect COVID-19 actually worked. There were many reasons for the failure of these models, and one key theme was data quality. There was data leakage (which made the models appear better than they really were), mislabeled data, and mismatches between the distribution of training data and actual data in production, among other problems.

To compound the problem, the aforementioned challenges in retraining, reevaluating, retesting, and redeploying models in an automated fashion further inhibit our ability to respond to changing data distributions over time.

Inadequate data security and privacy

Data security and privacy are cross-cutting concerns that should be the responsibility of everyone in the organization, from product teams to data engineering teams and every team in between. In the context of ML, there are several unique data security and privacy challenges that can cause a product to fail. One such challenge is data poisoning, which involves injecting malicious or biased data into the training set to corrupt the model. Recall the famous (or infamous) Microsoft Tay chatbot, which was taken down within a day of release because it learned inflammatory and offensive content from users who deliberately attempted to train it to produce such responses. More recently, with the advent of LLMs, we’ve seen prompt injection attacks cause custom chatbots to leak training data and reveal their system prompts.

Ethically problematic ML products

One needn’t look far to see how ML can go wrong in the wild. For example, you may have heard of Amazon’s recruitment tool that penalized resumes containing the word “women” (Amazon decommissioned the tool within a year of its release). In another example, a benchmark analysis by ProPublica found that an ML system used to predict recidivism had a false positive rate for Black defendants twice as high as for White defendants, and a false negative rate for White defendants twice as high as for Black defendants.

Now that we’ve painted a high-level picture of why ML projects fail, let’s take a look at the day-to-day challenges that make it hard for them to succeed.

Microlevel view: Everyday impediments to success

At the microlevel—i.e., at the level of delivering features in an ML project—there are several bottlenecks that impede our ability to execute on our ideas.

This view is best seen by contrasting a user story in the agile development lifecycle under two conditions: a low-effectiveness environment and a high-effectiveness environment. In our experience, these roadblocks stem not only from our approaches to ML and engineering, but also from suboptimal collaboration workflows and unplanned work.

Lifecycle of a story in a low-effectiveness environment

Let’s journey with Dana—our book’s protagonist and ML engineer—in this scenario. The character is fictional but the pain is real:

  • Dana starts her day having to deal immediately with alerts for problems in production and customer support queries on why the model behaved in a certain way. Dana’s team is already suffering from alert fatigue, which means they often ignore the alerts coming in. This only compounds the problem and the number of daily alerts.

  • Dana checks a number of logging and monitoring systems to triage the issue, as there are no aggregated logs across systems. She manually prods the model to find an explanation for why the model produced that particular prediction for that customer. She vaguely remembers that there was a similar customer query last month but cannot find any internal documentation on how to resolve such customer queries.

  • Dana sends a reminder on the team chat to ask for a volunteer to review a pull request she created last week, so that it can be merged.

  • Finally, Dana resolves the issue and finds some time to code and picks up a task from the team’s wallboard.

  • The codebase doesn’t have any automated tests, so after making some code changes, Dana needs to restart and rerun the entire training script or notebook, wait for the duration of model training—40 minutes in her case—and hope that it runs without errors. She also manually eyeballs some print statements at the end to check that the model metric hasn’t declined. Sometimes, the code blows up midway because of an error that slipped in during development.

  • Dana wants to take a coffee break but feels guilty for doing so because there’s just too much to do. So, she makes a coffee in two minutes and sips it at her desk while working away.

  • While coding, Dana receives comments and questions on the pull request. For example, one comment notes that a particular function is too long and hard to read. Dana switches contexts, types out a response—without necessarily updating the code—justifying design decisions she made last week, and mentions that she will create a story card to refactor the long function later.

  • After investing two weeks in a solution (without pair programming), she shares it back with the team. The team’s engineering lead notes that the solution introduces too much complexity to the codebase and needs to be rewritten. He adds that the story wasn’t actually high priority in any case, and there was another story that Dana can look at instead.

Can you imagine how frustrated and demotivated Dana must feel? The long feedback cycles and context switching—between doing ML and other burdensome tasks, such as pull request reviews—limited how much she could achieve. Context switching also had a real cognitive cost that left her feeling exhausted and unproductive. She sometimes logs on again after office hours because she feels the pressure to finish the work—and there just isn’t enough time in the day to complete it all.

Long feedback loops at each microlevel step lead to an overall increase in cycle time, which means fewer experimentation and iteration cycles in a day (see Figure 1-1). Work and effort often move backward and laterally between multiple tasks, disrupting the state of flow.

Lifecycle of a story in a high-effectiveness environment

Now, let’s take a look at how different things can be for Dana in a high-effectiveness environment:

  • Dana starts the day by checking the team’s project management tool and then attends standup, where she picks up a story card. Each story card articulates its business value—validated from a product perspective—and provides clarity about what to work on, with a clear definition of done.

  • Dana pairs with a teammate to write code to solve the problem specified in the story card. As they are coding, they help catch each other’s blind spots, provide each other with real-time feedback—e.g., a simpler way to solve a particular problem—and share knowledge along the way.

  • As they code, each incremental code change is quickly validated within seconds or minutes by running automated tests—both existing tests and new tests that they write. They also run the end-to-end ML model training pipeline locally on a small dataset and get feedback within a minute on whether everything is still working as expected (see the smoke test sketch after this list).

  • If they need to do a full ML model training, they can trigger training on large-scale infrastructure from their local machine with their local code changes, without the need to “push to know if something works.” Model training then commences in an environment with the necessary access to production data and scalable compute resources.

  • They commit the code change, which passes through a number of automated checks on the continuous integration and continuous delivery (CI/CD) pipeline before triggering full ML model training, which can take between minutes to hours depending on the ML model architecture and the volume of data.

  • Dana and her pair focus on their task for a few hours, peppered with regular breaks, coffee, and even walks (separately). They can do this without a tinge of guilt because they know it’ll help them work better, and because they have confidence in the predictability of their work.

  • When the model training completes, a model deployment pipeline is automatically triggered. The deployment pipeline runs model quality tests and checks if the model is above the quality threshold for a set of specified metrics (e.g., accuracy, precision). If the model is of a satisfactory quality, the newly trained model artifact is automatically packaged and deployed to a preproduction environment, and the CI/CD pipeline also runs post-deployment tests on the freshly deployed artifact.

  • When the story card’s definition of done is satisfied, Dana informs the team, calls for a 20-minute team huddle to share context with the team, and demonstrates how the solution meets the definition of done. If they had missed anything, any teammate could provide feedback there and then.

  • If no further development work is needed, another teammate then puts on the “testing hat” and brings a fresh perspective when testing if the solution satisfies the definition of done. The teammate can do exploratory and high-level testing within a reasonable timeframe because most, if not all, of the acceptance criteria in the new feature have already been tested via automated tests.

  • Whenever the business wants to, the team can release the change gradually to users in production while monitoring business and operational metrics. Because the team has maintained good test coverage, when the pipeline is all green they can deploy the new model to production without any feelings of anxiety.
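As a flavor of the fast feedback described above, here is a minimal sketch of a training smoke test using pytest, NumPy, and scikit-learn. The train_model function and the tiny synthetic dataset are hypothetical stand-ins for your own pipeline; the point is that the whole path runs in seconds and fails loudly if something breaks.

```python
# test_training_smoke.py — a sketch of a fast end-to-end training smoke test
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def train_model(X, y):
    """Stand-in for your real training pipeline (feature engineering + fit)."""
    model = LogisticRegression(max_iter=200)
    model.fit(X, y)
    return model


def test_training_pipeline_smoke():
    # A tiny synthetic dataset keeps the test fast enough to run on every change.
    rng = np.random.default_rng(42)
    X = rng.normal(size=(200, 5))
    y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

    model = train_model(X, y)
    predictions = model.predict(X)

    # We only assert that the pipeline runs end to end and beats a trivial baseline,
    # not that the model meets its final quality bar.
    assert predictions.shape == (200,)
    assert accuracy_score(y, predictions) > 0.6
```

A test like this can run on every code change—long before a full training run is triggered on the CI/CD pipeline.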

Dana and her teammates make incremental progress on the delivery plan daily. Team velocity is higher and more stable than in the low-effectiveness environment. Work and effort generally flow forward, and Dana leaves work feeling satisfied and with wind in her hair. Huzzah!

To wrap up the tale of two velocities, let’s zoom out and compare in Figure 1-1 the time it takes to get something done in a high-effectiveness environment (top row) and a low-effectiveness environment (bottom row).

Figure 1-1. Fast feedback cycles underpin the agility of teams in a high-effectiveness environment (source: image adapted from “Maximizing Developer Effectiveness” by Tim Cochran)

Zooming in a little more, Table 1-1 shows the feedback mechanisms that differentiate high-effectiveness environments from low-effectiveness environments. Each row is a key task in the model delivery lifecycle, and the columns compare their relative feedback cycle times.

Table 1-1. Comparison of feedback mechanisms and time to feedback in high- and low-effectiveness environments (times in approximate orders of magnitude)

| Task | High-effectiveness environment | Low-effectiveness environment |
| --- | --- | --- |
| Testing if code changes worked as expected | Automated testing (~ seconds to minutes) ⬤⬤ | Manual testing (~ minutes to hours) ⬤⬤⬤⬤ |
| Testing if the ML training pipeline works end to end | Training smoke test (~ 1 minute) ⬤⬤ | Full model training (~ minutes to hours, depending on the model architecture) ⬤⬤⬤⬤⬤ |
| Getting feedback on code changes | Pair programming (~ seconds to minutes) ⬤⬤ | Pull request reviews (~ hours to days) ⬤⬤⬤⬤⬤⬤⬤ |
| Understanding if the application is working as expected in production | Monitoring in production (~ seconds, as it happens) | Customer complaints (~ days, or longer if not directly reported) ⬤⬤⬤⬤⬤⬤⬤ |

Now that we’ve painted a picture of common pitfalls in delivering ML solutions and a more effective alternative, let’s look at how teams can move from a low-effectiveness environment to a high-effectiveness environment.

Is There a Better Way? How Systems Thinking and Lean Can Help

A bad system will beat a good person every time.

W. Edwards Deming, economist and industrial engineer

In the previous section, we saw Dana in the low-effectiveness environment facing unnecessary toil and rework, which contributes to constant frustration and, ultimately, possibly to burnout. The toil, frustration, and burnout that ML practitioners often face indicate that our system of work can be improved.

In this section, we’ll explore why MLOps alone is insufficient for improving the effectiveness of ML practitioners. We’ll put on a systems thinking lens to identify a set of practices required for effective ML delivery. Then we’ll look to Lean for principles and practices that can help us operate these subsystems in an interconnected way that reduces waste and maximizes the flow of value.

You Can’t “MLOps” Your Problems Away

One reflexive but misguided approach to improving the effectiveness of ML delivery is for organizations to turn to MLOps practices and ML platforms. While they may be necessary, they are definitely not sufficient on their own.

In the world of software delivery, you can’t “DevOps” or “platform” your problems away. DevOps helps to optimize and manage one subsystem (relating to infrastructure, deployment, and operations) but other subsystems (e.g., software design, user experience, software delivery lifecycle) are just as important in delivering great products.

Likewise, in ML, you can’t “MLOps” your problems away. No amount of MLOps practices and platform capabilities can save us from the waste and rework that result from the lack of software engineering practices (e.g., automated testing, well-factored design, etc.) and product delivery practices (e.g., customer journey mapping, clear user stories, etc.). MLOps and ML platforms aren’t going to write comprehensive tests for you, talk to users for you, or reduce the negative impacts of team silos for you.

In a study of 150 successful ML-driven, customer-facing applications at Booking.com, evaluated through rigorous randomized controlled trials, the authors concluded that the key factor for success is an iterative, hypothesis-driven process, integrated with other disciplines, such as product development, user experience, computer science, software engineering, causal inference, and others. This finding is aligned with our own experience delivering multiple ML and data products. We have seen time and again that delivering successful ML projects requires a multidisciplinary approach across these five disciplines: product, software engineering, data, ML, and delivery (see Figure 1-2).

Figure 1-2. Delivering ML projects successfully requires a multidisciplinary approach across product, delivery, ML, software engineering, and data

To help us see the value of putting these five disciplines together—or the costs of focusing only on some disciplines while ignoring others—we can put on the lens of systems thinking. In the next section, we’ll look at how systems thinking can help uncover the interconnected disciplines required to effectively deliver ML products.

See the Whole: A Systems Thinking Lens for Effective ML Delivery

Systems thinking helps us shift our focus from individual parts of a system to relationships and interactions between all the components that constitute a system. Systems thinking gives us mental models and tools for understanding—and eventually changing—structures that are not serving us well, including our mental models and perceptions.

You may be asking, why should we frame ML product delivery as a system? And what even is a system? Donella H. Meadows, a pioneer in systems thinking, defines a system as an interconnected set of elements that is coherently organized in a way that achieves something. A system must consist of three kinds of things: elements, interconnections, and a function or purpose.

Let’s read that again in the context of delivering ML products. A system must consist of three kinds of things (see Figure 1-3):

Elements

Such as data, ML experiments, software engineering, infrastructure and deployment, users and customers, and product design and user experience

Interconnections

Such as cross-functional collaboration and production ML systems creating data for subsequent labeling and retraining

A function or purpose of the ML product

Such as helping users find the most suitable products

Figure 1-3. These components of ML product delivery are inherently interconnected

Our ability to see and optimize information flow in these interconnections helps us effectively deliver ML products. In contrast, teams that frame ML product delivery solely as a data and ML problem are more likely to fail because the true, holistic nature of the system (for example, user experience being a “make-or-break” consideration in the product’s success) will eventually catch up and reveal itself.

Systems thinking recognizes that a system’s components are interconnected and that changes in one part of the system can have ripple effects throughout the rest of the system. This means that to truly understand and improve a system, we need to consider the system as a whole and how all its parts work together.

Thankfully, there is a philosophy that can help us improve information flow in the interconnections between the elements of an ML delivery system, and that is Lean.

The Five Disciplines Required for Effective ML Delivery

In this section, we’ll start with a crash course on what Lean is and how it can help us deliver ML products more effectively. Then we’ll briefly explore the five disciplines that are required in ML delivery—product, delivery, software engineering, data, and ML—and describe the key principles and practices in each discipline that provide the fast feedback ML teams need to iterate toward building the right product.

As a quick caveat, each of these five disciplines warrants a book—if not a collection of books—and the principles and practices we lay out in this chapter are by no means exhaustive. Nonetheless, they form a substantial starting point, and they are the principles and practices we bring to any ML project to help us deliver ML solutions effectively. This section will chart our path at a high level, and we’ll dive into the details in the remaining chapters of the book.

What is Lean, and why should ML practitioners care?

In ML projects (as with many other software or data projects), it’s common for teams to experience various forms of waste. For example, you may have invested time and effort to get a feature “done,” only to realize eventually that the feature did not have demonstrable value for the customer. Or perhaps you may have wasted days waiting on another team in back-and-forth handoffs. Or maybe you’ve had your flow unexpectedly disrupted by defects or bugs in your product.3 All these wastes contribute to negative outcomes such as release delays and missed milestones, more work (and the feeling that there just isn’t enough time to finish all the work), stress, and consequently low team morale.

If you have experienced any of these negative outcomes, first of all, welcome to the human condition. These are challenges we’ve personally experienced and will continue to experience to some extent because no system can be 100% waste-free or noise-free.

Second of all, Lean principles and practices can help. Lean enables organizations to better serve customers by identifying customer value, and to efficiently deliver products that satisfy customer needs. By involving the voice of the customer in the development and delivery process, teams can better understand the end users’ needs and build relevant products for them. Lean helps us get better at what we do and enables us to minimize waste and maximize value.

Lean practices originated from Toyota in the 1950s. The philosophy was initially known as the Toyota Production System (TPS). James P. Womack and Daniel T. Jones later refined and popularized it as Lean principles in their book The Machine That Changed the World (Free Press). The following five Lean principles (see Figure 1-4) were key in transforming the automotive, manufacturing, and IT industries, among others:

Principle 1: Identify value

Determine what is most valuable to the customer and focus on maximizing that value.

Principle 2: Map the value stream

Identify the steps in the process that add value and eliminate those that do not.

Principle 3: Create flow

Streamline the process to create a smooth and continuous flow of work.

Principle 4: Establish pull

Use customer demand to trigger production and avoid overproduction.

Principle 5: Continuous improvement

Continuously strive for improvement and eliminate waste in all areas of the value chain.

Figure 1-4. The five principles of Lean

In our experience delivering ML products, Lean steers us toward value-creating work, which then creates a positive feedback loop of customer satisfaction, team morale, and delivery momentum. For example, instead of “pushing out” features because they involve shiny technologies, we first identify and prioritize the features that will bring the most value to users (principle 1) and “pull” them into our delivery flow when demand has been established (principle 4). In contrast, in instances where we didn’t practice this, we’d end up investing time and effort to complete a feature that added complexity to the codebase without any demonstrable value. To those with keen Lean eyes, yes—you’ve just spotted waste!

Value stream mapping (principle 2) is a tool that lets us visually represent all the steps and resources involved in delivering a unit of value (e.g., a product feature) to customers. Teams can use this tool to identify waste, work toward eliminating waste, and improve the flow of value (principle 3).

To map your team or product’s value stream, you can follow these steps:

  1. Identify the product or service being mapped. This could be a single product or an entire process.

  2. Create the current state map. Visually represent the current process, including all the steps and materials (including time and labor) involved, from raw materials to finished product.

  3. Identify value-added and non-value-added activities. Determine which steps add value to the product or service and which do not.

  4. Identify waste. Look for areas of overproduction, waiting, defects, overprocessing, excess inventory, unnecessary motion, excess transport, unnecessary use of raw materials, and unnecessary effort.

  5. Create a future state map. Based on the analysis of the current state map, redesign the process to eliminate waste and create a more efficient flow of materials and information.

  6. Implement changes. Put the redesigned process into practice and continuously monitor and improve (principle 5).
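To make the mapping concrete, here is a toy sketch in Python with entirely made-up steps and durations: summing value-added time versus waiting time gives the lead time and flow efficiency, and sorting by waiting time points to the biggest sources of waste.

```python
# A toy value stream for shipping one model improvement (all numbers are illustrative).
# Each step records hands-on (value-added) hours and waiting hours.
value_stream = [
    {"step": "Write story and acceptance criteria", "work_hrs": 2, "wait_hrs": 0},
    {"step": "Wait for data access approval", "work_hrs": 0, "wait_hrs": 40},
    {"step": "Feature engineering and training", "work_hrs": 16, "wait_hrs": 4},
    {"step": "Pull request review", "work_hrs": 1, "wait_hrs": 24},
    {"step": "Manual regression testing", "work_hrs": 6, "wait_hrs": 8},
    {"step": "Deployment and release approval", "work_hrs": 2, "wait_hrs": 16},
]

work_hrs = sum(s["work_hrs"] for s in value_stream)
wait_hrs = sum(s["wait_hrs"] for s in value_stream)
lead_time = work_hrs + wait_hrs

print(f"Lead time: {lead_time} hours ({wait_hrs / lead_time:.0%} of it spent waiting)")
for step in sorted(value_stream, key=lambda s: s["wait_hrs"], reverse=True)[:2]:
    print(f"Top source of waiting: {step['step']} ({step['wait_hrs']} hours)")
```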

Now that we have a basic working knowledge of Lean, let’s look at how Lean intersects with the five disciplines to create a set of practices that can help ML teams shorten feedback loops and rapidly iterate toward a valuable product. When put together, these practices help create several emergent, desirable, and mutually reinforcing characteristics in our system of delivering ML products: faster feedback, cheaper and fewer failures, predictable delivery, and most importantly, valuable outcomes.

Note

If you find the explanations of each practice to be too brief in this chapter, don’t worry! Throughout this book, we’ll elaborate on why and how we apply these and other practices in the context of building ML products.

The first discipline: Product

Without the product discipline, no amount of expertise in the other disciplines (e.g., ML, data, software engineering) can help a team deliver ML products effectively. When we don’t understand users’ needs and the organization’s business model, it makes it hard to gain alignment from business to get started. Even when teams do get started, the lack of a product-oriented approach can leave them in a vacuum of product knowledge that is quickly filled with unsubstantiated assumptions, which tends to lead to teams over-engineering unvalidated features, and ultimately wasting valuable energy and resources.

Without understanding the business model and customer needs, it’s easy to lose momentum and direction. In contrast, with a product-oriented approach, ML teams can start with the end in mind, continuously test their assumptions, and ensure they are building solutions that are relevant to the needs of their customers.

With the Lean mindset, we recognize that all our ideas are based on assumptions that need to be tested and that many of these assumptions may be proven wrong. Lean provides a set of principles and practices to test our hypotheses, for example through prototype testing, safe-to-fail experiments, and build-measure-learn cycles, among others. Each experiment provides learnings that help us make informed decisions to persevere, pivot, or stop. By pivoting or ditching bad ideas early on, we can save time and resources and focus on ideas that will bring value to customers. Lean helps us move more quickly and “execute on opportunities by building the right thing at the right time and stop wasting people’s time on ideas that are not valuable.”4

As Henrik Kniberg of Spotify puts it: “Product development isn’t easy. In fact, most product development efforts fail, and the most common reason for failure is building the wrong product.”5 The goal here is not to avoid failure, but to fail more quickly and safely by creating fast feedback loops for building empathy and for learning. Let’s look at some practices that can help us achieve that.

Discovery

Discovery is a set of activities that helps us better understand the problem, the opportunity, and potential solutions. It provides a structure for navigating uncertainty through rapid, time-boxed, iterative activities that involve various stakeholders and customers. As eloquently articulated in Lean Enterprise (O’Reilly), the process of creating a shared vision always starts with clearly defining the problem, because having a clear problem statement helps the team focus on what is important and ignore distractions.

Discovery makes extensive use of visual artifacts to canvas, externalize, debate, test, and evolve ideas. Some useful visual ideation canvases include the Lean Canvas and Value Proposition Canvas, and there are many others. During discovery, we intentionally put customers and the business at the center and create ample space for the voice of the customer—gathered through activities such as user journey mapping, contextual enquiry, customer interviews, among others—as we formulate and test hypotheses about the problem/solution fit and product/market fit of our ideas.

In the context of ML, Discovery techniques help us assess the value and feasibility of candidate solutions early on so that we can go into delivery with grounded confidence. One helpful tool in this regard is the Data Product Canvas, which provides a framework for connecting the dots between data collection, ML, and value creation. It’s also important to use Discovery to articulate measures of success—and get alignment and agreement among stakeholders—for how we’d evaluate the fitness-for-purpose of candidate solutions.

Lean Enterprise has an excellent chapter on Discovery, and we would encourage you to read it for an in-depth understanding of how you can structure and facilitate Discovery workshops in your organization. Discovery is also not a one-and-done activity—the principles can be practiced continuously as we build, measure, and learn our way toward building products that customers value.

Prototype testing

Have you heard of the parable of the ceramic pots?6 In this parable, a ceramic pottery teacher tasked half of the class with creating the best pot possible—but only one pot each. The other half of the class was instructed to make as many pots as possible within the same time frame. In the end, the latter group—which had the benefit of iteratively developing many prototypes—produced the higher-quality pots.

Prototypes allow us to rapidly test our ideas with users in a cost-effective way and to validate—or invalidate—our assumptions and hypotheses. They can be as simple as “hand-sketched” drawings of an interface that users would interact with, or they can be clickable interactive mockups. In some cases, we may even opt for a “Wizard of Oz” prototype: a real, working product, but with all product functions carried out manually behind the scenes, unbeknownst to the person using the product.7 (It’s important to note that “Wizard of Oz” is for prototype testing, not for running production systems. That misapplication, bluntly termed “artificial artificial intelligence”, involves unscalable human effort to solve problems that AI can’t solve.)

Whichever method you pick, prototype testing is especially useful in ML product delivery because we can get feedback from users before any costly investments in data, ML, and MLOps. Prototype testing helps us shorten our feedback loop from weeks or months (time spent on engineering effort in data, ML, and MLOps) to days. Talk about fast feedback!

The second discipline: Delivery

If the product discipline is concerned with what we build and why, the delivery discipline speaks to how we execute our ideas. The mechanics of delivering an ML product involve multiple disciplines: delivery planning, engineering, product, ML, security, data, and so on. We use the term delivery here to refer to the delivery planning aspects of how we build ML solutions.

The delivery discipline focuses primarily on the shaping, sizing, and sequencing of work in three horizons (from near to far): user stories or features, iterations, and releases. It also pertains to how our teams operate and encompasses:

  • Team shapes

  • Ways of working (e.g., standups and retrospectives)

  • Team health (e.g., morale and psychological safety)

  • Delivery risk management

Lean recognizes that talent is an organization’s most valuable asset, and the delivery discipline reinforces that belief by creating structures that minimize impediments in our systems of work and amplify each teammate’s contributions and collective ownership. When done right, delivery practices can help us reduce waste and improve the flow of value.

Delivery is an often overlooked but highly critical aspect of building ML products. If we get all the other disciplines right but neglect delivery, we will likely be unable to deliver our ML product to users in a timely and reliable manner (we will explain why in a moment). This can lead to decreased customer satisfaction, eroded competitiveness, missed opportunities, and ultimately, failure to achieve the desired business outcomes.

Let’s look at some fundamental delivery practices.

Vertically sliced work

A common pitfall in ML delivery is the horizontal slicing of work, where we sequentially deliver functional layers of a technical solution—e.g., data lake, ML platform, ML models, UX interfaces—from the bottom-up. This is a risky delivery approach because customers can only experience the product and provide valuable feedback after months and even years of significant engineering investment. In addition, horizontal slicing naturally leads to late integration issues when horizontal slices come together, increasing the risk of release delays.

To mitigate this, we can slice work and stories vertically. A vertically sliced story refers to a story that is defined as an independently shippable unit of value, which contains all of the necessary functionality from the user-facing aspects (e.g., a frontend) to the more backend-ish aspects (e.g., data pipelines, ML models). Your definition of “user-facing” will differ depending on who your users are. For example, if you are a platform team delivering an ML platform product for data scientists, the user-facing component may be a command-line tool instead of a frontend application.

The principle of vertical slicing applies more broadly beyond individual features as well. This is what vertical slicing looks like, in the three horizons of the delivery discipline:

  • At the level of a story, we articulate and demonstrate business value in each story.

  • At the level of an iteration, we plan and prioritize stories that cohere to achieve a tangible outcome.

  • At the level of a release, we plan, sequence, and prioritize a collection of stories that is focused on creating demonstrable business value.

Vertically sliced teams, or cross-functional teams

Another common pitfall in ML delivery is splitting teams by function, for example by having data science, data engineering, and product engineering in separate teams. This structure leads to two main problems. First, teams inevitably get caught in backlog coupling, which is the scenario where one team depends on another team to deliver a feature. In one informal analysis, backlog coupling increased the time to complete a task by an average of 10 to 12 times.

The second problem is the manifestation of Conway’s Law, which is the phenomenon where teams design systems and software that mirror their communication structure. For example, we have seen a case where two teams working on the same product built two different solutions to solve the same problem of serving model inferences at low latency. That is Conway’s Law at work. The path of least resistance steers teams toward finding local optimizations rather than coordinating shared functionality.

We can mitigate these problems for a given product by identifying the capabilities that naturally cohere for the product and building a cross-functional team around the product—from the frontend elements (e.g., experience design, UI design) to backend elements (e.g., ML, MLOps, data engineering). This practice of building multidisciplinary teams has sometimes been described as the Inverse Conway Maneuver. This brings four major benefits:

Improves speed and quality of decision making

The shared context and cadence reduce the friction of discussing and iterating on all things (e.g., design decisions, prioritization calls, assumptions to validate). Instead of having to coordinate a meeting between multiple teams, we can just discuss an issue using a given team’s communication channels (e.g., standup, huddles, chat channels).

Reduces back-and-forth handoffs and waiting

If the slicing is done right, the cross-functional team should be autonomous—that means the team is empowered to design and deliver features and end-to-end functionality without depending on or waiting on another team.

Reduces blind spots through diversity

Having a diverse team with different capabilities and perspectives can help ensure that the ML project is well-rounded and takes into account all of the relevant considerations. For example, a UX designer could create prototypes to test and refine ideas with customers before we invest significant engineering effort in ML.

Reduces batch size

Working in smaller batches has many benefits and is one of the core principles of Lean software delivery. As described in Donald Reinertsen’s Principles of Product Development Flow (Celeritas), smaller batches enable faster feedback, lower risk, less waste, and higher quality.

The first three benefits of cross-functional teams—improved communication and collaboration, minimized handoffs, diverse expertise—enable teams to reduce batch size. For example, instead of needing to engineer and gold-plate a feature before it can be shared more widely for feedback, a cross-functional team already contains the necessary product and domain knowledge to provide that feedback (or, if not, they would at least know how to devise cost-effective ways to find the answers).

Cross-functional teams are not free from problems either. There is a risk that each product team develops its own idiosyncratic solution to problems that occur repeatedly across products. We think, however, with the right engineering practices, that this is a higher quality problem than the poor flow that results from functionally siloed teams. Additionally, there are mitigations to help align product teams including communities of practice, platform teams, and so on. We’ll discuss these in depth in Chapter 11.

That said, we have seen functionally specialized teams deliver effectively in collaboration when there is strong agile program management that provides a clear, current picture of end-to-end delivery and product operations, along with collective guidelines for working sustainably and improving overall system health.

There is no one-size-fits-all team shape and the right team shapes and interaction modes for your organization depend on many factors, which will evolve over time. In Chapter 11, we discuss varied team shapes and how the principles of Team Topologies can help you identify suitable team shapes and interaction modes for ML teams in your organization.

Ways of Working

Ways of Working (WoW) refers to the processes, practices, and tools that a team uses to deliver product features. It includes, but is not limited to, agile ceremonies (e.g., standups, retros, feedback), user story workflow (e.g., Kanban, story kickoffs, pair programming, desk checks8), and quality assurance (e.g., automated testing, manual testing, “stopping the line” when defects occur).

One common trap that teams fall into is to follow the form of these WoW practices but miss their substance or intent. When we don’t understand and practice WoW as a coherent whole, it can often be counterproductive. For example, teams could run standups but miss the intent of making work visible, as teammates hide behind generic updates (“I worked on X yesterday and will continue working on it today”). Instead, each of these WoW practices should help the team share context-rich information (e.g., “I’m getting stuck on Y” and “Oh, I’ve faced that recently and I know a way to help you”). This improves shared understanding, creates alignment, and provides each team member with information that improves their flow of value.

Measuring delivery metrics

One often-overlooked practice—even in agile teams—is capturing delivery metrics (e.g., iteration velocity, cycle time, defect rates) over time. If we think of the team as a production line (producing creative solutions, and not cookie cutter widgets), these metrics can help us regularly monitor delivery health and raise flags when we’re veering off track from the delivery plan or timelines.

Teams can and should also measure software delivery performance with the four key metrics: delivery lead time, deployment frequency, mean time to recovery (MTTR), and change failure rate. In Accelerate (IT Revolution Press), which is based on four years of research and statistical analysis on technology organizations, the authors found that software delivery performance (as measured by the four key metrics) correlated with an organization’s business outcomes and financial performance. Measuring the four key metrics helps us ensure a steady and high-quality flow in our production line.
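As a rough illustration (not a prescribed tool), here is how a team might compute the four key metrics from a handful of deployment records; the record structure and numbers are hypothetical and would in practice come from your CI/CD and incident-management systems.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical deployment records exported from CI/CD and incident tooling.
deployments = [
    {"committed_at": datetime(2023, 5, 1, 9), "deployed_at": datetime(2023, 5, 2, 15),
     "failed": False, "restored_at": None},
    {"committed_at": datetime(2023, 5, 3, 10), "deployed_at": datetime(2023, 5, 3, 17),
     "failed": True, "restored_at": datetime(2023, 5, 3, 19)},
    {"committed_at": datetime(2023, 5, 8, 11), "deployed_at": datetime(2023, 5, 9, 9),
     "failed": False, "restored_at": None},
]

lead_times = [d["deployed_at"] - d["committed_at"] for d in deployments]
failures = [d for d in deployments if d["failed"]]
recovery_times = [d["restored_at"] - d["deployed_at"] for d in failures]

period_days = 30  # reporting window for deployment frequency
print("Delivery lead time (median):", median(lead_times))
print("Deployment frequency:", len(deployments) / period_days, "per day")
print("Change failure rate:", len(failures) / len(deployments))
print("Mean time to recovery:", sum(recovery_times, timedelta()) / len(recovery_times))
```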

The objective nature of these metrics helps to ground planning conversations in data and helps the team actually see (in quantitative estimates) the work ahead and how well they are tracking toward their target. In an ideal environment, these metrics would be used purely for continuous improvement to help us improve our production line over time and meet our product delivery goals.

However, in other less-than-ideal environments, metrics can be misused, abused, gamed, and become ultimately dysfunctional. As Goodhart’s Law states, “when a measure becomes a target, it ceases to be a good measure.” Ensure that you’re measuring the right outcomes and continuously improving to find the appropriate metrics for your organization’s ML practice. We go into more detail on measuring team health metrics when we discuss the pitfalls of measuring productivity, and how to avoid them, in Chapter 10.

The third discipline: Engineering

Crucially, the rate at which we can learn, update our product or prototype based on feedback, and test again, is a powerful competitive advantage. This is the value proposition of Lean engineering practices.

Jez Humble, Joanne Molesky, and Barry O’Reilly in Lean Enterprise

All of the engineering practices we outline in this section focus on one thing: shortening feedback loops. The previous quote from Lean Enterprise articulates it well—an effective team is one that can rapidly make, test, and release the required changes—in code, data, or ML models.

Automated testing

In ML projects, it’s common to see heaps and heaps of code without automated tests. Without automated tests, changes become error-prone, tedious, and stressful. When we change one part of the codebase, the lack of tests forces us to take on the burden of manually testing the entire codebase to ensure that a change (e.g., in feature engineering logic) hasn’t caused a degradation (e.g., in model quality or API behavior in edge cases). This means an overwhelming amount of time, effort, and cognitive load is spent on non-ML work.

In contrast, comprehensive automated tests help teams accelerate experimentation and reduce cognitive load. They give us fast feedback on changes and let us know whether everything is still working as expected. In practice, this can make a night-and-day difference in how quickly we can execute on our ideas and get stories done properly.

Effective teams are those that welcome and can respond to valuable changes in various aspects of a product: new business requirements, feature engineering strategies, modeling approaches, training data, among others. Automated tests enable such responsiveness and reliability in the face of these changes. We’ll introduce techniques for testing ML systems in Chapters 5 and 6.
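As a taste of what that looks like in practice, here is a minimal, hypothetical sketch of unit tests for a piece of feature engineering logic (bucket_age is a made-up transform): small, fast tests that document expected behavior—including edge cases—and fail immediately when a change breaks it.

```python
# test_features.py — a sketch of unit tests for feature engineering logic
import pytest


def bucket_age(age: float) -> str:
    """Hypothetical feature transform: map raw age into coarse buckets."""
    if age < 0:
        raise ValueError("age cannot be negative")
    if age < 18:
        return "minor"
    if age < 65:
        return "adult"
    return "senior"


@pytest.mark.parametrize(
    "age, expected",
    [(0, "minor"), (17.9, "minor"), (18, "adult"), (64, "adult"), (65, "senior")],
)
def test_bucket_age_boundaries(age, expected):
    assert bucket_age(age) == expected


def test_bucket_age_rejects_negative_values():
    with pytest.raises(ValueError):
        bucket_age(-1)
```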

Refactoring

The second law of thermodynamics tells us that the universe tends toward disorder, or entropy. Our codebases—ML or otherwise—are no exception. With every “quick hack” and every feature delivered without conscious effort to minimize entropy, the codebase grows more convoluted and brittle. This makes the code increasingly hard to understand and, consequently, modifying code becomes painful and error-prone.

ML projects that lack automated tests are especially susceptible to exponential complexity because, without automated tests, refactoring can be tedious to test and is highly risky. Consequently, refactoring becomes a significant undertaking that gets relegated to the backlog graveyard. As a result, we create a vicious cycle for ourselves and it becomes increasingly difficult for ML practitioners to evolve their ML solutions.

In an effective team, refactoring is so safe and easy that we can do some of it as part of feature delivery, not as an afterthought (see the short example after this list). Such teams are typically able to do this for three reasons:

  • They have comprehensive tests that give them fast feedback on whether a refactoring preserved behavior.

  • They’ve configured their code editor and leveraged the ability of modern code editors to execute refactoring actions (e.g., rename variables, extract function, change signature).

  • The amount of technical debt and/or workload is at a healthy level. Instead of feeling crushed by pressure, they have the capacity to refactor where necessary as part of feature delivery to improve the readability and quality of the codebase.
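Here is a small, hypothetical example of the kind of behavior-preserving refactoring that these conditions make cheap: a long preprocessing function is split into intention-revealing helpers, and the same test passes before and after the change.

```python
# Before: one function that mixes validation, encoding, and scaling.
def preprocess(rows):
    out = []
    for r in rows:
        if r.get("age") is None or r["age"] < 0:
            continue
        plan = {"basic": 0, "premium": 1}.get(r.get("plan", "basic"), 0)
        out.append({"age": r["age"] / 100.0, "plan": plan})
    return out


# After: the same behavior, extracted into intention-revealing helpers.
def is_valid(row):
    return row.get("age") is not None and row["age"] >= 0


def encode_plan(row):
    return {"basic": 0, "premium": 1}.get(row.get("plan", "basic"), 0)


def preprocess_refactored(rows):
    return [{"age": r["age"] / 100.0, "plan": encode_plan(r)} for r in rows if is_valid(r)]


# The same test passes against both versions, so we know behavior is preserved.
def test_preprocess_behavior_is_preserved():
    rows = [{"age": 30, "plan": "premium"}, {"age": -1}, {"age": None}]
    expected = [{"age": 0.3, "plan": 1}]
    assert preprocess(rows) == expected
    assert preprocess_refactored(rows) == expected
```

With tests like this in place, such a refactoring takes minutes instead of becoming a backlog item.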

Code editor effectiveness

As alluded to in the previous point, modern code editors have many powerful features that can help contributors write code more effectively. The code editor can take care of low-level details so that our cognitive capacity remains available for solving higher-level problems.

For example, instead of renaming variables through a manual search and replace, the code editor can rename all references to a variable in one shortcut. Instead of manually searching for the syntax to import a function (e.g., cross_val_score()), we can hit a shortcut and the IDE can automatically import the function for us.

When configured properly, the code editor becomes a powerful assistant (even without AI coding technologies) and can allow us to execute our ideas, solve problems, and deliver value more effectively.

Continuous delivery for ML

Wouldn’t it be great if there was a way to help ML practitioners reduce toil, speed up experimentation, and build high-quality products? Well, that’s exactly what continuous delivery for ML (CD4ML) helps teams do. CD4ML is the application of continuous delivery principles and practices to ML projects. It enables teams to shorten feedback loops and establish quality controls to ensure that software and ML models are high quality and can be safely and efficiently deployed to production.

Research from Accelerate shows that continuous delivery practices help organizations achieve better technical and business performance by enabling teams to reliably deliver value and to nimbly respond to changes in market demands. This is corroborated by our experience working with ML teams. CD4ML has helped us improve our velocity, responsiveness, cognitive load, satisfaction, and product quality.

We’ll explore CD4ML in detail in Chapter 9. For now, here’s a preview of its technical components (see Figure 1-5):

  • Reproducible model training, evaluation, and experimentation

  • Model serving

  • Testing and quality assurance

  • Model deployment

  • Model monitoring and observability

Figure 1-5. The end-to-end CD4ML process (source: adapted from an image in “Continuous Delivery for Machine Learning”)
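To give a feel for how these components hang together, here is a heavily simplified, hypothetical sketch of a CD4ML-style pipeline in Python: each stage is a plain function, and a quality gate decides whether a newly trained model gets promoted. A real implementation would plug into your CI/CD tooling, experiment tracking, and model registry of choice.

```python
# A skeletal CD4ML sketch: train -> evaluate -> quality gate -> deploy (all simplified).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

QUALITY_THRESHOLD = 0.80  # agreed with stakeholders up front, not picked after the fact


def train(X_train, y_train):
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    return model.fit(X_train, y_train)


def evaluate(model, X_test, y_test):
    return accuracy_score(y_test, model.predict(X_test))


def deploy(model, target="preproduction"):
    # Stand-in for packaging and deploying the model artifact (e.g., via a model registry).
    print(f"Deploying {type(model).__name__} to {target}")


def run_pipeline():
    # Synthetic data stands in for your versioned training data.
    X, y = make_classification(n_samples=1_000, n_features=10, n_informative=5, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = train(X_train, y_train)
    score = evaluate(model, X_test, y_test)

    if score >= QUALITY_THRESHOLD:  # the quality gate
        deploy(model)
    else:
        raise SystemExit(f"Model below quality bar: {score:.2f} < {QUALITY_THRESHOLD}")


if __name__ == "__main__":
    run_pipeline()
```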

The fourth discipline: ML

The ML discipline involves more than knowing how to train, select, improve, deploy, and consume ML models. It also encompasses competencies such as ML problem framing, ML systems design, designing for explainability, reliability, Responsible AI, and ML governance, among other things.

Framing ML problems

In early and exploratory phases of ML projects, it’s usually unclear what problem we should be solving, who we are solving it for, and most importantly, why we should solve it. In addition, it may not be clear what ML paradigm or model architectures can help us—or even what data we have or need—to solve the problem. That is why it is important to frame ML problems, to structure and execute ideas, and to validate hypotheses with the relevant customers or stakeholders. The saying “a problem well-defined is a problem half-solved” resonates well in this context.

There are various tools that can help us frame ML problems, such as the Data Product Canvas, which we referenced earlier in this chapter. Another tool that helps us articulate and test our ideas in rapid cycles and keep track of learnings over time is the Hypothesis Canvas (see Figure 1-6).9 The Hypothesis Canvas helps us formulate testable hypotheses, articulate why an idea might be valuable and who will benefit from it, and steer toward objective metrics for validating or invalidating ideas. It is yet another way to shorten feedback loops by running targeted, timeboxed experiments. We’ll keep our discussion short here, as we’ll discuss these canvases in detail in the next chapter.

Figure 1-6. The Hypothesis Canvas helps us formulate testable ideas and know when we’ve succeeded (source: “Data-Driven Hypothesis Development” by Jo Piechota and May Xu, used with permission)

ML systems design

There are many parts to designing ML systems, such as collecting and processing the data needed by the model, selecting the appropriate ML approach, evaluating the performance of the model, considering access patterns and scalability requirements, understanding ML failure modes, and identifying model-centric and data-centric strategies for iteratively improving the model.

Chip Huyen’s Designing Machine Learning Systems (O’Reilly) covers this topic in depth, and we encourage you to read it if you haven’t already. Given that this literature exists, our book won’t go into the details of concepts already covered in Designing ML Systems.

Responsible AI and ML governance

MIT Sloan Management Review has a succinct and practical definition of Responsible AI:

A framework with principles, policies, tools, and processes to ensure that AI systems are developed and operated in the service of good for individuals and society while still achieving transformative business impact.

MIT Sloan’s “2022 Responsible AI Global Executive Report” found that while AI initiatives are surging, Responsible AI is lagging. Of the companies surveyed, 52% are engaged in some Responsible AI practices, but 79% of those say their implementations are limited in scale and scope. While respondents recognize that Responsible AI is crucial for addressing AI risks, such as safety, bias, fairness, and privacy issues, they admit that they have not prioritized it. This gap increases the chances of negative consequences for their customers and exposes the business to regulatory, financial, and customer satisfaction risks.

If Responsible AI is the proverbial mountaintop, teams often fail to get there with only a compass. They also need a map, paths, guidance, and a means of transport. This is where ML governance comes in: it is a key mechanism that teams can use to achieve Responsible AI objectives, among others.

ML governance involves a wide range of processes, policies, and practices aimed at helping practitioners deliver ML products responsibly and reliably. It spans the ML delivery lifecycle, playing a role in each of the following stages:

Model development

Guidelines, best practices, and golden paths for developing, testing, documenting, and deploying ML models

Model evaluation

Methods for assessing model performance, identifying biases, and ensuring fairness before deployment (see the sketch following this list)

Monitoring and feedback loops

Systems to continuously monitor model performance, gather user feedback, and improve models

Mitigation strategies

Approaches to identify and mitigate biases in data and algorithms, to avoid negative and unfair outcomes

Explainability

Techniques and tools to explain a model’s behaviors under certain scenarios in order to improve transparency, build user trust, and facilitate error analysis

Accountability

Well-defined roles, responsibilities, and lines of authority; multidisciplinary teams capable of managing ML systems and risk-management processes

Regulatory compliance

Adherence to legal and industry-specific regulations or audit requirements regarding the use of data and ML

Data-handling policies

Guidelines for collecting, storing, and processing data to ensure data privacy and security

User consent and privacy protection

Measures to obtain informed consent from users and safeguard their privacy

Ethical guidelines

Principles to guide ML development and use, considering social impact, human values, potential risks, and possibilities of harm
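As a concrete illustration of the model evaluation stage above, the following sketch computes recall separately for each group in an evaluation set and flags the model when the gap between groups is too large. The column names, grouping attribute, and disparity threshold are hypothetical; a real governance process would define the protected attributes and the acceptable disparity for the specific use case:

# Illustrative per-group evaluation; the column names ("age_band", "label",
# "prediction") and the disparity threshold are assumptions for this sketch.
import pandas as pd
from sklearn.metrics import recall_score


def recall_by_group(df: pd.DataFrame, group_col: str = "age_band") -> dict:
    """Compute recall separately for each group in the evaluation set."""
    return {
        group: recall_score(rows["label"], rows["prediction"])
        for group, rows in df.groupby(group_col)
    }


def exceeds_disparity_threshold(df: pd.DataFrame, max_gap: float = 0.10) -> bool:
    """Flag the model if recall differs across groups by more than max_gap."""
    per_group = recall_by_group(df)
    return max(per_group.values()) - min(per_group.values()) > max_gap

A check like this can sit alongside the aggregate evaluation metrics in the same pipeline, so that fairness issues surface before deployment rather than after.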

While “governance” typically has bureaucratic connotations, we’ll demonstrate in Chapter 9 that ML governance can be implemented in a lean and lightweight fashion. In our experience, continuous delivery and Lean engineering complement governance by establishing safe-to-fail zones and feedback mechanisms. Taken together, these practices not only help teams reduce risk and avoid negative consequences but also help them innovate and deliver value.

In Chapter 9, we will also share other helpful resources for ML governance, such as the “Responsible Tech Playbook” and Google Model Cards.
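Model cards are one lightweight governance artifact that teams can adopt early. The sketch below shows one possible way to capture a trimmed-down set of model card fields as code so that the card can be versioned and rendered alongside the model it describes; the fields and values are hypothetical and do not represent the full schema proposed by Google’s Model Cards work:

# A hypothetical, trimmed-down model card captured as code so it can be
# versioned and rendered alongside the model it describes.
from dataclasses import dataclass, field, asdict
import json


@dataclass
class ModelCard:
    name: str
    version: str
    intended_use: str
    out_of_scope_uses: list = field(default_factory=list)
    evaluation_data: str = ""
    known_limitations: list = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


card = ModelCard(
    name="loan-default-classifier",
    version="2024.03.1",
    intended_use="Rank applications for manual review, not automated rejection",
    out_of_scope_uses=["fully automated credit decisions without human review"],
    evaluation_data="Held-out 2023 applications, stratified by region",
    known_limitations=["Applicants under 21 are underrepresented in training data"],
)
print(card.to_json())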

The fifth discipline: Data

As many ML practitioners know, the quality of our ML models depends on the quality of our data. If the data in our training sample is biased (as compared to the distribution of the population dataset), then the model will learn and perpetuate the bias. As Patrick K. Lin eloquently put it, “when today’s technology relies on yesterday’s data, it will simply mirror our past mistakes and biases.”10

To deliver better ML solutions, teams can consider the following practices in the data discipline.

Closing the data collection loop

As we train and deploy models, our ML system design should also take into consideration how we will collect and curate the model’s predictions in production, so that we can label them and grow a high-quality ground truth dataset for evaluating and retraining models.

Labeling can be a tedious activity and is often the bottleneck. When it is, we can consider how to scale labeling through techniques such as active learning, self-supervised learning, and weak supervision. If natural labels—ground truth labels that can be automatically or partially evaluated—are available for our ML task, we should also design software and data ingestion pipelines that stream in the natural labels as they become available, alongside the associated features for the given data points.
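To sketch what streaming in natural labels can look like in practice, the snippet below joins logged predictions with ground truth that arrives later (for example, whether a recommended item was actually purchased). The table and column names are assumptions for illustration rather than a prescribed schema:

# Illustrative sketch: join logged predictions with natural labels that
# arrive later. Table and column names are assumptions, not a schema.
import pandas as pd

# Each prediction is logged with a stable identifier and its input features.
predictions = pd.DataFrame(
    {
        "request_id": ["a1", "a2", "a3"],
        "features": [{"price": 10.0}, {"price": 25.0}, {"price": 7.5}],
        "predicted_label": [1, 0, 1],
    }
)

# Natural labels often arrive hours or days later via a separate pipeline.
natural_labels = pd.DataFrame(
    {"request_id": ["a1", "a3"], "actual_label": [1, 0]}
)

# Keep only the data points whose ground truth is known so far; the result
# feeds model evaluation and the next retraining cycle.
labeled = predictions.merge(natural_labels, on="request_id", how="inner")
print(labeled[["request_id", "predicted_label", "actual_label"]])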

When collecting natural labels, we must also consider how to mitigate the risks of data poisoning attacks (more on this shortly) and dangerous runaway feedback loops, where the model’s biased predictions have an effect on the real world, which further entrenches the bias in the data and subsequent models.

Teams often focus on the last mile of ML delivery—with a skewed focus on getting a satisfactory model out the door—and neglect to close the data collection loop in preparation for the next cycle of model improvement. When this happens, they forgo the opportunity to improve ML models through data-centric approaches.

Let’s look at the final practice for this chapter: data security and privacy.

Data security and privacy

As mentioned earlier in this chapter, data security and privacy are cross-cutting concerns that should be the responsibility of everyone in the organization, from product teams to data engineering teams and every team in between. An organization can safeguard data by practicing defense in depth, where multiple layers of security controls are placed throughout a system. For example, in addition to protecting data in transit and at rest through encryption and access controls, teams can also apply the principle of least privilege and ensure that only authorized individuals and systems can access the data.
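As a small illustration of two of these layers, the sketch below encrypts a sensitive field at rest and applies a simple least-privilege check before decrypting it, using the Fernet API from the cryptography package. The role name and in-memory key are simplifications; a production system would typically rely on a managed key store and the platform’s access controls:

# Minimal sketch of two defense-in-depth layers: encryption at rest and a
# least-privilege check. The role name is illustrative; production systems
# should use a managed key store and platform-level access controls.
from cryptography.fernet import Fernet

AUTHORIZED_ROLES = {"fraud-model-trainer"}  # hypothetical role allowed to read this field

key = Fernet.generate_key()  # in practice, fetched from a secrets manager
fernet = Fernet(key)

# Encrypt before writing to storage (encryption at rest).
ciphertext = fernet.encrypt(b"customer_email=jane@example.com")


def read_sensitive_field(requesting_role: str, token: bytes) -> bytes:
    """Decrypt only for roles that genuinely need access (least privilege)."""
    if requesting_role not in AUTHORIZED_ROLES:
        raise PermissionError(f"Role {requesting_role!r} may not access this data")
    return fernet.decrypt(token)


print(read_sensitive_field("fraud-model-trainer", ciphertext))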

At an organizational level, there must be data governance and management guidelines that define and enforce clear policies to guide how teams collect, store, and use data. This can help ensure that data is used ethically and in compliance with relevant laws and regulations.

Give yourself some massive pats on the back because you’ve just covered a lot of ground on the interconnected disciplines that are essential for effectively delivering ML solutions!

Before we conclude this chapter, we’d like to highlight how these practices can serve as leading indicators for positive or undesirable outcomes. For example, if we don’t validate our product ideas with users early and often—we know how this movie ends—we are more likely to invest lots of time and effort into building the wrong product. If we don’t have cross-functional teams, we are going to experience backlog coupling as multiple teams coordinate and wait on each other to deliver a change to users.

This is not just anecdotal. In a scientific study of the performance and effectiveness of technology businesses, spanning more than 2,800 organizations, the authors found that organizations that adopt practices such as continuous delivery, Lean, cross-functional teams, and generative cultures exhibit higher levels of performance: faster delivery of features, lower failure rates, and higher levels of employee satisfaction.12 In other words, these practices can actually be predictors of an organization’s performance.

Conclusion

Let’s recap what we’ve covered in this chapter. We started by looking at common reasons why ML projects fail, and we compared what ML delivery looks like in low- and high-effectiveness environments. We then looked through a systems thinking lens to identify the disciplines that are required for effective ML delivery. We looked at how Lean helps us reduce waste and maximize value. Finally, we took a whirlwind tour of practices in each of the five disciplines (product, delivery, software engineering, ML, and data) that can help us deliver ML solutions more effectively.

From our interactions with various ML or data science teams across multiple industries, we continue to see a gap between the world of ML and the world of Lean software delivery. While that gap has narrowed in certain pockets—where ML teams could deliver excellent ML product experiences by adopting the necessary product, delivery, and engineering practices—the gulf remains wide for many teams (you can look at Dana’s experience in the low-effectiveness environment earlier in this chapter for signs of this gulf).

To close this gap, the ML community requires a paradigm shift—a fundamental change in approach or underlying assumptions—to see that building an ML-driven product is not just an ML and data problem. It is first and foremost a product problem, which is to say a product, engineering, and delivery problem, and it requires a holistic, multidisciplinary approach.

The good news is that you don’t have to boil the ocean or reinvent the wheel—in each discipline, there are principles and practices that have helped teams successfully deliver ML product experiences. In the remainder of this book, we will explore these principles and practices, and how they can improve our effectiveness in delivering ML solutions. We will slow down and elaborate on the principles and practices in a practical way, starting with product and delivery. There will be applicable practices, frameworks, and code samples that you can bring to your ML projects. We hope you’re strapped in and excited for the ride.

1 It’s worth noting that identifying the wrong customer problem to solve is not unique to ML, and any product is susceptible to this.

2 As this Gartner survey is a small one, comprising only 200 respondents, there’s likely to be high variance in the number of ML projects that never got delivered across regions, industries, and companies. Take the specific number with a grain of salt and try to relate it to your qualitative experience. Have you personally experienced or heard of ML projects that, even after months of investment, were never shipped to users?

3 Lean helpfully provides a nuanced classification of waste, also known as the “eight deadly wastes”, which enumerate common inefficiencies that can occur in the process of delivering value to customers. The three examples in this paragraph refer to overproduction, waiting, and defects, respectively. The remaining five types of waste are: transport, overprocessing, inventory, motion, and under-utilized talent.

4 Jez Humble, Joanne Molesky, and Barry O’Reilly, Lean Enterprise (Sebastopol: O’Reilly, 2014).

5 Humble et al., Lean Enterprise.

6 This parable was first told in David Bayles and Ted Orland’s book, Art & Fear (Image Continuum Press), and is based on a true story; the only difference is that the subject was photographs rather than ceramic pots. The teacher in the true story was Ted Orland, who was an assistant to Ansel Adams, the renowned American photographer and environmentalist.

7 Jeremy Jordan has written an excellent in-depth article describing how we can prototype and iterate on the user experience using design tools to communicate possible solutions.

8 A desk check refers to the practice of having a short (e.g., 15-minute) huddle with the team when a pair believes the development work for a feature is complete. Not everyone has to be there, but it helps to have the product, engineering, and quality lenses at the desk check. We find that a brief walk-through of the definition of done and of how the pair delivered the feature invites a focused and open discussion. It also saves team members from multiple instances of context switching and from waiting in a long-drawn-out back-and-forth conversation on a chat group.

9 The word “hypothesis” in this context is technically different from, but conceptually similar to, how it’s defined in statistics. Here, a hypothesis is a testable assumption that serves as a starting point for iterative experimentation and testing to determine the most effective solution to the problem.

10 Patrick K. Lin, Machine See, Machine Do: How Technology Mirrors Bias in Our Criminal Justice System (Potomac, MD: New Degree Press, 2021).

11 When we talk about Generative AI in the context of effective ML teams, we’re not talking about the use of generic chatbots or new productivity tools to help software delivery teams write code or user stories. We are talking about ML teams that are playing a role in building new systems that incorporate Generative AI technology.

12 Nicole Forsgren, Jez Humble, and Gene Kim, Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations (Upper Saddle River, NJ: Addison-Wesley, 2018).
