Chapter 4. From Continuous Delivery to Continuous Experimentation

In the introduction of this book, we touched upon the convergence of continuous delivery (CD) and experimentation driving “lean product development,” with feature flags being the foundational element powering this convergence.

Let’s explore the trends and practices that are driving this convergence.

CD emerged as a well-defined strategy among forward-thinking engineering teams, driven by the need for businesses to iterate rapidly on ideas. At the same time, product management teams were adopting lean product development concepts such as customer feedback loops and A/B testing. They were motivated by a simple problem: up to 90% of the ideas they took to market failed to make a difference to the business. Given this sobering statistic, the only way to run an effective product management organization was to iterate fast and let customer feedback inform which ideas to invest in.

Common elements began to emerge that connected these two trends in day-to-day software development and delivery: the need for rapid iteration, safe rollouts through gradual exposure of features, and telemetry to measure the impact of those features on customer experience. As a result, modern product development teams are beginning to treat CD and experimentation (a more general term for A/B testing) as two sides of the same coin. Core to both practices is the foundational technology of feature flags.

We can further illustrate this convergence through real-world examples of how teams at LinkedIn, Facebook, and Airbnb release every feature as an experiment and how every experiment is released through a flag. These teams have shown that the future of CD is to continuously experiment.

Furthermore, this convergence is now creating a need for tooling that can support this new paradigm of continuous feature experimentation.

Capabilities Your Flagging System Needs

In this section, we will cover some ways a feature flagging system can evolve to support experimentation.

Statistical Analysis of KPIs

A significant step in the evolution of a feature flagging system into an experimentation system is to tie feature flags to Key Performance Indicators (KPIs). This involves tracking user activity, building data ingestion pipelines, and investing in statistical analysis capabilities to measure KPIs within the treatment and control groups of an experiment (on or off for a feature flag). Statistically significant differences between the groups can be used to decide whether an experiment was successful and should continue ramping toward 100% of customers.

The anticipated outcome is that feature flags turn ideas into products quickly, and analytics from experimentation turn products into measurable outcomes.
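The comparison between treatment and control described in this section can be sketched with a two-proportion z-test on a conversion-style KPI. This is only one of many possible tests, chosen here for illustration. The function name, the counts, and the 0.05 threshold are all assumptions, not something the text prescribes.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_t, n_t, conv_c, n_c):
    """Return (lift, p_value) for treatment vs. control conversion rates.

    Assumes both groups are non-empty and the pooled rate is not 0 or 1.
    """
    p_t = conv_t / n_t
    p_c = conv_c / n_c
    # Pooled rate under the null hypothesis that the flag has no effect.
    pooled = (conv_t + conv_c) / (n_t + n_c)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_c))
    z = (p_t - p_c) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_t - p_c, p_value


# Hypothetical counts: users who converted / users exposed, per group.
lift, p_value = two_proportion_z_test(conv_t=600, n_t=5000,
                                      conv_c=500, n_c=5000)
# 0.05 is a conventional threshold, used here as an assumption.
keep_ramping = lift > 0 and p_value < 0.05
```

In practice the counts would come from the data ingestion pipeline, keyed by the treatment each user was served.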

Multivariate Flags

At its simplest, feature flagging is a binary concept: a flag is either on or off. Experimentation is binary in the same way: there is a treatment and a control. Treatment is the change we are testing, and control is the baseline for comparison. However, it is common for an experiment to compare multiple treatments against a control. For example, Facebook might want to experiment with multiple versions of its newsfeed ranking algorithm. You can enhance a feature flagging system to support this need by changing its interface from:

if (flags.isOn("newsfeed-algorithm")) {
   // show the feature
} else {
   // do not show the feature
}

to:

treatment = flags.getTreatment("newsfeed-algorithm");
if (treatment == "v1") {
   // show v1 of newsfeed algorithm
} else if (treatment == "v2") {
   // show v2 of newsfeed algorithm
} else {
   // show control for newsfeed algorithm
}
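As a rough Python sketch of the same idea, the following toy in-memory store exposes a multivariate `get_treatment` call and keeps the binary `is_on` check as a special case. The `Flags` class, the ranker functions, and the dispatch table are hypothetical names introduced for illustration. A real system would load flag state from a service rather than a dictionary.

```python
class Flags:
    """Toy in-memory flag store (an assumption for this sketch)."""

    def __init__(self, treatments):
        # Maps flag name -> name of the treatment currently served.
        self._treatments = dict(treatments)

    def get_treatment(self, flag_name):
        # Unknown flags fall back to "control", the safe baseline.
        return self._treatments.get(flag_name, "control")

    def is_on(self, flag_name):
        # The binary interface is a special case of the multivariate one.
        return self.get_treatment(flag_name) != "control"


def rank_control(posts):
    return list(posts)                    # baseline order


def rank_v1(posts):
    return sorted(posts)                  # stand-in for algorithm v1


def rank_v2(posts):
    return sorted(posts, reverse=True)    # stand-in for algorithm v2


RANKERS = {"control": rank_control, "v1": rank_v1, "v2": rank_v2}

flags = Flags({"newsfeed-algorithm": "v2"})
treatment = flags.get_treatment("newsfeed-algorithm")
# Unrecognized treatment names also fall back to the control ranker.
ranked = RANKERS.get(treatment, rank_control)([3, 1, 2])
```

A dispatch table like `RANKERS` keeps the calling code unchanged when a third or fourth treatment is added.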

Targeting

A simple feature flag is global in nature—it is either on or off for all users. However, experimentation requires more granular capabilities for targeting and ramping. On the targeting side, an experiment might need to be defined for a segment of customers for whom the feature will be turned on. Using the example of Facebook’s newsfeed, Facebook might want to experiment on a ranking algorithm for a particular group of users in a specific geographic location. To accommodate this need, a flagging system can evolve to accept customer targeting dimensions at runtime. This pseudocode will clarify:

treatment = flags.getTreatment("newsfeed-algorithm",
  {user: request.user, age: "35", locale: "U.S."})

In an ideal implementation, the flagging system should abstract the details of the dimensions away from the developer so that the developer simply has to call the following:

treatment = flags.getTreatment("newsfeed-algorithm",
  {user: request.user})
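One way to picture this abstraction is a flag client that resolves targeting dimensions itself, so the caller passes only the user. The sketch below is a minimal illustration under stated assumptions: `TargetedFlag`, the `profiles` lookup, and the attribute names are hypothetical, and the rules are simple equality checks.

```python
from dataclasses import dataclass, field


@dataclass
class TargetedFlag:
    treatment: str                              # served when all rules match
    rules: dict = field(default_factory=dict)   # attribute -> required value


class Flags:
    def __init__(self, flags, profiles):
        self._flags = flags          # flag name -> TargetedFlag
        self._profiles = profiles    # user id -> attribute dict

    def get_treatment(self, flag_name, user):
        flag = self._flags.get(flag_name)
        if flag is None:
            return "control"
        # The flag system, not the caller, looks up targeting dimensions.
        attrs = self._profiles.get(user, {})
        matches = all(attrs.get(k) == v for k, v in flag.rules.items())
        return flag.treatment if matches else "control"


# Hypothetical user attributes and a flag targeted at U.S. teenagers.
profiles = {
    "alice": {"age_group": "teen", "locale": "US"},
    "bob": {"age_group": "30s", "locale": "US"},
}
flags = Flags(
    {"newsfeed-algorithm": TargetedFlag("v1", {"age_group": "teen",
                                               "locale": "US"})},
    profiles,
)
alice_treatment = flags.get_treatment("newsfeed-algorithm", "alice")
bob_treatment = flags.get_treatment("newsfeed-algorithm", "bob")
```

Here `alice` matches every rule and receives `v1`, while `bob` and any user without a profile fall back to `control`.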

Randomized Sampling

To infer causality between a feature experiment and changes in KPIs, we need a treatment group and a control group. Treatment is the group exposed to the new feature or behavior; control is the group seeing baseline behavior. The only difference between these groups should be the feature itself. This practice is called controlling for bias. Using our recent Facebook example again, the treatment and control groups should both consist of teenagers from the United States. If the treatment algorithm is given to Australian teenagers while the control is given to men in the United States in their 30s, you cannot infer causality between the new algorithm and KPI changes, because any difference might stem from the demographic differences between the groups.

You can use a feature flag to serve this need by adding the ability to randomly give a feature to a percentage of customers. As an example, Facebook would update its feature flag to serve the new algorithm to 50% of randomly selected users of the target age group and geographic location, and the control algorithm to the remaining 50%. This percentage rollout is called randomized sampling.

The key point here is randomization. If two different Facebook experiments are both at 50/50 exposure across the same segment of users, the 50% of users seeing the treatment in one experiment should overlap only by chance with the 50% seeing the treatment in the other. This is possible only through randomization that is independent for each experiment.

Without randomization, one experiment can bias the results of the other, nullifying any causality between the feature and changes in KPIs.
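A common way to get stable, per-experiment randomization is to hash the experiment name together with the user ID. The text does not prescribe this technique, so treat the following as a minimal sketch under that assumption. The function names and experiment names are hypothetical.

```python
import hashlib


def bucket(experiment, user_id):
    """Map (experiment, user) to a stable number in [0, 100)."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # First 8 bytes as an integer; modulo 10_000 gives 0.01% granularity.
    return int.from_bytes(digest[:8], "big") % 10_000 / 100


def get_treatment(experiment, user_id, treatment_pct):
    # The same user always lands in the same bucket for a given experiment.
    in_treatment = bucket(experiment, user_id) < treatment_pct
    return "treatment" if in_treatment else "control"


users = [f"user-{i}" for i in range(10_000)]
in_a = {u for u in users if get_treatment("exp-a", u, 50) == "treatment"}
in_b = {u for u in users if get_treatment("exp-b", u, 50) == "treatment"}

share_a = len(in_a) / len(users)            # expected to be near 0.50
overlap = len(in_a & in_b) / len(users)     # expected to be near 0.25
```

Because the experiment name is part of the hash input, a user's assignment in one experiment says nothing about their assignment in another, so the two treatment groups overlap only by chance.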

Version History

A feature flag’s historical state is usually unimportant. What matters is that the flag is either on or off at any given moment. Version history, however, is important for experiments. Let’s assume that we run a 30/70 experiment across all Facebook users (i.e., 30% in treatment and 70% in control). If we change the experiment to 50/50, any KPI impact measured in the 30/70 state is statistically invalid for the 50/50 state.

This means that a feature flagging system should keep a versioned history of changes to the configuration of a flag, so that statistical analysis of the KPI impact can respect version boundaries. Practically speaking, you can achieve this by pushing the version history of feature flags into the analytics system serving experimentation needs.
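To illustrate respecting version boundaries, the sketch below attributes each KPI event to the flag version that was in effect when the event occurred. `FlagVersion`, the timestamps, and the event tuples are hypothetical. A real analytics system would join on whatever version identifier the flagging system emits.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class FlagVersion:
    version: int
    effective_from: int    # Unix timestamp when this config went live
    treatment_pct: int


def version_at(history, timestamp):
    """Return the FlagVersion in effect at timestamp, or None if none was.

    Assumes history is sorted by effective_from.
    """
    starts = [v.effective_from for v in history]
    index = bisect_right(starts, timestamp) - 1
    return history[index] if index >= 0 else None


history = [
    FlagVersion(version=1, effective_from=1_000, treatment_pct=30),
    FlagVersion(version=2, effective_from=2_000, treatment_pct=50),
]

# Hypothetical KPI events as (timestamp, user_id) pairs.
events = [(1_200, "u1"), (1_999, "u2"), (2_000, "u3"), (2_500, "u4")]

by_version = {}
for timestamp, user_id in events:
    flag_version = version_at(history, timestamp)
    if flag_version is not None:
        by_version.setdefault(flag_version.version, []).append(user_id)
```

Each group in `by_version` can then be analyzed separately, so measurements from the 30/70 configuration are never mixed with those from the 50/50 configuration.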
