Chapter 4. Best Practices

In this chapter, we look at a few best practices to consider when building your experimentation platform.

Running A/A Tests

To ensure the validity and accuracy of your statistics engine, it is critical to run A/A tests. In an A/A test, both the treatment and control variants are served the same feature, confirming that the engine is statistically fair and that the implementations of the targeting and telemetry systems are unbiased.

When drawing random samples from the same distribution, as we do in an A/A test, the p-value for the difference between the samples should be uniformly distributed between 0 and 1. Across a large number of A/A tests, a statistically significant difference should therefore appear at a rate that matches the platform’s acceptable type I error rate (α).

Just as a sufficient sample size is needed to evaluate an experimental metric, evaluating the experimentation platform itself requires many A/A tests. If a single A/A test returns a false positive, it is unclear whether this reflects an error in the system or whether you were simply unlucky. With a standard 5% α, a run of 100 A/A tests might see anywhere between 1 and 9 false positives without any cause for alarm.
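As a sanity check on the stats engine itself, you can simulate this behavior offline. The sketch below (Python with NumPy and SciPy; the metric distribution, sample sizes, and number of simulated tests are illustrative assumptions) repeatedly draws both variants from the same distribution, then verifies that the p-values look uniform and that the false positive rate stays near α:

# Simulate repeated A/A tests: both "variants" come from the same
# distribution, so p-values should be roughly uniform on [0, 1] and the
# false positive rate should stay near alpha.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)
ALPHA = 0.05
NUM_TESTS = 1_000          # number of simulated A/A tests
USERS_PER_VARIANT = 5_000  # illustrative sample size per variant

p_values = []
for _ in range(NUM_TESTS):
    control = rng.normal(loc=10.0, scale=2.0, size=USERS_PER_VARIANT)
    treatment = rng.normal(loc=10.0, scale=2.0, size=USERS_PER_VARIANT)  # same distribution
    _, p = stats.ttest_ind(control, treatment)
    p_values.append(p)

false_positive_rate = sum(p < ALPHA for p in p_values) / NUM_TESTS
print(f"false positive rate: {false_positive_rate:.3f} (expected ~{ALPHA})")

# One way to check that p-values are spread evenly across [0, 1] is a
# Kolmogorov-Smirnov test against the uniform distribution.
_, ks_p = stats.kstest(p_values, "uniform")
print(f"KS test of p-value uniformity: p = {ks_p:.3f}")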

A failing A/A test suite can have a number of causes: for example, an error in randomization, telemetry, or the stats engine. Each of these components should have its own debugging metrics so that you can quickly pinpoint the source of the failure.
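As an example of one such debugging metric for randomization and telemetry, a sample ratio mismatch (SRM) check flags experiments whose observed traffic split deviates from the configured split. A minimal version is sketched below; the counts, expected split, and alert threshold are illustrative assumptions:

# Sample ratio mismatch (SRM) check: if users were meant to be split 50/50
# but the observed counts deviate significantly from that ratio, the
# assignment or telemetry pipeline is suspect.
from scipy import stats

def sample_ratio_ok(control_users: int, treatment_users: int,
                    expected_split: float = 0.5, alpha: float = 0.001) -> bool:
    """Return True if the observed counts are consistent with the expected split."""
    total = control_users + treatment_users
    expected = [total * expected_split, total * (1 - expected_split)]
    _, p = stats.chisquare([control_users, treatment_users], f_exp=expected)
    return p >= alpha  # a very small p-value signals a likely mismatch

print(sample_ratio_ok(50_300, 49_700))  # True: consistent with a 50/50 split
print(sample_ratio_ok(52_000, 48_000))  # False: likely randomization or logging bug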

On a practical level, consider keeping a dummy A/A test running at all times so that a degradation caused by changes to the platform is caught immediately. For a more in-depth discussion, refer to the research paper from Yahoo! by Zhenyu Zhao et al.1

Understanding Power Dynamics

Statistical power measures an experiment’s ability to detect an effect when there is an effect to be detected. Formally, the power of an experiment is the probability of rejecting a false null hypothesis.

As an experiment is exposed to more users, its power to detect a fixed difference in the metric increases. As a rule of thumb, you should fix the minimum detectable effect and the power threshold for an experiment and use them to derive the sample size, that is, the number of users the experiment needs. A good threshold for power is 80%. With a fixed minimum detectable effect, the experimentation platform can tell its users how long to wait, or how many more users are needed, before they can be confident in the results of the experiment.
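As a rough sketch of that derivation, the following function computes a per-variant sample size for a conversion-rate metric from the baseline rate, an absolute minimum detectable effect, α, and the power threshold, using the standard normal-approximation formula for comparing two proportions (the baseline rate and effect in the example are illustrative assumptions):

# Required users per variant to detect an absolute lift of `mde` on a
# conversion-rate metric with a two-sided test.
from scipy import stats

def sample_size_per_variant(baseline_rate: float, mde: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    p1 = baseline_rate
    p2 = baseline_rate + mde
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_power = stats.norm.ppf(power)          # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int(round(variance * (z_alpha + z_power) ** 2 / mde ** 2))

# Example: detecting a 0.5 percentage-point lift on a 10% baseline rate.
print(sample_size_per_variant(baseline_rate=0.10, mde=0.005))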

An underpowered experiment might show no statistically significant impact on a metric when in reality there was a negative impact.

Executing an Optimal Ramp Strategy

Every successful experiment goes through a ramp process, starting at 0% of users in treatment and ending at 100%. Many teams struggle with two questions: how many steps should the ramp have, and how long should we spend on each step? Taking too many steps, or staying too long at any step, slows down innovation. Taking big jumps, or not spending enough time at a step, can lead to suboptimal outcomes.

The experimentation team at LinkedIn has suggested a useful framework to answer this question.2 An experimentation platform is about making better product decisions. As such, they suggest the platform should balance three competing objectives:

Speed

How quickly can we determine whether an experiment was successful?

Quality

How do we quantify the impact of an experiment to make better trade-offs?

Risk

How do we reduce the possibility of bad user experience due to an experiment?

At LinkedIn, this is referred to as the SQR framework. The company envisions dividing a ramp into four distinct phases:

Debugging phase

This first phase of the ramp is aimed at reducing the risk of obvious bugs or a bad user experience. If there is a UI component, does it render correctly? Can the system take the load of the treatment traffic? The goal of this phase is not to make a decision but to limit risk; therefore, there is no need to wait at this phase for statistical significance. Ideally, a few quick ramps (to 1%, 5%, or 10% of users), each lasting a day, should be sufficient for debugging.

Maximum power ramp phase

After you are confident that the treatment is not risky, the goal shifts to decision making. The ideal next ramp step to facilitate quick and decisive decision making is the maximum power ramp (MPR). Xu and her coauthors suggest, “MPR is the ramp that gives the most statistical power to detect differences between treatment and control.” For a two-variant experiment (treatment and control), a 50/50 split of all users is the MPR; for a three-variant experiment (two treatments and control), the MPR is a 33/33/34 split. You should spend at least a week on this step of the ramp to collect enough data on the treatment’s impact. (A short sketch following this list illustrates why the even split maximizes power.)

Scalability phase

The MPR phase tells us whether the experiment was successful. If it was, we can ramp directly to 100% of users. However, at any nontrivial scale, there might be concerns about the ability of your system to handle 100% of users in treatment. To resolve these operational scalability concerns, you can optionally ramp to 75% of users and stay there for one cycle of peak traffic to be confident that your system will continue to perform well.

Learning phase

The experiment can be successful, but you might still want to understand the long-term impact of the treatment on users. For instance, if you are dealing with ads, did the experiment lead to long-term ad blindness? You can address these “learning” concerns by maintaining a hold-out set of 5% of users who are not given the treatment for a prolonged period, at least a month. This hold-out set can then be used to measure long-term impact. The key is to have clear learning objectives rather than keeping a hold-out set for its own sake.
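One way to implement such a hold-out, sketched below under the assumption that users are bucketed by hashing a user ID together with a hold-out-specific salt, is to deterministically exclude a fixed 5% of users for the duration of the learning period (the salt and user ID format are hypothetical):

# Deterministic 5% hold-out: hashing the user ID with a hold-out-specific
# salt keeps membership stable for the whole learning period.
import hashlib

HOLDOUT_SALT = "ads-longterm-holdout"  # hypothetical hold-out identifier
HOLDOUT_PERCENT = 5

def in_holdout(user_id: str) -> bool:
    digest = hashlib.sha256(f"{HOLDOUT_SALT}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return bucket < HOLDOUT_PERCENT

print(in_holdout("user-12345"))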

The first two steps of this ramp are mandatory; the last two are optional. The MPR outlines an optimal path to ramping experiments.
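To see why the even split of the MPR maximizes power, note that for a fixed total population and a fixed effect size, the standard error of the difference between variants is smallest when traffic is divided evenly. The short sketch below illustrates this numerically; the total population, effect size, metric variance, and α are illustrative assumptions:

# Approximate power of a two-sided test at different treatment allocations,
# holding the total population and the effect size fixed. Power peaks at
# the 50/50 split, the maximum power ramp.
import numpy as np
from scipy import stats

TOTAL_USERS = 100_000
EFFECT = 0.01   # absolute difference to detect
SIGMA = 0.5     # per-user metric standard deviation
ALPHA = 0.05

def power_for_split(treatment_share: float) -> float:
    n_t = TOTAL_USERS * treatment_share
    n_c = TOTAL_USERS * (1 - treatment_share)
    se = SIGMA * np.sqrt(1 / n_t + 1 / n_c)  # standard error of the difference
    z_alpha = stats.norm.ppf(1 - ALPHA / 2)
    return float(stats.norm.cdf(EFFECT / se - z_alpha))

for share in (0.05, 0.10, 0.25, 0.50):
    print(f"{share:.0%} in treatment -> power {power_for_split(share):.2f}")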

Building Alerting and Automation

A robust platform can use the collected telemetry data to monitor for adverse changes.

By building in metric thresholds, you can set limits within which the experimentation platform will detect anomalies and alert key stakeholders, not only identifying issues but also attributing them to their source. What begins with metric thresholds and alerting can quickly be coupled with autonomous actions, allowing the experimentation platform to act without human intervention. This action could be an automated kill (a reversion of the experiment to the control or safe state) in response to problems, or an automatic step along the ramp plan as long as the guardrail metrics remain within safe thresholds. By building metric thresholds, alerting, and autonomous ramp plans, experimentation teams can test ideas, drive outcomes, and measure impact faster.
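As a sketch of what such automation might look like, the snippet below evaluates a set of guardrail metrics against relative-degradation thresholds and flags the experiment for an automated kill when any guardrail is breached; the metric names, thresholds, and alerting behavior are illustrative assumptions:

# Evaluate guardrail metrics for one telemetry batch and decide whether to
# alert and revert the experiment. Thresholds are relative drops versus control.
from dataclasses import dataclass

@dataclass
class Guardrail:
    metric: str
    max_relative_drop: float  # e.g., 0.02 tolerates a 2% drop versus control

GUARDRAILS = [
    Guardrail(metric="page_load_success_rate", max_relative_drop=0.01),
    Guardrail(metric="daily_active_sessions", max_relative_drop=0.02),
]

def breached_guardrails(control: dict, treatment: dict) -> list:
    breached = []
    for g in GUARDRAILS:
        drop = (control[g.metric] - treatment[g.metric]) / control[g.metric]
        if drop > g.max_relative_drop:
            breached.append(g.metric)
    return breached

breaches = breached_guardrails(
    control={"page_load_success_rate": 0.990, "daily_active_sessions": 1_000_000},
    treatment={"page_load_success_rate": 0.960, "daily_active_sessions": 995_000},
)
if breaches:
    print(f"ALERT: guardrail breach on {breaches}; reverting experiment to control")
else:
    print("Guardrails healthy; safe to take the next ramp step")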

1 Zhao, Zhenyu, et al. “Online Experimentation Diagnosis and Troubleshooting Beyond AA Validation.” In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA). IEEE, 2016.

2 Xu, Ya, Weitao Duan, and Shaochen Huang. “SQR: Balancing Speed, Quality and Risk in Online Experiments.” arXiv:1801.08532.
