Peek and conclude experiments adaptively without inflating the false-positive rate
Do you estimate the effect size and calculate the sample size before running your A/B test? Do you check the p-value only once when you conclude the A/B test? When the p-value is 5.1% (0.1 percentage points above the 5% threshold) and the pre-calculated sample size has been reached, instead of running the experiment longer, do you stop the experiment and conclude that the product feature is not improving business? If your answer to any of those questions is no, chances are the false-positive rate (Type I error rate) of your experiments is inflated.
The false-positive rate is inflated when experimenters peek and make decisions based on observed results before the experiment reaches the pre-calculated sample size. With fixed sample size testing (e.g., t-test), it is best practice to estimate the required sample size before launching the experiment and to conclude the experiment only once that sample size has been reached, but many practitioners don’t follow this in practice.
Is the best solution to ask the experimenter to get a statistics degree and follow the procedure faithfully? We think the experimenters have strong reasons to peek, and peeking should be allowed. In this article, we introduce our study on the peeking problem and how we implemented the always valid p-value at Wish to control the false-positive rate while enabling experimenters to conclude their experiments adaptively.
We evaluate the false-positive rates by running 1,000 simulated A/A tests with and without peeking. Each A/A test is constructed using simulated bucket assignments and real metric data (one month of gross merchandise value data). In each A/A test, we apply Welch’s t-test with the null hypothesis that there is no difference between the two experiment buckets. Any detected difference is a false positive since, in A/A tests, the two experiment buckets are from the same population.
Without peeking (i.e., making conclusions at the same sample size for all 1,000 A/A tests), the false positive rate was about 4.7%. In contrast, if we were to conclude the experiment the first time the p-value was below the threshold 5%, the false-positive rate was about 21%.
Even when we were more conservative and concluded an experiment only when the p-value stayed below 5% for n consecutive days, the false-positive rate was still high. The table below demonstrates the false positive rates corresponding to different values of n.
A/B test platforms typically leverage fixed sample size testing (e.g., t-test), where the required sample size is estimated in advance. The fixed sample size testing optimally trades off the false positive rate and the probability of detecting true positives (power) only if the experimenters conclude the experiment only once after the pre-calculated sample size has been reached.
The false-positive rate increases substantially when concluding experiments adaptively, i.e., concluding an experiment as soon as it becomes statistically significant. The figure below depicts the trajectory of the p-value over time for a typical A/A test. We apply Welch’s t-test for each day of the one-month A/A experiment. It is not rare for the p-value to fall below the red line (the decision threshold p-value < 0.05) several times over the whole course of an A/A experiment. Furthermore, if we run the experiment indefinitely and conclude the experiment as soon as the p-value falls below 5%, the false positive rate goes to 100%. That is, the p-value always drops below 5% at some point due to chance.
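The effect of these stopping rules can be reproduced with a small simulation. The sketch below is not the study described above: it assumes synthetic standard normal data instead of real GMV data, uses a normal approximation to Welch's t-test so that it only needs the standard library, and all function names and parameters are hypothetical. It compares concluding once at the end against concluding the first time the p-value stays below 5% for k consecutive days.

```python
import math
import random

def welch_p_value_normal_approx(n_a, sum_a, sumsq_a, n_b, sum_b, sumsq_b):
    """Two-sided p-value for a difference in means, computed from running sums.

    Uses the Welch statistic with a normal approximation to the t
    distribution, which is reasonable for the sample sizes used here.
    """
    mean_a, mean_b = sum_a / n_a, sum_b / n_b
    var_a = (sumsq_a - n_a * mean_a ** 2) / (n_a - 1)
    var_b = (sumsq_b - n_b * mean_b ** 2) / (n_b - 1)
    z = (mean_a - mean_b) / math.sqrt(var_a / n_a + var_b / n_b)
    return math.erfc(abs(z) / math.sqrt(2))

def daily_p_values(rng, days, users_per_day):
    """P-value after each day of one A/A test (both buckets share one distribution)."""
    n = 0
    sa = qa = sb = qb = 0.0
    p_values = []
    for _ in range(days):
        for _ in range(users_per_day):
            a, b = rng.gauss(0, 1), rng.gauss(0, 1)
            sa += a
            qa += a * a
            sb += b
            qb += b * b
        n += users_per_day
        p_values.append(welch_p_value_normal_approx(n, sa, qa, n, sb, qb))
    return p_values

def rejects(p_values, alpha, consecutive):
    """True if the p-value is below alpha for `consecutive` days in a row at any point."""
    streak = 0
    for p in p_values:
        streak = streak + 1 if p < alpha else 0
        if streak >= consecutive:
            return True
    return False

def simulate(n_tests=400, days=30, users_per_day=50, alpha=0.05, seed=0):
    """Return (fixed-horizon FPR, {k: FPR when stopping after k consecutive days})."""
    rng = random.Random(seed)
    paths = [daily_p_values(rng, days, users_per_day) for _ in range(n_tests)]
    fixed = sum(p[-1] < alpha for p in paths) / n_tests
    peeking = {k: sum(rejects(p, alpha, k) for p in paths) / n_tests
               for k in (1, 2, 3)}
    return fixed, peeking

if __name__ == "__main__":
    fixed, peeking = simulate()
    print(f"fixed horizon: {fixed:.3f}")
    for k, rate in peeking.items():
        print(f"peeking, {k} consecutive day(s): {rate:.3f}")
```

Because every A/A test that is significant on the last day is also caught by the peeking rule with k = 1, peeking can never produce fewer false positives than the fixed-horizon rule on the same data. The exact rates depend on the number of days and the data distribution.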
If peeking substantially inflates false-positive rates, why don’t experimenters always calculate the required sample size and conclude the experiment only when the pre-calculated sample size has been reached?
Sample size calculation is hard and inaccurate
It is hard to calculate the sample size accurately before running the experiment, since doing so requires the experimenter to estimate the effect size, i.e., how much impact the new product feature makes. Before running the experiment, the experimenter typically does not have good knowledge of the impact her new product feature has. Even if the experimenter has extensive domain knowledge about the metric her new product feature aims to move and can estimate the effect size from previous experiments, the calculated sample size can be vastly larger than necessary if she underestimates the effect size, since the required sample size is inversely proportional to the square of the effect size. For example, the experimenter would wait four (or 100) times longer when the estimated effect size was half (or 10%) of the true effect size.
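To make the quadratic dependence concrete, here is a minimal sketch using the standard per-bucket sample size approximation for a two-sided two-sample z-test, n = 2(z₁₋α/₂ + z_power)²σ²/δ². The numbers for the true effect and the standard deviation are made up for illustration, and the function name is hypothetical.

```python
from statistics import NormalDist

def sample_size_per_bucket(effect, sigma, alpha=0.05, power=0.8):
    """Approximate per-bucket sample size for a two-sided two-sample z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return 2 * (z_alpha + z_power) ** 2 * sigma ** 2 / effect ** 2

true_effect = 1.0  # hypothetical true lift, in metric units
sigma = 20.0       # hypothetical per-user standard deviation

base = sample_size_per_bucket(true_effect, sigma)
for fraction in (1.0, 0.5, 0.1):
    n = sample_size_per_bucket(fraction * true_effect, sigma)
    print(f"assumed effect = {fraction:.0%} of true effect: "
          f"n = {n:,.0f} per bucket ({n / base:.0f}x)")
```

Since the effect size enters only through 1/δ², the ratios of 4x and 100x hold regardless of the chosen α, power, or σ.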
Experimenters have good reasons to peek
Although peeking inflates false-positive rates, experimenters have strong incentives to peek, because 1) data comes in continuously for online experiments that run at tech companies, and 2) concluding the experiments as soon as possible benefits the customers and the company.
In modern online experiments, experimenters observe data in real time, which enables the experimenter to adjust the sample size adaptively (known as adaptive sample size testing). Data scientists are frequently asked by business partners how long to wait until the experiment reaches statistical significance, and data scientists can wait longer or stop sooner to adjust the sample size. This is impossible for agriculture studies, for which most experiment design theories were developed 100 years ago. In agriculture experiments, even if the experimenter wants to peek, she has no choice but to wait until the end of the crop season to collect data.
Experimenters want to conclude the experiments as soon as they can to improve or stop hurting user experience. With fixed sample size testing, experimenters should not roll out a beneficial feature to all the users before the required sample size has been reached, even if the experiment results already suggest significant improvement. Otherwise, the fixed sample size testing (e.g., t-test) leads to a high false-positive rate. Conversely, if the experimenter does not want to mistakenly kill a product feature, she cannot stop the experiment even if the early experiment results show significant degradation.
Instead of asking all practitioners, who may not necessarily have rigorous statistical training, to follow the best practice of experiment design, which sometimes is counter-intuitive, we provide practitioners at Wish always valid p-values. This allows us to conclude an experiment as soon as the experiment results show a significant positive or negative impact.
We implemented the always valid p-value in the internal experimentation platform at Wish. A great advantage of the always valid p-value is that the end-users do not see any differences in the experimentation UI and can interpret the always-valid p-values the same way as they do with p-values resulting from fixed sample size tests. In addition, when data scientists launch their experiments, they do not need to bang their heads against the wall to figure out what the required sample size is.
If you want to know the mathematical details of how this approach works, please read through the rest of this section. If you are just interested in using this approach, you can skip the rest of this section without affecting your understanding.
A key message here is that the false-positive rate of your experiment is well controlled even when you peek — conclude the experiment as soon as you see the p-value below the threshold (e.g., 5%).
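To have p-values that are always valid no matter when the experimenters choose to stop the experiment, we need to ensure
$$P_{\theta_0}\left(p_T \le s\right) \le s \quad \text{for all } s \in [0, 1] \text{ and any stopping time } T.$$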
That is, under the null hypothesis θ₀, the probability of p-value smaller than the threshold s, which ranges from 0 to 1, is smaller than s at any time T. This implies that even though p-values are viewed at a time that is data dependent, false-positive rates are still below the pre-defined threshold s. For example, if we set the threshold at 5%, we can conclude the experiment as soon as the always-valid p-value is below 5%, and the FPR is below 5%.
The always valid p-value approach leverages sequential testing, which is applied in clinical research with data that arrive sequentially over time. Ramesh Johari et al. introduced a type of sequential testing, the mixture sequential probability ratio test (mSPRT), that applies to A/B testing to enable sequential decisions without inflating the false-positive rate. In essence, the always valid p-value is calculated as 1/Λ, where Λ is computed as follows
$$\Lambda_n = \int \prod_{i=1}^{n} \frac{f_{\theta}(X_i)}{f_{\theta_0}(X_i)} \, h(\theta) \, d\theta$$
Here, f_θ is the density of an observation under parameter θ, and h(θ) is the mixture distribution of θ under the alternative hypothesis. Λ represents the ratio of the likelihood of the data under a mixture of alternative hypotheses to the likelihood of the data under the null hypothesis. Intuitively, the larger Λ is, the stronger the evidence against the null hypothesis, which is why the p-value 1/Λ shrinks as Λ grows.
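It can be proven that Λ is a nonnegative martingale with expectation 1 under the null hypothesis, and therefore, by Ville's maximal inequality,
$$P_{\theta_0}\left(\exists\, n \ge 1 : \Lambda_n \ge 1/s\right) \le s.$$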
Hence 1/Λ follows the definition of always valid p-values.
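Specifically, for the null hypothesis of no difference between the buckets and a normal mixing distribution h(θ) with mean 0 and variance τ², Λ is estimated as follows
$$\Lambda_n = \sqrt{\frac{V_n}{V_n + n\tau^2}} \, \exp\left(\frac{n^2 \tau^2 \left(\bar{Y}_n - \bar{X}_n\right)^2}{2 V_n \left(V_n + n\tau^2\right)}\right)$$
- Y̅ₙ is the sample mean of the treatment bucket and X̅ₙ is the sample mean of the control bucket when the sample size of each bucket is n
- Vₙ is the sum of the sample variance of the control bucket and the sample variance of the treatment bucket.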
Zhenyu Zhao et al. have shown that τ in the mixing distribution h(θ) can be estimated as follows:
where z is the Z statistic, α is the significance level, and v_ctrl and v_trt are the sample variances of the control and treatment buckets, respectively.
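As an illustration, the sketch below computes always valid p-values from two streams of observations. It is not Wish's implementation. It assumes the closed form of Λ for a normal mixing distribution with mean 0 and variance τ², namely Λₙ = sqrt(Vₙ/(Vₙ + nτ²)) · exp(n²τ²(Y̅ₙ − X̅ₙ)² / (2Vₙ(Vₙ + nτ²))), plugs in sample variances for Vₙ, and takes τ² as a user-supplied value (`tau_sq`, a hypothetical parameter) rather than estimating it. Following Johari et al., the reported p-value is the running minimum of 1/Λ capped at 1, so it never increases.

```python
import math
import random

def always_valid_p_values(control, treatment, tau_sq):
    """Always valid p-value after each pair of observations.

    control, treatment: equal-length sequences; element i is the i-th
    observation of each bucket.
    tau_sq: variance of the normal mixing distribution (user-supplied
    in this sketch, not estimated).
    """
    p_values = []
    p = 1.0
    n = 0
    sx = qx = sy = qy = 0.0
    for x, y in zip(control, treatment):
        n += 1
        sx += x
        qx += x * x
        sy += y
        qy += y * y
        if n < 2:
            p_values.append(p)  # sample variance is undefined for n = 1
            continue
        mean_x, mean_y = sx / n, sy / n
        # V_n: sum of the two sample variances.
        v = ((qx - n * mean_x ** 2) + (qy - n * mean_y ** 2)) / (n - 1)
        if v <= 0:
            p_values.append(p)  # degenerate data, keep the previous p-value
            continue
        diff = mean_y - mean_x
        log_lambda = (0.5 * math.log(v / (v + n * tau_sq))
                      + n * n * tau_sq * diff * diff / (2 * v * (v + n * tau_sq)))
        # p_n = min(p_{n-1}, 1 / Lambda_n); starting from 1 caps it at 1.
        p = min(p, math.exp(-log_lambda))
        p_values.append(p)
    return p_values

# Usage with synthetic data: the treatment mean is shifted by 0.3.
rng = random.Random(1)
control = [rng.gauss(10.0, 2.0) for _ in range(5000)]
treatment = [rng.gauss(10.3, 2.0) for _ in range(5000)]
p = always_valid_p_values(control, treatment, tau_sq=0.05)
first = next((i + 1 for i, value in enumerate(p) if value < 0.05), None)
print("final p-value:", p[-1])
print("first sample size with p < 0.05:", first)
```

Working with log Λ instead of Λ avoids overflow when the evidence against the null is very strong.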
We evaluated the false-positive rates of the always valid p-value approach by running A/A tests. In each A/A test, we compute the always valid p-values after the arrival of each data point, and we conclude the experiment either when the always valid p-value is below 5% or the experiment has reached a predetermined sample size, which is much larger than the required sample size of any typical experiment. Our A/A study showed that the false positive rate of the always valid p-value approach is below 5%.
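A scaled-down version of this kind of A/A check can be sketched as follows. It assumes synthetic standard normal data, a fixed τ² of 0.01, and a short burn-in before the first check so that the plug-in variance estimate is stable. All three are assumptions of this sketch, not details of the study above.

```python
import math
import random

def msprt_rejects(rng, max_n, tau_sq, alpha, burn_in=10):
    """Run one A/A test; return True if 1 / Lambda ever drops below alpha."""
    sx = qx = sy = qy = 0.0
    for n in range(1, max_n + 1):
        x, y = rng.gauss(0, 1), rng.gauss(0, 1)  # same distribution: an A/A test
        sx += x
        qx += x * x
        sy += y
        qy += y * y
        if n < burn_in:
            continue  # wait until the sample variances are reasonably stable
        # Sum of the two sample variances.
        v = (qx - sx * sx / n + qy - sy * sy / n) / (n - 1)
        diff = sy / n - sx / n
        log_lambda = (0.5 * math.log(v / (v + n * tau_sq))
                      + n * n * tau_sq * diff * diff / (2 * v * (v + n * tau_sq)))
        if log_lambda > -math.log(alpha):  # equivalent to 1 / Lambda < alpha
            return True
    return False

def false_positive_rate(n_tests=300, max_n=2000, tau_sq=0.01, alpha=0.05, seed=0):
    """Share of simulated A/A tests that are (wrongly) declared significant."""
    rng = random.Random(seed)
    hits = sum(msprt_rejects(rng, max_n, tau_sq, alpha) for _ in range(n_tests))
    return hits / n_tests

if __name__ == "__main__":
    print(f"false-positive rate with continuous peeking: {false_positive_rate():.3f}")
```

The check runs after every data point, which is the most aggressive form of peeking, so a rate at or below 5% here is the behavior the always valid p-value is designed to deliver.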
We also studied the true positive rate (power) of the always valid p-value approach as follows:
1. Simulate two weeks of data, where the mean of the treatment bucket is 0.1% larger than the mean of the control bucket.
2. Compute p-values using the t-test and the always-valid p-value approach after the arrival of each day’s data.
3. Declare detection of a difference for the t-test if the p-value resulting from Welch’s t-test is smaller than 5% on any of the 14 days, and for the always-valid approach if the always-valid p-value of the last day is smaller than 5%.
4. Repeat steps 1 to 3 10,000 times.
5. Increase the difference from 0.1% to a larger value, then repeat steps 1 to 4.
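The procedure above can be sketched in a few lines. This is a scaled-down illustration under several assumptions that differ from the study: data are synthetic normal draws, the effect is expressed in standard deviations rather than as a relative lift, the t-test uses a normal approximation, τ² is fixed at 0.01, and far fewer repetitions are run. Function names and parameters are hypothetical.

```python
import math
import random

def detect(rng, effect, days=14, users_per_day=100, tau_sq=0.01, alpha=0.05):
    """Simulate one experiment; return (t_test_detects, always_valid_detects)."""
    n = 0
    sx = qx = sy = qy = 0.0
    t_hit = av_hit = False
    for _ in range(days):
        for _ in range(users_per_day):
            x, y = rng.gauss(0.0, 1.0), rng.gauss(effect, 1.0)
            sx += x
            qx += x * x
            sy += y
            qy += y * y
        n += users_per_day
        # Sum of the two sample variances.
        v = (qx - sx * sx / n + qy - sy * sy / n) / (n - 1)
        diff = sy / n - sx / n
        # Welch statistic with a normal approximation (large samples).
        z = diff / math.sqrt(v / n)
        t_hit = t_hit or math.erfc(abs(z) / math.sqrt(2)) < alpha
        # mSPRT with a normal mixing distribution of variance tau_sq.
        log_lambda = (0.5 * math.log(v / (v + n * tau_sq))
                      + n * n * tau_sq * diff * diff / (2 * v * (v + n * tau_sq)))
        # A running-minimum p-value is below alpha on the last day exactly
        # when 1 / Lambda was below alpha on some day.
        av_hit = av_hit or log_lambda > -math.log(alpha)
    return t_hit, av_hit

def power(effect, reps=150, seed=0):
    """Detection rates of (peeking t-test, always valid p-value) for one effect size."""
    rng = random.Random(seed)
    results = [detect(rng, effect) for _ in range(reps)]
    return (sum(t for t, _ in results) / reps,
            sum(a for _, a in results) / reps)

if __name__ == "__main__":
    for effect in (0.0, 0.05, 0.1, 0.3):
        t_rate, av_rate = power(effect)
        print(f"effect={effect:.2f}  t-test (peeking)={t_rate:.2f}  "
              f"always valid={av_rate:.2f}")
```

With an effect of 0 the detection rates are false-positive rates, which makes the unfairness of the comparison visible: the peeking t-test buys part of its apparent power with an inflated false-positive rate.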
We tested different effect sizes (ranging from 0.1% to 1%) and compared the power of the always-valid approach with that of the t-test. Note this is not a fair comparison since the t-test has a higher false-positive rate when applied sequentially. We plot the true positive rate of the two approaches over different effect sizes. The plot below suggests that even compared to a t-test with an inflated false-positive rate, the always-valid p-value approach leads to satisfactory power.
A caveat of the always-valid p-values is that they tend to be less reliable when buckets are highly imbalanced. In practice, at Wish, we apply the technique when two buckets are not too imbalanced (e.g., the ratio of treatment to control is not larger than 5).
Although peeking substantially inflates false-positive rates, practitioners have good reasons to peek. To empower them to conclude the experiments as soon as possible while controlling the false positive rate, we implemented always valid p-values, which enable experimenters to adjust the experiment length (sample size) in real time. Our simulation studies suggest that the always valid p-value approach controls the false positive rate and has satisfactory true positive rates.
Thanks to Eric Jia and Chao Qi for their contributions to this project. I am also grateful to Pai Liu, Pavel Kochetkov, and Lance Deng for their feedback on this post.
Johari, Ramesh, et al. “Peeking at A/B tests: Why it matters, and what to do about it.” Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2017.
Zhao, Zhenyu, Mandie Liu, and Anirban Deb. “Safely and quickly deploying new features with a staged rollout framework using sequential test and adaptive experimental design.” 2018 3rd International Conference on Computational Intelligence and Applications (ICCIA). IEEE, 2018.