Jun 28, 2022

The Complexities of Experimentation in Reality

Source: https://blog.revolutionanalytics.com/2010/12/arguing-with-a-statistician.html

Reference: This note details the second topic in the 7 broader topics of discussion in experimentation. Further, the points listed in this blog will be later converted into their own blog posts.

Dull and lifeless are towns that have an overdone symmetry where all the roads look the same. Tedious are days that pass by in a fixed repetitive routine when nothing unexpected happens. Mundane would be the patterns of the world had they been exactly how our statistical models define them to be. The world is random and much more chaotic than we expect it to be. And that is the beauty of it. The fact that we need to struggle through that randomness is, I believe, what makes life interesting.

For instance, let us imagine driving through the mountains. From a distance, mountains look like cones distributed randomly through the landscape. But once you start climbing them, you find them in different shapes, different sizes, and with different textures. When a scientist creates a model of reality, they trim off some part of the randomness in reality so that it becomes tractable to study. Suppose a mathematician is given the job of calculating the surface area of mountains. A relatively simple model would be: "Let's assume that the mountain is like a triangle, so let's measure the base and the height and calculate its area." A better model would be: "Let's assume that the mountain is like a cone, so let's measure the base radius and the height and calculate its surface area."
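To see how much the choice of model matters, here is a minimal sketch that compares the two simplifications. The mountain dimensions are hypothetical numbers picked only for illustration, and the function names are my own.

```python
import math

def triangle_area(base: float, height: float) -> float:
    """Area of a flat triangular silhouette: the cruder model."""
    return 0.5 * base * height

def cone_lateral_area(base_radius: float, height: float) -> float:
    """Sloping (lateral) surface area of a right circular cone."""
    slant_height = math.hypot(base_radius, height)
    return math.pi * base_radius * slant_height

# Hypothetical mountain: base radius 3 km, height 4 km.
radius_km, height_km = 3.0, 4.0

flat_estimate = triangle_area(2 * radius_km, height_km)
cone_estimate = cone_lateral_area(radius_km, height_km)

print(f"triangle model: {flat_estimate:.1f} sq km")
print(f"cone model:     {cone_estimate:.1f} sq km")
```

For this made-up mountain the triangle model gives 12 sq km while the cone model gives roughly 47 sq km. Neither is the true surface area, but the cone's assumptions are closer to the real shape, so its answer is more useful.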

There are no right models to simplify the randomness in the world. It is just that some models make good assumptions about the real world and therefore make accurate predictions. When these assumptions are not met, the results of these models are flawed and deviate from the truth in proportion to how starkly the assumptions are violated.

There is a famous quote among statisticians, usually attributed to George Box, that all models are wrong, but some of them are useful. All models are trimmed-down versions of reality based on the assumptions they make. An A/B test also makes assumptions about reality. Whenever those assumptions are violated, the results of A/B tests suffer. Experimenters should know these assumptions so that they can tell when the results of an experiment need to be looked at in a different way or should not be trusted.

Assumptions behind A/B Testing

In this section, I discuss the assumptions about the experiment that exist behind all A/B Testing frameworks.

Stationarity Assumption: The stationarity assumption says that the underlying data generating process does not change with time. Hence, if the observed mean conversion rate is 10%, it is 10% across time. It does not become 15% on weekends and 5% on weekdays.
Independent and Identically Distributed Assumption: The IID assumption demands that the data points that are collected are not correlated with each other in any way and that each sample is an independent draw from the same distribution. When samples are not independent, we often are not able to get a representative picture of the entire population.
Perfect Randomisation Assumption: Randomised Control Trials, in general, assume that the assignment of a visitor to the control or treatment group is completely random and that no bias creeps into the randomisation procedure.
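To make the stationarity assumption concrete, here is a minimal sketch that reuses the numbers from the example above (5% on weekdays, 15% on weekends). It assumes equal traffic on every day, which is my simplification, and it computes the conversion rate you would expect to observe depending on when a test starts and how long it runs.

```python
# Assumed daily conversion rates, Monday through Sunday:
# 5% on weekdays and 15% on weekends, as in the example in the text.
DAILY_RATE = [0.05, 0.05, 0.05, 0.05, 0.05, 0.15, 0.15]

def expected_rate(start_day: int, n_days: int) -> float:
    """Expected pooled conversion rate for a test that starts on
    start_day (0 = Monday) and runs for n_days, with equal daily traffic."""
    rates = [DAILY_RATE[(start_day + i) % 7] for i in range(n_days)]
    return sum(rates) / n_days

for start_day, name in [(0, "Monday"), (4, "Friday")]:
    by_duration = {n: round(expected_rate(start_day, n), 4) for n in (3, 7, 14)}
    print(f"start on {name}: {by_duration}")
```

Under these assumptions, a three-day test sees about 5% if it starts on Monday but about 11.7% if it starts on Friday, while any whole number of weeks gives about 7.9% regardless of the start day.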

Please do let me know if there is a crucial assumption that I am missing from my list. When any of these assumptions is not met, the results of an A/B test can be skewed. In this note, I give an overview of some of the real-world anomalies that distort A/B testing results.

Interesting Nuances of Real-World Experiments

A/B Testing has brought experimentation to its apex because data collection has become easy with the emergence of the internet. In today's world, experimentation is being applied to many different use-cases and many of those use-cases do not meet the assumptions of A/B testing. These are common patterns that keen experimenters should know to better understand their experiments.

Weekday-Weekend Effect: For many online products, user engagement varies by day of the week, and hence the stationarity assumption is compromised. For instance, an e-commerce website may see increased sales towards the end of the week. In such cases, the day on which the experiment started matters heavily because the initial data collection is often biased. A common practice to average out any weekly cycles is to run a test for at least two weeks. These patterns can exist not just at the weekly level, but at monthly and seasonal levels as well.
Primacy/Novelty Effects: It is a known pattern among experimenters that when a new feature is introduced, people sometimes click on it more often simply because it is new (novelty effects). Other times, people are used to the previous design and hence take time to get used to the new feature (primacy effects). Both these effects wear off after some time, and in that period an experiment might generate an untrustworthy result (false positives in the case of novelty effects, false negatives in the case of primacy effects). This is because the stationarity assumption is not met. Before trusting your decision, it is always advisable to break down the conversion rates by day and check whether they show a falling or rising pattern.
Twyman's Law: The law says that "Any figure that looks interesting or different is usually wrong". Whenever you see an extraordinary improvement or deterioration in the goal metrics of an experiment, it is always advisable to check the entire experiment pipeline once again. While extraordinarily large effects are theoretically possible, something being wrong with the experiment is often far more likely than an unusually large impact of the tested idea. Twyman's law is a meta-level concept to keep in mind so that you double-check surprising results before implementing them.
One Factor at a Time (OFAT): Many proponents of experimentation advocate making only one change per experiment, an approach called OFAT. Other experimenters want to speed up their work by combining a set of changes and testing them at once. Both approaches are theoretically correct. The only problem is that if you see the impact of a bunch of changes together, it is hard to separate out which change had the most impact and which change was useless. Whether to go with OFAT or not is a personal call, and neither approach is technically wrong.
Triggering: Sometimes in real-world experimentation, if the target audience is not carefully selected, you end up adding noise to your experimentation data. For instance, suppose an e-commerce website wants to test the impact of changing the minimum order requirement for free shipping. Suppose that shipping charges were earlier waived above $35 and the proposed threshold is $25. The only customers who genuinely see a change in this case are those with an order value between $25 and $35. Everyone else will see practically no change on the website. Hence, to increase the power of the test, it should be run only on the impacted set of users and not on everyone. Note that not implementing triggering does not affect the trustworthiness of the result, but it does require more data points to reach a conclusion.
Observational Causal Studies: Sometimes in the real world, a controlled experiment cannot be implemented. For instance, if a firm wants to reduce the price of its product and see the impact on sales, it cannot sell the product at a lower price to half the people and at a higher price to the other half. Hence, companies have to deploy approximate cousins of A/B testing to establish causality. One such method is the interrupted time-series test, in which the feature is introduced and removed multiple times and the goal statistic is observed at every junction. If you see a spike every time the change is introduced and a drop every time it is removed, you can establish causality.
Overall Evaluation Criteria: The primary goal of an experiment has to be chosen very carefully. A wrongly chosen goal can often lead to a wrong conclusion. For instance, an experiment at Microsoft once set out to observe the impact on product sales of showing prices on the product page. Earlier, the price was shown only after the user clicked on "Buy". The goal of the experiment was chosen to be the conversion rate on the "Buy" button. A 64% drop in conversion rates was observed in the experiment. In other words, just showing the price of the product one level higher in the funnel led to a 64% drop in the number of people clicking on "Buy". The problem with the experiment was that even though 64% fewer people went to the next step of the funnel, the total number of people completing the sale on the next page increased. The change had simply filtered out, earlier in the funnel, the people who were not comfortable with the price. The sales of the product had actually increased.
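The triggering point above can be made concrete with a rough sample-size calculation. This is only a sketch under stated assumptions: it uses the standard normal-approximation formula for a two-sided test of two proportions, it assumes the baseline conversion rate is the same for affected and unaffected visitors, and all the numbers (10% baseline, a 2-point lift, 20% of visitors affected) are hypothetical.

```python
from statistics import NormalDist

def n_per_arm(baseline: float, abs_lift: float,
              alpha: float = 0.05, power: float = 0.8) -> float:
    """Approximate visitors needed per arm to detect abs_lift
    with a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = 2 * baseline * (1 - baseline)  # both arms near the baseline rate
    return (z_alpha + z_beta) ** 2 * variance / abs_lift ** 2

# Hypothetical inputs, not taken from any real experiment.
baseline = 0.10         # conversion rate without the change
lift = 0.02             # absolute lift among visitors who see a change
triggered_share = 0.20  # fraction of visitors who see a change at all

# Analysing everyone: the lift is diluted by the unaffected visitors.
n_everyone = n_per_arm(baseline, lift * triggered_share)

# Analysing only triggered visitors: the full lift is visible, but only a
# share of traffic qualifies, so convert back to total visitors per arm.
n_triggered_total = n_per_arm(baseline, lift) / triggered_share

print(f"visitors per arm, analysing everyone:  {n_everyone:,.0f}")
print(f"visitors per arm, analysing triggered: {n_triggered_total:,.0f}")
print(f"ratio: {n_everyone / n_triggered_total:.1f}")
```

Under these assumptions, analysing everyone needs about 1 / triggered_share times as much traffic (five times here) to reach the same power. The untriggered test is still valid; it just needs more data.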

The way ahead

This note is an introduction to the kinds of real-world complexities faced in experimentation. I plan to update it whenever I find an interesting topic to write about under this heading. In subsequent notes, I will write a blog post on each of the above points.