A graduate-level statistics course usually starts with an innocuous discussion of the mean, median, and mode and the fundamentals of a probability distribution. A few weeks later the discussion might still seem tractable as the professor tells you all about hypothesis tests, null hypotheses, and p-values, making you ridicule the apparent aversion most students show towards statistics. But before you realize it, you will find yourself grappling with complex statistical assumptions such as normality, stationarity, and homoskedasticity (pronounced 'J-aar-g-un'). While you are still fumbling to spell these words, the professor will jump to the various statistical tests you can run to validate these assumptions for different kinds of data. Finally, by the end of the term, you will realize that there is hardly anything in statistics that comes without assumptions, and hardly any assumption that comes without the need to be tested.
And one fine day, you will wonder whether it was all a facade that statisticians put in place to nauseate anyone who might be interested in understanding their work. You will wonder what value the theorems of statistics hold if their very validity hinges on countless assumptions that might not be true in many real-world cases. The fact that even the tests for these assumptions have their own false positive rates, hinged on yet more assumptions, will finally make you appreciate the whole world's aversion towards statistics and statisticians.
All of statistics is built on assumptions because our world is much more random, much more chaotic than we can coherently explain with our statistical models. Most statistical models hence make strong assumptions and only guarantee accuracy as long as those assumptions hold. Assumptions allow statisticians to solve scenarios that would be much harder to solve otherwise. One such assumption is stationarity. Funnily enough, it is not talked about very much, yet it is the central assumption behind A/B testing.
The dictionary definition of stationary is simply something that does not move. In the statistical context, stationarity refers to distributions that are stable and do not change over time. As a good proxy for thinking about stationarity, ask yourself whether the mean of a random process can change over time.
To take a few examples, the distribution of adult human height is practically stationary. The distribution of males and females in the population is mostly stationary but might change over a span of a few decades. Finally, the distribution of a stock price is highly non-stationary because every few weeks the price attains a new equilibrium and hovers around it.
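The "can the mean drift?" proxy is easy to check numerically. Here is a minimal sketch (all numbers are made up for illustration) comparing white noise around a fixed mean, which is stationary, with a random walk, which is not:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stationary process: white noise around a fixed mean of 10.
stationary = rng.normal(loc=10.0, scale=1.0, size=1000)

# Non-stationary process: a random walk, whose mean drifts over time.
random_walk = np.cumsum(rng.normal(size=1000))

def halves_mean_gap(series):
    """Absolute difference between the means of the first and second half."""
    mid = len(series) // 2
    return abs(series[:mid].mean() - series[mid:].mean())

# The gap is tiny for the stationary series and typically large
# for the random walk, whose local mean keeps wandering away.
print(halves_mean_gap(stationary))
print(halves_mean_gap(random_walk))
```

Formal tests for stationarity exist, but this crude first-half-versus-second-half comparison already captures the intuition: if the mean depends on *when* you look, the process is not stationary.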
Imagine a website and its conversion rate. It is quite possible that people tend to buy more on weekends, and hence conversion rates are much higher on weekends than on weekdays. Moreover, it is also possible that different segments of your population have different conversion rates. If your website is global, the west might have a preference for certain things and the east for certain others. Conversion rates across different products will hence also vary between day and night, as different parts of the world wake up.
When you ask about the average conversion rate of your website, all of these variations over time and space are averaged out and you are handed a single percentage figure (such as 10%), but that does not mean the rate is stationary at all. Stationarity also depends on the window you are looking at: something that is stationary over a year might not be stationary over a week.
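To make this concrete, here is a small simulation (the 8% weekday and 15% weekend rates, and the traffic volume, are hypothetical) showing how a weekly cycle averages out into a single, innocent-looking figure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical daily conversion rates: 8% on weekdays, 15% on weekends.
weekday_rate, weekend_rate = 0.08, 0.15
visitors_per_day = 10_000

visits, conversions = [], []
for day in range(28):  # four weeks of traffic
    is_weekend = day % 7 in (5, 6)
    rate = weekend_rate if is_weekend else weekday_rate
    visits.append(visitors_per_day)
    conversions.append(rng.binomial(visitors_per_day, rate))

overall = sum(conversions) / sum(visits)
# The aggregate lands close to 10%, hiding the weekday/weekend swing.
print(f"Overall conversion rate: {overall:.1%}")
```

The single aggregate number looks perfectly stable, yet the day-to-day rate it summarizes oscillates by almost a factor of two within every week.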
Stationarity is a core assumption behind A/B testing, but more often than not, online data is not stationary at all. Non-stationarity often causes A/A tests to declare winners, or A/B tests to declare premature winners that report exaggerated uplifts (the winner's curse). Non-stationarity fundamentally weakens the reliability of the statistical significance obtained from a test: if the underlying distribution changed during the test, there is a higher likelihood that it will change in the future as well. Understanding the assumption of stationarity helps you recognize when this can happen and tackle it accordingly.
Problems with Non-Stationarity
Some of the problems that are caused by non-stationarity are as follows:
- Premature test winners are not reliable: Sometimes statistical tests declare a winner much earlier because of a large improvement observed at the start. If you start a test on a Saturday, a relatively higher conversion rate might lead to a winner being called by Monday. Such a test would not even see one whole week, over which conversion rate patterns might be cyclic. To avoid these errors, always run tests for at least one full week, because most real-world users exhibit different patterns over the course of a week.
- The change being tested might be inducing non-stationarity: Say you want to test a change on your website that replaces the current banner, "Pamper yourself! Get a free foot massage" (Control), with a new banner saying, "Beat your Monday blues! Get a free foot massage" (Variation). Such a change would, by design, cause the variation to perform better on Mondays, whereas the control shows no Monday-specific effect. Hence, even if the control performs better over the entire week, on Monday it might seem like the test favors the variation. To manage this issue, prefer testing changes that do not induce non-stationarity. If you do have to test such changes, make sure the window is large enough to give an accurate picture.
- Multi-armed bandits (MABs) might distribute traffic erroneously: MABs are even more sensitive to stationarity. If one of the variations observes a high number of conversions due to non-stationarity, the MAB will start diverting a lot of traffic to that variation, and the other variation will receive too little traffic for its effect to be detected, even if it later starts to perform better. This is a tricky problem to solve because MABs are deployed as quick optimizations for short-term changes, and increasing their equal-allocation period often ends up contradicting their purpose.
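As an illustration of the bandit failure mode, here is a deliberately simplified sketch: a two-arm bandit with an equal-allocation phase followed by a greedy phase (real MABs use more refined policies such as Thompson sampling, and all rates here are hypothetical). A transient boost to one arm locks in its traffic advantage long after the boost is gone:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical two-arm test. Both arms convert at 10% in the long run,
# but a transient (non-stationary) effect boosts arm B to 40% for the
# first 400 visitors -- say, a weekend that favors B's messaging.
def true_rate(arm, t):
    return 0.40 if (arm == 1 and t < 400) else 0.10

pulls, successes = [0, 0], [0, 0]

for t in range(1000):
    if t < 400:
        arm = t % 2  # equal-allocation phase: alternate the arms
    else:
        # greedy phase: send all traffic to the best-looking arm so far
        rates = [successes[a] / pulls[a] for a in (0, 1)]
        arm = int(rates[1] > rates[0])
    pulls[arm] += 1
    successes[arm] += rng.random() < true_rate(arm, t)

print("arm A traffic:", pulls[0], "arm B traffic:", pulls[1])
# Arm B's early, temporary advantage persists in its observed rate,
# so the greedy phase keeps starving arm A even though the two arms
# are identical from visitor 400 onwards.
```

Arm A's estimate barely updates once the greedy phase starts, so the bandit has no way to discover that the arms have become identical; this is exactly the feedback loop that makes MABs fragile under non-stationarity.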
Non-stationarity exists in many forms, and stationarity is a commonly violated assumption behind A/B testing. In most cases, however, non-stationarity impacts all variations similarly, and hence the bias cancels out when you calculate statistical significance. Nonetheless, understanding stationarity will make you much more cautious when running experiments.
The A/B test makes some further assumptions about the underlying environment. One is perfect randomization, which assumes that traffic is perfectly split between control and variation and that nothing correlates with the randomizer itself. Another is the assumption of Independent and Identically Distributed samples, commonly known as the IID assumption, which requires that each data point collected in the test is independent of the others and drawn from the same distribution; otherwise the sample might not be representative of the population. These two assumptions are interesting as well, and they will be the topics of two separate blogs in the future.
All statistical procedures hide some assumptions in the background. Whenever you come across a machine learning model or a data science algorithm solving a problem that is stochastic in nature, always ask what assumptions are being made and carefully analyze their validity. It will help you see the difference between seemingly similar solutions and judge which model suits which situation. If the stakes are high, go on to find a way to validate whether the real-world data fit those assumptions. And if you are passionate, figure out what does not fit and try to build a model that relies on a more realistic assumption.