Context: This is the third article in the series The Statistical Nuances in Experimentation. The parent blog will give you broader context for the concepts discussed in this article.
To the armchair statistician who cares only about developing a finer statistical model of the world, a discussion of sample sizes is probably as dry as it gets. Much of theoretical statistics and machine learning conveniently assumes that the data will be sufficient for the complexity of the model. For applied statistics, however, the challenge of sample sizes is a central concern. Sample sizes are the fuel of any data science model, and a larger sample size is never a problem.
Experimentation fundamentally relies on sample sizes to reach accurate conclusions. Most companies that want to start experimenting worry about whether their sample sizes will be sufficient. Moreover, even if you already have a lot of data points, larger sample sizes burst open the limits of what you can do with data: statistical models are nourished by larger samples and grow with them. If the data scientist does not recognise the need to expand the model, the data will notify them precisely, through overfitting.
In the last blog on statistical power, we introduced the idea of sample sizes and the factors that affect the sample size a test requires. In this blog, we expand that discussion and look at what we can do to optimise sample sizes.
False Positive Rates and Visitor Efficiency
The two metrics that I have found most critical in defining the efficiency of an experimentation engine are the following:
- False Positive Rates: False Positive Rates represent the percentage of cases where your experiment ends up statistically significant even though, in reality, there was no difference between the control and the treatment. In a previous blog, I explained why false positive rates are usually higher than we expect them to be and why we cannot control them by methodology alone.
- Visitor Efficiency (Technical Name: Statistical Power): The second parameter of interest is how many visitors you need to reach the desired statistical power in the test. In other words, the sample size needed to detect a significant difference. I believe that statistical power is the crucial bottleneck for many experimenters. As explained above, optimising for visitor efficiency plays a huge role in data science, machine learning, and statistics.
There are a few reasons why I have left False Negatives out of the list above. I believe false negatives are caused by insufficient sample sizes, not by a flaw in methodology: if one theoretically had an infinitely large sample, every true difference would be called statistically significant (hence a false negative rate of zero). Also, online A/B testing has in practice evolved towards sequential testing, where an experimenter collects data points until the desired statistical power is reached. Lastly, Visitor Efficiency and False Negative Rates are interchangeable: you cannot independently control both (theoretically you can, but in practice it does not make sense).
Hence, by my current understanding, False Positive Rates and Visitor Efficiency (Statistical Power) are two orthogonal parameters that you can independently optimise in an experiment. I define visitor efficiency as a function of sample size: the percentage of true differences that get detected within n samples.
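To make this definition of visitor efficiency concrete, here is a minimal sketch of power as a function of sample size. It assumes a two-sample z-test with a 5% two-sided significance level; the effect size and standard deviation are hypothetical numbers chosen for illustration, not figures from this article.

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def visitor_efficiency(n, effect, sd):
    """Approximate power of a two-sample z-test with n visitors per arm.

    Normal approximation, 5% two-sided significance:
    power = Phi(effect * sqrt(n / 2) / sd - z_{0.975}).
    """
    z_alpha = 1.959964  # z_{0.975}, fixed for a 5% two-sided test
    return normal_cdf(effect * math.sqrt(n / 2.0) / sd - z_alpha)

# Power grows monotonically with n: the same 0.02 lift on a metric
# with standard deviation 0.5, measured at three traffic levels.
for n in (1_000, 5_000, 20_000):
    print(n, round(visitor_efficiency(n, effect=0.02, sd=0.5), 3))
```

The curve this traces out is the visitor-efficiency function: for a fixed true difference, it tells you what fraction of such differences you would detect with n samples.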
Ways to reduce required Sample Sizes
Extending the thread from statistical power, I believe there are, in practice, only a few fundamental levers that can be exploited to optimise sample sizes. These are as follows:
- Compromising on accuracy: The easiest way to reduce the required sample size is to compromise on accuracy. In simple words, if you have only 10 data points for a study, you can still go ahead with it; the results just won't be reliable. However, you can use the statistical power equation to make an informed choice about what you are compromising: the false positive rate or the false negative rate.
- Increasing the MDE: At first this seems counter-intuitive: who would not want to detect a larger effect size in exchange for a smaller sample? It seems like a win-win. But once you realise that the onus of generating that larger effect size is not on the experiment but on the experimenter, you realise that being able to test for smaller differences is a luxury. A smaller MDE means you can test nuanced changes (such as the 41 shades of blue). Larger MDEs require carefully thought-out changes, backed by causal models that explain why the change should be impactful. If you want to test with a smaller sample size, you can always try to generate more meaningful ideas to test.
- Decreasing the variance: If you look at the equation for statistical power, you will find that a crucial factor in statistical significance is the standard deviation of the data. To be precise, it is the standard deviation of the observed effect size that matters. Reducing the standard deviation of the data or of the observed effect size might seem hard, but there are ways to do it. These are called variance reduction techniques, and practical examples include paired t-tests and CUPED. One way to reduce the variance of your improvement is to create a one-to-one mapping between samples in the control group and the treatment group; this is precisely what paired t-tests do.
- Reusing samples: An advanced way to optimise the sample size you need is to reuse samples for testing multiple changes. For instance, today a visitor to giants like Amazon or Facebook becomes part of multiple experiments at once. The homepage you land on might be testing five different changes simultaneously, with advanced algorithms in the background mathematically breaking down the impact of each idea separately. This is made possible by a mathematical construct called matrix factorisation, the details of which are outside the scope of this blogpost.
From my understanding of statistical power, I feel that all the different ideas for optimising sample sizes fall into one of these four fundamental brackets. But I accept that the space of unknown unknowns (things I don't know that I don't know) is large, and there might be other shrewd ways to optimise sample sizes. I would be keen to know if there is a fundamental technique I am missing.
What should smaller companies do?
A logical question that follows from the discussion above is: in the absence of high traffic, what practical advice can be derived for smaller companies that want to start experimenting? I would offer these companies the following advice:
- Start with defining metrics: In the absence of traffic, one of the best investments you can make is developing metrics that represent the various processes and goals of your organisation. Starting to think in numbers and tracking important metrics is crucial if you want to use this data for experimentation later.
- Qualitatively test ideas: Once you are adjusted to a metric-driven mindset, it is not very difficult to start qualitatively testing new ideas. While calculating statistical significance for smaller changes might be hard, getting qualitative input on bigger ideas is a crucial part of business and a form of qualitative experimentation. For instance, a restaurant probably cannot check with statistical significance which menu items are driving increased sales. But it is not difficult to gather qualitative feedback on every new dish and quickly replace the items that are not being liked.
- Plan to test bigger changes: As a smaller company, you should accept that testing nuanced changes might be a luxury for now, but larger changes can definitely be tested with a smaller sample size. Bigger changes are easy to find in the beginning, because the initial prototype will be full of bugs and inefficiencies. Once you get into the experimental mindset, look for these low-hanging fruits: they are easier to detect and require smaller sample sizes as well.
- Set up guardrail experimentation: Even as a small company, you can always test some guardrail metrics when you introduce a change. A guardrail metric is a metric that should not degrade when a change is introduced; page load time, for instance, is a good guardrail metric. Guardrail metrics work even with little traffic because a large deterioration in a guardrail can be detected with far fewer samples, while a very nuanced change won't matter much even if it goes undetected.
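The guardrail point is the MDE lever in reverse, and the same sample size formula makes it concrete. The sketch below assumes a two-sample z-test at 5% significance and 80% power, and a hypothetical guardrail metric whose standard deviation equals its mean (coefficient of variation 1.0): a 20% deterioration needs roughly a hundredth of the traffic a 2% drift would.

```python
import math

def samples_to_detect(rel_change, cv=1.0, z_alpha=1.959964, z_beta=0.841621):
    """Per-arm samples to detect a relative change in a guardrail metric.

    cv is the metric's coefficient of variation (sd / mean); the value
    1.0 here is a hypothetical assumption. Two-sample z-test formula,
    5% two-sided significance and 80% power by default.
    """
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * cv ** 2 / rel_change ** 2)

print(samples_to_detect(0.20))  # large deterioration: small sample suffices
print(samples_to_detect(0.02))  # nuanced drift: needs ~100x more visitors
```

This asymmetry is exactly what makes guardrails practical for low-traffic companies: the failures worth catching are the big ones, and big failures are cheap to detect.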
The way ahead
The discussion on sample sizes will keep coming up whenever we discuss experimentation. I hope this article and the previous one give the reader a solid base for understanding sample sizes and what factors you can compromise for a smaller sample size.
In the next article in the statistical nuances series, I will shed light on the MDE-sample size tradeoff that companies can utilise.