Reference: This note details the fifth of the seven broader topics of discussion in experimentation. The points listed in this blog will later be converted into their own blog posts.
As your traffic grows through experimentation, you develop a robust infrastructure for data collection and analysis. This advantage in data rarely stops at simple experimentation. As the amount of data you collect increases, a range of advanced experimentation tools becomes accessible to you. You can start to test multiple ideas together (multivariate testing), maximize sales over a short period (multi-armed bandits), and personalize the experience for various user segments (personalization). These tools are more exciting, but also more difficult to handle and use. In this post, I want to give an overview of these advancements.
Data is essentially silent: it just sits there, leaving you to figure out what it means. Data tends to say whatever you make it say. To extract a story from data, algorithms perform one or more of three fundamental operations:
- Segmentation of data: To deduce patterns effectively, you break the data down into segments, which become the axes of your analysis. If you want to study the impact of weekends on sales, you will probably break all sales data into seven buckets, one for each day of the week. Advanced algorithms often use overlapping segments to increase their learning capability.
- Summarisation of randomness: Each meaningful segment has its own randomness, which needs to be summarised into a few concise, readable numbers. Data points belonging to the same segment are summarised into distributions so you can look at the mean, the median, and the percentiles.
- Decisions on learning: Finally, algorithms use that information to make predictions and decisions. There are various ways to combine these segmented distributions into a coherent decision. When your information is based on larger sample sizes (less variance), you can make more confident decisions. However, these decisions go wrong when the future does not mimic the past.
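As a toy illustration of these three operations, here is a minimal Python sketch (the sales numbers are made up) that segments daily sales by weekday, summarises each segment, and makes a decision:

```python
import statistics

# Hypothetical daily sales records: (weekday, sales). Illustrative data only.
records = [
    ("Mon", 100), ("Tue", 110), ("Sat", 180), ("Sun", 170),
    ("Mon", 90), ("Sat", 200), ("Sun", 160), ("Wed", 105),
]

# 1. Segmentation: bucket the data by day of the week.
segments = {}
for day, sales in records:
    segments.setdefault(day, []).append(sales)

# 2. Summarisation: reduce each segment's randomness to a single number.
summary = {day: statistics.mean(values) for day, values in segments.items()}

# 3. Decision: pick the best-performing segment for future action.
best_day = max(summary, key=summary.get)
print(best_day)  # → Sat
```

Real algorithms layer sophistication on each step (overlapping segments, full distributions instead of means, uncertainty-aware decisions), but the skeleton stays the same.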
Experimentation data can be made to say much more than statistical significance alone. But keep in mind that in slicing and dicing the data, you compromise some of the accuracy of what it says. Most algorithms use one or more of the steps listed above to extract interesting insights from data.
In this post, we give an overview of eight of these variations.
8 Interesting Advanced Algorithms in Experimentation
Some of these eight tools can be found in experimentation products, while others are still just concepts. All of them have their own complexities and nuances that make them separate products in their own right.
- Multi-armed Bandits (MABs): The multi-armed bandit algorithm is the closest variation to a real A/B test. An A/B test maintains a 50-50 traffic split between control and variation to reach the most trustworthy insight from the data. A multi-armed bandit, on the other hand, dynamically allocates more traffic to the winning variation to maximize conversions over a short period. MABs are much less trustworthy than A/B tests and hence cannot be used as a replacement for A/B testing. However, when you are trying out different things during a 3-day flash sale or in a rapidly trending market, you do not care whether your learnings hold in the long run. You care about capitalizing on the quick patterns you see in the data and aggressively maximizing conversions; MABs change the decision-making procedure accordingly. Tools like VWO and Optimizely provide multi-armed bandits in their suites.
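A minimal simulation of the idea, assuming an epsilon-greedy allocation strategy (one common bandit scheme; vendor implementations differ and often use Thompson sampling instead). The conversion rates are made up:

```python
import random

def epsilon_greedy_bandit(true_rates, steps=10_000, epsilon=0.1, seed=0):
    """Dynamically shift traffic toward the better-converting arm.

    true_rates are the conversion rates (unknown to the algorithm);
    this is a toy simulation, not any vendor's implementation.
    """
    rng = random.Random(seed)
    pulls = [0] * len(true_rates)        # traffic sent to each arm
    conversions = [0] * len(true_rates)
    for _ in range(steps):
        if rng.random() < epsilon:       # explore: pick a random arm
            arm = rng.randrange(len(true_rates))
        else:                            # exploit: best observed arm so far
            arm = max(range(len(true_rates)),
                      key=lambda a: conversions[a] / pulls[a] if pulls[a] else 0.0)
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:
            conversions[arm] += 1
    return pulls, conversions

pulls, conversions = epsilon_greedy_bandit([0.02, 0.20])
# The better arm (index 1) ends up receiving most of the traffic.
```

Note how this trades trustworthiness for conversions: the losing arm receives so little traffic that its conversion-rate estimate stays noisy, which is exactly why MABs are not a substitute for an A/B test.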
- Multi-variate tests (MVTs): Multivariate tests try different combinations of ideas and tell you which combination works best. So, if you have ten products and you want to see which combinations go best together, you can run an MVT. A basic MVT segments your traffic across all the different combinations of products (55 in this case) and shows you which ones perform best. However, MVTs consume a lot of traffic because they learn separately in each bucket. Advanced statistical techniques like matrix factorization often help MVTs learn exponentially faster by creating overlapping segments in the data. Further, MVTs often combine this segmentation with dynamic traffic allocation (what MABs do) to quickly prune non-performing combinations. Currently, products like Intellimize and VWO provide this functionality.
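The traffic cost of an MVT comes from the number of cells it must fill. A small sketch with hypothetical page elements, showing how the cell count of a full-factorial MVT multiplies:

```python
from itertools import product

# Hypothetical page elements to combine in a full-factorial MVT.
headlines = ["Save big", "New arrivals", "Free shipping"]
button_colors = ["green", "red"]

combinations = list(product(headlines, button_colors))
print(len(combinations))  # → 6 cells, each needing enough traffic on its own
```

With each added element the number of cells multiplies, so the traffic available per cell shrinks fast; this is the problem that overlapping-segment techniques like matrix factorization try to soften.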
- Personalization: Personalizing your user experience for your major audience segments is a powerful technique for driving conversion rates up. Imagine you run an experiment on your website where you try out an English (control) and a Spanish (variation) banner on your homepage. Suppose you see a 10% conversion rate on the English banner and an 8% conversion rate on the Spanish banner. An A/B testing engine will declare the English banner the winner and divert all traffic to it (getting you a 10% conversion rate). A personalisation engine, however, would learn that users from the US prefer the English version whereas users from Spain prefer the Spanish version, and would show each country its preferred variation, potentially getting you a blended conversion rate well above 10%. Personalisation segments the learning process across various dimensions of customer attributes and takes decisions accordingly. The most popular personalisation approaches today are rule-based personalisation engines (which divert traffic on defined rules), AI personalisation (which automatically learns the rules from customer preferences), and recommendation systems (which automatically match content to customer attributes). Tools like AB Tasty, Intellimize and VWO provide these capabilities.
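A rule-based personalisation engine can be sketched in a few lines; the country-to-banner rules below are illustrative, mirroring the language example above:

```python
# A minimal rule-based personalisation engine. The rules here are
# hand-defined; an AI personalisation engine would learn them instead.
RULES = {"US": "english_banner", "ES": "spanish_banner"}
DEFAULT = "english_banner"  # fallback for users matching no rule

def choose_variant(user):
    """Return the banner to show, based on the user's country attribute."""
    return RULES.get(user.get("country"), DEFAULT)

print(choose_variant({"country": "ES"}))  # → spanish_banner
print(choose_variant({"country": "FR"}))  # → english_banner (fallback)
```

The rule table is the entire "model" here, which is both the appeal (transparent, debuggable) and the limitation (someone has to know the right rules) of rule-based personalisation.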
- Interesting Segments: Interesting Segments is the precursor to personalisation. Often in an A/B test, an algorithm can slice and dice your data to show you the major audience segments and the segments that are impacted differently by various ideas. This slicing and dicing after the experiment data has been collected is very useful for getting deeper insights into your audience, but the causal accuracy of the insight is compromised to a great extent. Hence, insights observed from an Interesting Segments algorithm need to be confirmed with a proper A/B test. Note that Interesting Segments is a statistical inference tool, not a tool that automatically makes decisions for you; personalisation tools, on the other hand, are designed to be much more careful about taking decisions on data. Tools like Mixpanel provide this functionality.
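The slicing-and-dicing step can be sketched as follows, with made-up experiment data. A segment whose lift diverges from the overall result is a candidate "interesting segment" to confirm with a follow-up A/B test:

```python
# Post-hoc segment analysis on illustrative experiment data:
# (segment, variant, visitors, conversions) -- numbers are made up.
rows = [
    ("mobile", "control", 1000, 50), ("mobile", "variation", 1000, 80),
    ("desktop", "control", 1000, 60), ("desktop", "variation", 1000, 55),
]

def lift_by_segment(rows):
    """Conversion-rate lift of variation over control, per segment."""
    rates = {}
    for seg, variant, visitors, convs in rows:
        rates.setdefault(seg, {})[variant] = convs / visitors
    return {seg: r["variation"] - r["control"] for seg, r in rates.items()}

lifts = lift_by_segment(rows)
print(lifts)
# mobile looks strongly positive while desktop is slightly negative:
# an "interesting segment" worth testing properly before acting on it.
```

Because the segments were chosen after seeing the data, any of these differences could be noise, which is why the post-hoc insight only generates a hypothesis rather than a decision.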
- Narrative Testing: Sometimes a change made on one page in the funnel affects an experiment being run on a different page. For instance, suppose you are running a pricing experiment on the product page ($9 vs. $11) to find the price that maximizes conversion rate. Now imagine someone runs an experiment on the homepage advertising free delivery above a certain price point (say $10). The experiment on the pricing page will be biased by this extra information on the homepage: many users who land in the $9 variation might not convert because, at that price, they would pay an extra delivery fee. Narrative Testing is a mathematical concept that helps you connect multiple experiments together and see the impact of different narratives (combinations of variations shown across experiments). (Honestly, narrative testing is a concept I researched independently, and I haven't seen any product implementations of it to date. However, I have seen many experimentation problems that fit the narrative testing framework.)
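Since no implementation exists, here is only a hypothetical sketch of the framing: a "narrative" is one combination of variations a user can see across concurrently running experiments, and conversions would be attributed per narrative rather than per experiment:

```python
from itertools import product

# Hypothetical concurrently running experiments (names are illustrative).
experiments = {
    "homepage": ["no_banner", "free_delivery_over_10"],
    "pricing": ["$9", "$11"],
}

# Each narrative is one combination of variations across experiments.
narratives = list(product(*experiments.values()))
print(narratives)
# Conversions would be attributed to each of these 4 narratives, so the
# ($9, free-delivery-threshold) interaction becomes visible instead of
# silently biasing the pricing experiment.
```

This is essentially an MVT run across experiments rather than within one page, and it inherits the same traffic cost: the number of narratives multiplies with every concurrent experiment.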
- Meta-analysis: It is well known in the experimentation community that a single experiment often leads to false positives and wrong conclusions; many scientific studies have been found to be erroneous and unreplicable. The solution is meta-analysis. A meta-analysis tool helps you connect many experiments together and shows you the statistical significance of an idea across all of its instances. For instance, if you have an idea that a green button elicits more sales than a red button, you can test it at multiple places on your website and connect all these experiments in a meta-analysis to see how often the idea was found to be statistically significant. Meta-analysis is the only tool in this list with higher trustworthiness than simple A/B testing itself. However, meta-analysis is harder to conduct, and I am not aware of any third-party products that provide a generic tool for it.
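One standard way to pool experiments is Stouffer's method for combining one-sided p-values (a sketch of one meta-analysis technique, not the only one; the p-values below are illustrative):

```python
import math
from statistics import NormalDist

def stouffer_combined_p(p_values):
    """Combine one-sided p-values from repeated tests of the same idea
    (Stouffer's Z method): convert each p to a z-score, average, and
    convert back to a p-value."""
    nd = NormalDist()
    z = sum(nd.inv_cdf(1 - p) for p in p_values) / math.sqrt(len(p_values))
    return 1 - nd.cdf(z)

# Three tests of the same "green button" idea at different places on the
# site: none is individually significant at 0.05, but pooled together
# the evidence is (numbers are made up for illustration).
combined = stouffer_combined_p([0.08, 0.06, 0.09])
print(combined)
```

The pooling only makes sense when the experiments genuinely test the same idea under comparable conditions; mixing unrelated tests is how meta-analyses go wrong.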
- Uncontrolled Studies: Many times, a controlled experiment is not possible. For instance, if you want to try out new pricing for your product, it might be unethical to sell the product at different prices to different customers. Various forms of uncontrolled studies help you establish some level of causality in the absence of a controlled experiment. For instance, an interrupted time series proposes the following: introduce the change for a short period and then roll it back, repeating the procedure three or four times. If every time you make the change you also see a change in the desired goal metric, you can establish some level of causality between the change and the goal. Formal mathematics helps you analyse this data and declare a level of statistical significance, but an interrupted time series still carries much lower confidence because it is not conducted in a controlled environment. Other types of uncontrolled studies are regression-discontinuity designs and quasi-experimental designs.
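The core of an interrupted time series can be sketched as comparing the metric during "on" and "off" windows across repeated toggles (numbers are made up; a real analysis would also model trend and seasonality):

```python
from statistics import mean

# Illustrative daily conversions, with the change repeatedly switched
# on and off over several toggles.
series = [
    ("off", 100), ("off", 98), ("on", 115), ("on", 118),
    ("off", 101), ("on", 117), ("off", 99), ("on", 116),
]

on = [value for state, value in series if state == "on"]
off = [value for state, value in series if state == "off"]
effect = mean(on) - mean(off)
print(effect)  # → 17.0, the effect size suggested by the toggles
```

If the jump appears at every toggle, a coincidental external cause becomes less plausible with each repetition; that repetition, not any single window, is what lends the design its (limited) causal weight.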
- Causal Inference: The field of causality is much deeper, and I will dedicate an entire separate thread to the topic. The objective of causal inference is to establish causality from purely observational data (which is currently not possible). Statistics gives us the power to find correlations among variables, but figuring out whether those correlations are actual causal links requires running an experiment. The dream of causal inference is a reliable procedure that establishes causality without running an experiment. Startups such as Causal Lens are making attempts along these lines, but to the best of my knowledge this is not yet possible in fundamental statistics.
The way ahead
Some of these tools, I believe, give an insight into what the future holds for experimentation. But it must be noted that many of these advanced threads will be tested by time, and some of them will rightfully be rejected as ineffective fads. At the same time, I expect other threads of experimentation to emerge, and as I learn of more of these variations, I will add them to this list.
In later blogs, I will detail each of the above topics. My aim will be to explain the algorithmic variations behind these tools and the practical use cases where they prove useful.