Reference to past blog: In a previous blog, we detailed and completed the story of a Frequentist Hypothesis Test which is the fundamental pivot of the entire Frequentist ideology. We now move towards the question of how Bayesians solved the problem that Frequentists could not.
We are now approaching the most interesting debate in statistics: Frequentists versus Bayesians. While I will explain both ideologies in detail in the next three blogs, I will devote this blog to explaining the core insight of the Bayesians that led to the solving of backward probabilities. Through the lens of forward and backward probabilities, it becomes fairly straightforward to break down the differences between Frequentists and Bayesians. (I explained the concept of forward and backward probabilities in this post (second section).)
Bayesian Statistics is the branch of statistics that evolved out of a simple discovery: the Bayes Rule. I understood the Bayes Rule through a simple tree structure that sums up the entire story.
Imagine a jury that has a 20% false positive rate (1 in 5 times, it convicts an innocent person) and an 18% false negative rate (18 in 100 times, it acquits a guilty person). Moreover, among the people who face the jury, 85% are actually guilty. The question: if the jury convicts someone of a crime, what is the chance that the person is actually innocent? Notice that this is a backward probability question.
The Bayes Rule
Observe that this is a query in backward probability, as it asks not about the effect but about the causes of the effect. Had it been a forward probability question, such as "what is the probability that a person was innocent and got convicted?", you would have multiplied the probabilities in a chain and the answer would have been 0.15*0.2 = 0.03, which is 3%. Note that this is exactly the p-value analog of this question. In other words, "What is the probability that the person would have been convicted (probability of the observed event) assuming that he is innocent (the null hypothesis)?"
However, the backward probability of the person being innocent (cause) given that he was convicted (effect) is a bit different. To measure it, you need to know the total probability of the person being convicted. Note that there are two cases:
1. Null Hypothesis: If the person was innocent (15%), then he would be convicted 20% of the time = 0.15*0.20 = 0.03 = 3%.
2. Alternate Hypothesis: If the person was guilty (85%), then he would be convicted 82% of the time = 0.85*0.82 = 0.697 = 69.7%.
Observe that, contrary to the intuition behind frequentist statistics, the probabilities under the null hypothesis and the alternate hypothesis do not necessarily sum to 1.
The sum of 3% and 69.7% makes up the denominator in the Bayes Rule: 72.7%. Essentially, this is the total probability of the observed event happening (which can be < 1). Now, if there was a 72.7% chance of conviction, then in 3 out of those 72.7 cases the person was innocent. So, the probability that a convicted person was actually innocent is 3/72.7 = 4.13%. This, simply, is the Bayes Rule. Notice two things about backward probabilities.
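The jury calculation above can be sketched in a few lines of code. The numbers are the ones from the example; the variable names are mine.

```python
# Priors and error rates from the jury example above
p_innocent = 0.15                 # person facing the jury is innocent
p_guilty = 0.85                   # person facing the jury is guilty
p_convict_given_innocent = 0.20   # false positive rate
p_convict_given_guilty = 0.82     # 1 - 0.18 false negative rate

# Forward probability of each thread that ends in a conviction
innocent_and_convicted = p_innocent * p_convict_given_innocent  # 0.03
guilty_and_convicted = p_guilty * p_convict_given_guilty        # 0.697

# Total probability of the observed event: the denominator in the Bayes Rule
p_convicted = innocent_and_convicted + guilty_and_convicted     # 0.727

# Backward probability: innocent given convicted
posterior_innocent = innocent_and_convicted / p_convicted
print(round(posterior_innocent, 4))  # → 0.0413
```

Note that the forward probability (3%) needed only its own thread, while the backward probability needed every thread in the denominator.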
- Backward probabilities are exactly proportional to their forward probability counterparts. They just carry a normalising constant, the total forward probability, in the denominator, which is not always equal to 1. Hence, forward probabilities are indicative of backward probabilities. However, even when the p-value is less than 0.05, the forward probability from the alternate hypothesis might be lower still, making the null hypothesis the more likely cause.
- Backward probabilities are hard because they need the sum of all the threads that lead to the observed event. Notice that forward probabilities were exclusive: you could calculate the probability of one thread of events without knowing the others (for instance, innocent-and-convicted can be calculated directly as a forward probability without any knowledge of guilty-and-convicted). This is not possible with backward probabilities. You need to know the probability of all the causes (innocent and guilty) that could have led to the observed effect (conviction) and then look at the proportion contributed by the cause in question (innocent). This is what kept the development of backward probabilities at bay even though the Bayes Rule was discovered much earlier, in 1763.
Calculating all the threads leading to the observed effect is hard because you need to start with a base model that defines the probability of every alternative. In this particular question, you would never know the true distribution of innocent and guilty people in the real world. Hence, Bayesians reversed the flow of information and invented an idea that was pure genius.
Priors - the heart of Bayesian
In the simplest words I can manage: Bayesians tied the tongue of the dog back to its tail. They realised that with the knowledge of all threads, backward probability calculations would give not just a point probability estimate of one thread, but the entire distribution over all threads in one go.
The power of backward probabilities is that they live, by nature, one level above forward probabilities: backward probabilities are not point estimates but entire distributions. The difference stems from the fact that forward probabilities can be calculated in isolation from other possibilities, but Bayesian probabilities cannot. They called this distribution of backward probabilities learned from data the Posterior Distribution.
But a question still remained: to calculate the posterior distribution, they needed to start with some prior information that could seed the first iteration without biasing the results. This prior information was hard to know and had to be learned from the data itself.
So they tied the tongue back to its tail by creating a flat, uninformative prior distribution to start the first iteration of the loop. The Prior Distribution became the pinnacle of the Bayesian method and filled the gap that was blocking Frequentists from calculating the p-value under the alternate hypothesis. The vertical distribution on the left is the prior in the image.
Just as shown in the figure, Bayesians filled in a model that defines the probabilities of all hypotheses at once, and then reversed the story.
Bayesians do not try to predict the probability of the observed data. Instead, they keep a prior belief that is not based on data and then iteratively update their beliefs as new information comes along. The posterior from one iteration becomes the prior for the next.
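This posterior-becomes-prior loop can be sketched with a standard Beta-Binomial model for a conversion rate. The batch numbers below are hypothetical; the point is only that each update feeds the next.

```python
# Flat, uninformative prior: Beta(1, 1) is uniform on [0, 1]
alpha, beta = 1, 1

# Hypothetical batches of data as (successes, trials)
batches = [(12, 100), (9, 80), (15, 120)]

for successes, trials in batches:
    # Conjugate Bayes Rule update: the posterior Beta(alpha + s, beta + n - s)
    # becomes the prior for the next batch
    alpha += successes
    beta += trials - successes

# Posterior mean after seeing all the data
posterior_mean = alpha / (alpha + beta)
print(alpha, beta, round(posterior_mean, 4))  # → 37 265 0.1225
```

The "complex maths" collapses to two additions per batch, which is exactly the kind of simple update equation mentioned below.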
The way ahead
In simple words, Bayesian A/B tests start with a flat, uninformative prior and then update it with data using the Bayes Rule. It sounds like complex mathematical hocus-pocus, but it usually boils down to very simple update equations in practice. Bayesians then use these updated posteriors to answer all sorts of backward and forward probability questions they can think of, such as: what is the probability that the variation is better than the control (backward)? What is the loss I might face if I go with the control (forward)? And so on.
The next note will be short and will run you through the architecture of the Bayesian counterpart to Frequentist Hypothesis Tests.
Post Notes: I apologise for the heavy discussion today. I would like to point out that the actual events might not have happened exactly the way I have narrated them. But this is the most logical path I could make out of my understanding of why Bayesians did what they did. Please do point out if there is something wrong in my understanding.