For all the things we learned in grad school, Bayesian methods were skimmed over. Strange, too, as we learned all the computational machinery necessary, but we were never actually shown the power of these methods. Let's start our explanation with an example where the Bayesian analysis simply is *more correct* (in the sense of getting the right answer).

### The Table Game

Bob and Alice both approach a table at a casino. The dealer at the table has chosen a number between 0 and 1 that stays fixed for the duration of the game, and is hidden from Alice and Bob. Each round, a new random number between 0 and 1 is produced. Alice scores a point if the random number falls below the dealer's hidden number, and Bob scores a point if the random number is above. The game ends when either Alice or Bob has scored 6 points.

After 8 rounds, Alice has a score of 5 versus Bob's score of 3. Bob, a statistician, asks "What is the probability I win the game, given these two scores?"

**Frequentist Bob thinks:** This is a simple problem. Alice's score is the same as a $Bin( 8, p)$ random variable, where I don't know the parameter $p$. Her score is observed to be $5$. The MLE of $p$ is $\hat{p} = 0.625 \pm 0.28$. In order for me to win the overall game, I must win the next three rounds, else Alice scores her 6th point and wins the game. Thus, $P( \text{Bob wins} ) = ( 1- p)^3$, and by that useful invariance property of MLEs, I can estimate my probability to be $( 1- 5/8)^3 = 0.052$, with 95% confidence interval $( 0.008, 0.28 )$.
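Bob's plug-in arithmetic is easy to check; here's a quick sketch of the calculation above (the confidence interval is omitted):

```python
# Bob's frequentist plug-in estimate
p_hat = 5 / 8                      # MLE of p from Alice's 5 points in 8 rounds
p_bob_wins = (1 - p_hat) ** 3      # Bob must win the next three rounds
print("MLE of p: %.4f" % p_hat)          # 0.6250
print("P(Bob wins): %.4f" % p_bob_wins)  # 0.0527
```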

Let me write some Python to confirm. I'll simulate games by:

- Randomly picking a $p$ with uniform probability.
- Performing 8 rounds. If the score after the 8th round is 5 vs 3 for Alice, simulating 3 more rounds: if any of them are in favour of Alice, she wins; otherwise Bob wins. If the score is not 5 vs 3, discarding the game and starting a new one.
- Computing the proportion of these games that Bob wins versus Alice.

```python
# Alice vs Bob in the table game
import random

max_simulations = 100000
simulation = 0
wins_alice = 0
wins_bob = 0

while simulation < max_simulations:
    # draw a random p
    p = random.random()
    # draw eight rounds
    alice_points = sum(random.random() < p for i in range(8))
    if alice_points == 5:
        simulation += 1
        # this is the 5 vs 3 case after 8 rounds; draw three more to see who wins
        if any(random.random() < p for i in range(3)):
            wins_alice += 1
        else:
            wins_bob += 1

print("Proportion of Alice wins: %.3f." % (wins_alice / max_simulations))  # 0.909
print("Proportion of Bob wins: %.3f." % (wins_bob / max_simulations))      # 0.091
```

Hmm, I get a win probability of 9.1%, almost twice what Bob predicted. Maybe there weren't enough iterations; let me try again.

```python
max_simulations = 1000000
...
print("Proportion of Bob wins: %.3f." % (wins_bob / max_simulations))  # 0.091
```

Ok, so something is wrong with our estimation. The problem is that, regardless of the true value of $p$, the MLE, given a 5 vs 3 game, always returns the same thing. Ask: what are the prior beliefs of Alice and Bob before coming to the table? Both can view $p$ as a random variable: the dealer may have chosen $p$ at random, hence it really is a random variable! But frequentist Bob does not see it this way. In his mind, it is fixed. Let's explore what Bayesian Bob thinks when the game hits 5 vs 3:

**Bayesian Bob thinks:** I really had no idea what $p$ could be before I entered this game, so really to me, the value of $p$ was uniform over $[0,1]$. But now Alice leads 5 to 3. I should update what I think $p$ is after seeing this data. I still think this updated $p$ is a random variable, but **some values of $p$ are more likely than others**. So, the probability I win should be calculated considering all possible values of $p$:
$$ \begin{aligned}
P ( \text{Bob wins} ) &= \int_0^1 (1-p)^3P( p | X = 5, n = 8) \; dp \\
& = E_{P( p | X,n )} [ ( 1 - p)^3 ]
\end{aligned}
$$

This can be computed in closed form, but I'm not teaching a calculus tutorial, so I'll skip it. I'd rather use PyMC, a Python library for performing Bayesian analysis. It's a great, underused library, and I'll go into more detail about it in my next Bayesian post. The code below is pretty self-explanatory, except that we need to employ Markov Chain Monte Carlo. If you are unfamiliar with it, that's okay; you don't need to know it for this, and there are lots of great tutorials available. Basically, what we want is to be able to sample from the posterior distribution, $P( p | X = 5, n = 8 )$. If we have many samples from it, we can do pretty much anything (things I will get into in another post). But for now, let's compute that integral. It can be approximated by the sum: $$ E_{P( p | X,n )} [ ( 1 - p)^3 ] \approx \frac{1}{N} \sum_{i=1}^N (1 - p_i)^3, \; p_i \text{ comes from } P( p | X = 5, n = 8 )$$

```python
import pymc as mc

theta = mc.Uniform("theta", 0, 1)
obs = mc.Binomial("obs", n=8, value=5, p=theta, observed=True)
model = mc.Model({"obs": obs, "theta": theta})

# perform MCMC to generate (100000 - 2000)/2 = 49000 samples
mcmc = mc.MCMC(model)
mcmc.sample(100000, burn=2000, thin=2)

samples = mcmc.trace("theta").gettrace()
# compute the integral above
print("Bayesian Estimate: %.4f." % ((1 - samples) ** 3).mean())  # Bayesian Estimate: 0.0902.
```

Fuck yea. So, aside from simulation error, we arrive at the right answer. FYI, the integral can be calculated in closed form and is equal to $1/11 \approx 0.0909$.
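If you don't trust me on the closed form: with a uniform prior, the posterior after 5 points in 8 rounds is a $Beta(6, 4)$, so the integral is a ratio of Beta functions, $B(6, 7)/B(6, 4)$. A quick sketch verifying the exact value with Python's `fractions`:

```python
from fractions import Fraction
from math import factorial

def beta_fn(a, b):
    # Beta function at integer arguments: B(a, b) = (a-1)! (b-1)! / (a+b-1)!
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))

# E[(1-p)^3] under the Beta(6, 4) posterior is B(6, 4+3) / B(6, 4)
answer = beta_fn(6, 7) / beta_fn(6, 4)
print(answer)  # 1/11
```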

A few other comments: it is meaningless to ask "what is the estimated value of $p$?" in a Bayesian setting. Our analysis did not return an estimate of $p$: it returned a probability distribution. Using the samples from the MCMC, we can see what this distribution looks like.
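For instance, a histogram of the posterior samples. A minimal sketch: since with a uniform prior the posterior here is a $Beta(6, 4)$, drawing from it directly stands in for the MCMC trace above.

```python
import numpy as np
import matplotlib.pyplot as plt

# the posterior P(p | X=5, n=8) under a uniform prior is a Beta(6, 4)
samples = np.random.beta(6, 4, size=50000)

plt.hist(samples, bins=60, density=True, histtype="stepfilled", alpha=0.6)
plt.xlabel("$p$")
plt.ylabel("posterior density")
plt.title("Posterior of $p$ after a 5 vs 3 game")
plt.show()
```

The distribution peaks near $5/8$ (the MLE), but carries all the uncertainty about $p$ that the point estimate throws away.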

### Meta-analysis of the table game

In our Python simulation, we chose the value of $p$ uniformly at random, and for our Bayesian analysis we assumed a uniform prior. Are we cheating? Not really. Regardless of how $p$ is generated, if the game is at 5 vs 3, the frequentist's MLE gives the same result, $\hat{p} = 5/8$. We could generate $p$ differently, say from a distribution where values around 1/2 are more likely, and our Bayesian answer would change only slightly, but would still be more accurate than the MLE answer. On the other hand, if we know (or believe) that $p$ is more likely to be around 1/2, we can update our prior to reflect this.
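A sketch of that claim, with $Beta(2, 2)$ as one hypothetical choice of dealer distribution peaked at 1/2 (a variation on the simulation above, not something computed earlier):

```python
import random

max_conditioned = 50000  # number of 5 vs 3 games to collect
conditioned = 0
wins_bob = 0

while conditioned < max_conditioned:
    # dealer draws p from a Beta(2, 2): values near 1/2 are more likely
    p = random.betavariate(2, 2)
    alice_points = sum(random.random() < p for i in range(8))
    if alice_points == 5:
        conditioned += 1
        if not any(random.random() < p for i in range(3)):
            wins_bob += 1

print("Proportion of Bob wins: %.3f" % (wins_bob / max_conditioned))
```

The simulated answer lands near 0.096: the MLE's 0.052 is still far off, while the uniform-prior Bayesian answer of 0.091 remains close.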

### Conclusion

Some things I want to discuss in the next parts of this series:

- Using Bayesian analysis to optimize loss functions
- Solving the overfitting problem with Bayesian analysis

### Bibliography

- Eddy, Sean R. "What is Bayesian statistics?" *Nature Biotechnology* 22.9 (2004): 1177-1178.

Other articles to enjoy:

- Multi-Armed Bandits
- Machine Learning counter-examples
- How to solve the *Price is Right's* Showdown
- An algorithm to sort "Top" Comments