Multi-Armed Bandits

Preface: This example is a (greatly modified) excerpt from the open-source book Bayesian Methods for Hackers, currently being developed on GitHub.

Adapted from an example by Ted Dunning of MapR Technologies

The Multi-Armed Bandit Problem

Suppose you are faced with $N$ slot machines (colourfully called multi-armed bandits). Each bandit has an unknown probability of distributing a prize (assume for now the prizes are the same for each bandit; only the probabilities differ). Some bandits are very generous, others not so much. Of course, you don't know what these probabilities are. By choosing only one bandit per round, our task is to devise a strategy that maximizes our winnings.

Of course, if we knew the bandit with the largest probability, then always picking this bandit would yield the maximum winnings. So our task can be phrased as "Find the best bandit, and as quickly as possible".

The task is complicated by the stochastic nature of the bandits. A suboptimal bandit can return many winnings, purely by chance, which would make us believe that it is a very profitable bandit. Similarly, the best bandit can return many duds. Should we keep trying losers then, or give up?

A more troublesome problem is: if we have found a bandit that returns pretty good results, do we keep drawing from it to maintain our pretty good score, or do we try other bandits in hopes of finding an even-better bandit? This is the exploration vs. exploitation dilemma.

Applications

The Multi-Armed Bandit problem at first seems very artificial, something only a mathematician would love, but that is only before we address some applications:

  • Internet display advertising: companies have a suite of potential ads they can display to visitors, but the company is not sure which ad strategy to follow to maximize sales. This is similar to A/B testing, but has the added advantage of naturally minimizing strategies that do not work (and generalizes to A/B/C/D... strategies)
  • Ecology: animals have a finite amount of energy to expend, and following certain behaviours has uncertain rewards. How does the animal maximize its fitness?
  • Finance: which stock option gives the highest return under time-varying return profiles?
  • Clinical trials: a researcher would like to find the best treatment, out of many possible treatments, while minimizing losses.

Many of the questions above are fundamental to their application's field.

It turns out the optimal solution is incredibly difficult to find, and it took decades for an overall solution to be developed. There are also many approximately optimal solutions that are quite good. The one I wish to discuss is one of the few solutions that can scale incredibly well. The solution is known as Bayesian Bandits.

A Proposed Solution

Any proposed strategy is called an online algorithm (not in the internet sense, but in the continuously-being-updated sense), and more specifically a reinforcement learning algorithm. The algorithm starts in an ignorant state, where it knows nothing, and begins to acquire data by testing the system. As it acquires data and results, it learns what the best and worst behaviours are (in this case, it learns which bandit is the best). With this in mind, perhaps we can add an additional application of the Multi-Armed Bandit problem:

  • Psychology: how do punishment and reward affect our behaviour? How do humans learn?

The Bayesian solution begins by assuming priors on the probability of winning for each bandit. In our vignette we assumed complete ignorance of these probabilities, so a very natural prior is the flat prior over 0 to 1. The algorithm proceeds as follows:

For each round,
  1. Sample a random variable $X_b$ from the prior of bandit $b$, for all $b$.
  2. Select the bandit with largest sample, i.e. select bandit $B = \text{argmax}\;\; X_b$.
  3. Observe the result of pulling bandit $B$, and update your prior on bandit $B$.
  4. Return to 1.

That's it. Computationally, the algorithm involves sampling from $N$ distributions. Since the initial priors are $\text{Beta}(\alpha =1,\beta =1)$ (a uniform distribution), and the observed result $X$ (a win or loss, encoded 1 and 0 respectively) is Bernoulli, the posterior is a $\text{Beta}( \alpha = 1 + X, \beta = 1 + 1 - X)$ (see here for why this is true).

To answer a question from before, this algorithm suggests that we should not discard losers, but we should pick them at a decreasing rate as we gather confidence that there exist better bandits. This follows because there is always a non-zero chance that a loser will achieve the status of $B$, but the probability of this event decreases as we play more rounds (see figure below).

Below is an implementation of the Bayesian Bandits strategy (which the less Pythonically inclined can skip).
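A minimal sketch of such an implementation using NumPy (the class name, the hidden `true_probs` array, and the round count are hypothetical, chosen to match the figure's probabilities below):

```python
import numpy as np

class BayesianBandits:
    """Thompson sampling over N Bernoulli bandits with Beta(1, 1) priors."""

    def __init__(self, true_probs, rng=None):
        self.true_probs = np.asarray(true_probs, dtype=float)  # hidden from the strategy
        n = len(self.true_probs)
        self.wins = np.zeros(n)    # observed wins per bandit
        self.trials = np.zeros(n)  # observed pulls per bandit
        self.rng = rng if rng is not None else np.random.default_rng()

    def pull(self):
        # 1. Sample X_b from each bandit's Beta posterior.
        samples = self.rng.beta(1 + self.wins, 1 + self.trials - self.wins)
        # 2. Select the bandit with the largest sample.
        choice = int(np.argmax(samples))
        # 3. Observe the result and update that bandit's posterior counts.
        result = self.rng.random() < self.true_probs[choice]
        self.wins[choice] += result
        self.trials[choice] += 1
        return choice, bool(result)

bandits = BayesianBandits([0.85, 0.60, 0.75], rng=np.random.default_rng(0))
for _ in range(1000):
    bandits.pull()
```

After enough rounds, most pulls concentrate on the bandit with the largest hidden probability, while the posteriors of clearly inferior bandits stay wide.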

Below we present a visualization of the algorithm sequentially learning the solution. In the figure below, the dashed lines represent the true hidden probabilities, which are [0.85, 0.60, 0.75] (this can be extended to many more dimensions, but the figure suffers, so I kept it at 3).

Note that we don't really care how accurate our inference of the hidden probabilities becomes -- for this problem we are more interested in choosing the best bandit (or, more accurately, becoming more confident in choosing the best bandit). For this reason, the distribution of the red bandit is very wide (representing ignorance about what that hidden probability might be), but we are reasonably confident that it is not the best, so the algorithm chooses to ignore it.

Below is a D3 demonstration of this updating/learning process. The demonstration uses only three arms, each with a small probability of reward (randomly generated). The first figure shows the pull/reward counts, and the second figure shows the posterior distributions of each bandit.


(The interactive demo displays three live counters, Rewards, Pulls, and the Reward/Pull Ratio, all starting at 0.)

Deviations of the observed ratio from the highest probability are a measure of performance. For example, in the long run we can optimally attain a reward/pull ratio equal to the maximum bandit probability. Long-term realized ratios below the maximum represent inefficiencies. (Realized ratios above the maximum probability are due to randomness, and will eventually fall below it.)

A Measure of Good

We need a metric to measure how well a strategy is doing. Recall that the absolute best we can do is to always pick the bandit with the largest probability of winning. Denote this best bandit's probability by $w^*$. Our score should be relative to how well we would have done had we chosen the best bandit from the beginning. This motivates the total regret of a strategy, defined:

\begin{align} R_T & = \sum_{i=1}^{T} \left( w^* - w_{B(i)} \right) \\ & = Tw^* - \sum_{i=1}^{T} \; w_{B(i)} \end{align}

where $w_{B(i)}$ is the probability of a prize of the chosen bandit in the $i$th round. A total regret of 0 means the strategy is matching the best possible score. This is likely not attainable, as initially our algorithm will often make the wrong choice. Ideally, a strategy's total regret should flatten as it learns the best bandit. (Mathematically, this means we achieve $w_{B(i)} = w^*$ often.)
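Given the sequence of chosen bandits, total regret is a one-liner (a sketch; the `total_regret` helper and the toy numbers are hypothetical):

```python
import numpy as np

def total_regret(probs, choices):
    """T * w_star minus the sum of the chosen bandits' win probabilities."""
    probs = np.asarray(probs, dtype=float)
    w_star = probs.max()
    return len(choices) * w_star - probs[np.asarray(choices)].sum()

# Always picking the best bandit (index 0) gives zero regret:
print(total_regret([0.75, 0.5], [0, 0, 0, 0]))  # 0.0
# Each suboptimal pick of bandit 1 costs (0.75 - 0.5):
print(total_regret([0.75, 0.5], [0, 1, 1, 0]))  # 0.5
```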

Below we plot the total regret of this simulation, including the scores of some other strategies:

  1. Random: randomly choose a bandit to pull. If you can't beat this, just stop.
  2. Largest Bayesian credible bound: pick the bandit with the largest upper bound in its 95% credible region of the underlying probability.
  3. Bayes-UCB algorithm: pick the bandit with the largest *score*, where the score is a dynamic quantile of the posterior (see [4]).
  4. Mean of posterior: choose the bandit with the largest posterior mean. This is what a human player (sans computer) would likely do.
  5. Largest proportion: pick the bandit with the current largest observed proportion of winning.

To be more scientific so as to remove any possible luck in the above simulation, we should instead look at the expected total regret:

$$\bar{R}_T = E[ R_T ] $$

It can be shown that any sub-optimal strategy's expected total regret is bounded below logarithmically. Formally,

$$ E[R_T] = \Omega \left( \;\log(T)\; \right)$$

Thus, any strategy that matches logarithmic-growing regret is said to "solve" the Multi-Armed Bandit problem [1].

Using the Law of Large Numbers, we can approximate Bayesian Bandit's expected total regret by performing the same experiment many times (25 times, to be fair):
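That averaging can be sketched as follows (a self-contained re-implementation for illustration; the function names, horizon, and run count are hypothetical):

```python
import numpy as np

def thompson_choices(probs, T, rng):
    """One Thompson-sampling run; returns the arm chosen in each round."""
    probs = np.asarray(probs, dtype=float)
    wins, trials = np.zeros(len(probs)), np.zeros(len(probs))
    choices = np.empty(T, dtype=int)
    for t in range(T):
        choice = int(np.argmax(rng.beta(1 + wins, 1 + trials - wins)))
        result = rng.random() < probs[choice]
        wins[choice] += result
        trials[choice] += 1
        choices[t] = choice
    return choices

def expected_regret(probs, T=500, n_runs=25, seed=0):
    """Law-of-Large-Numbers estimate: average total regret over n_runs runs."""
    rng = np.random.default_rng(seed)
    probs = np.asarray(probs, dtype=float)
    w_star = probs.max()
    regrets = [T * w_star - probs[thompson_choices(probs, T, rng)].sum()
               for _ in range(n_runs)]
    return float(np.mean(regrets))

print(expected_regret([0.85, 0.60, 0.75]))
```

The estimate's noise shrinks as `n_runs` grows; 25 runs is enough to see the logarithmic flattening without waiting all day.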

Extending the algorithm

Because of the algorithm's simplicity, it is easy to extend. Some possibilities:

  1. If interested in the minimum probability (e.g. where prizes are a bad thing), simply choose $B = \text{argmin} \; X_b$ and proceed.

  2. Adding learning rates: Suppose the underlying environment may change over time. Technically, the standard Bayesian Bandit algorithm would update itself (awesome) by learning that what it thought was the best bandit is starting to fail more often. But we can also encourage the algorithm to learn changing environments more quickly, simply by adding a rate term when updating:

    self.wins[choice] = rate * self.wins[choice] + result
    self.trials[choice] = rate * self.trials[choice] + 1

If rate < 1, the algorithm will forget its previous results more quickly, and there will be a downward pressure towards ignorance. Conversely, setting rate > 1 implies the algorithm will act more riskily: it will bet on early winners more often and be more resistant to changing environments.

  3. Hierarchical algorithms: We can set up a Bayesian Bandit algorithm on top of smaller bandit algorithms. Suppose we have $N$ Bayesian Bandit models, each varying in some behaviour (for example, different rate parameters, representing varying sensitivity to changing environments). On top of these $N$ models sits another Bayesian Bandit learner that selects a sub-Bayesian Bandit. The chosen sub-Bayesian Bandit then makes its internal choice of which machine to pull. The super-Bayesian Bandit updates itself according to whether the sub-Bayesian Bandit was correct or not.
  4. Extending the rewards, denoted $y_a$ for bandit $a$, to random variables from a distribution $f_{y_a}(y)$ is straightforward. More generally, this problem can be rephrased as "Find the bandit with the largest expected value", as playing the bandit with the largest expected value is optimal. In the case above, $f_{y_a}$ was Bernoulli with probability $p_a$, hence the expected value of a bandit is equal to $p_a$, which is why it looked like we were aiming to maximize the probability of winning. If $f$ is not Bernoulli but is non-negative, which can be accomplished a priori by shifting the distribution (we assume we know $f$), then the algorithm behaves as before. For each round,

  1. Sample a random variable $X_b$ from the prior of bandit $b$, for all $b$.
  2. Select the bandit with largest sample, i.e. select bandit $B = \text{argmax}\;\; X_b$.
  3. Observe the result, $R \sim f_{y_a}$, of pulling bandit $B$, and update your prior on bandit $B$.
  4. Return to 1.
The issue lies in the $X_b$ sampling phase. With Beta priors and Bernoulli observations, we have a Beta posterior, which is easy to sample from. But with arbitrary distributions $f$ we have a non-trivial posterior, and sampling from it can be difficult. For the special case when the reward distributions are confined to $[0,1]$, [5] provides a general algorithm:
  1. Sample a random variable $X_b$ from the prior of bandit $b$, for all $b$.
  2. Select the bandit with largest sample, i.e. select bandit $B = \text{argmax}\;\; X_b$.
  3. Observe the result, $R \sim f_{y_a}$, $R \in [0,1]$, of pulling bandit $B$.
  4. Perform a Bernoulli trial with success probability $R$ and observe output $r$.
  5. Update posterior $B$ with observation $r$.
  6. Return to 1.
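A sketch of this $[0,1]$-reward variant, with hypothetical reward distributions standing in for $f_{y_a}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical reward distributions, each confined to [0, 1].
reward_dists = [
    lambda: rng.uniform(0.0, 0.9),  # bandit 0, mean 0.45
    lambda: rng.beta(2, 5),         # bandit 1, mean ~0.29
    lambda: rng.uniform(0.4, 1.0),  # bandit 2, mean 0.70
]

n = len(reward_dists)
wins, trials = np.zeros(n), np.zeros(n)

for _ in range(1000):
    # Steps 1-2: sample from each Beta posterior, select the argmax.
    choice = int(np.argmax(rng.beta(1 + wins, 1 + trials - wins)))
    # Step 3: observe a [0, 1]-valued reward R.
    R = reward_dists[choice]()
    # Step 4: Bernoulli trial with success probability R.
    r = rng.random() < R
    # Step 5: update bandit `choice`'s posterior with the binary outcome.
    wins[choice] += r
    trials[choice] += 1
```

The Bernoulli trial converts an arbitrary $[0,1]$ reward into a binary observation with the same expectation, so the Beta posterior machinery carries over unchanged.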

Commenting systems

There has been some interest in extending the Bayesian Bandit algorithm to commenting systems. Recall that in a previous post, we developed a ranking algorithm based on the Bayesian lower bound of the proportion of upvotes to total votes. One problem with this approach is that it biases the top rankings towards older comments, since older comments naturally have more votes (and hence the lower bound is tighter to the true proportion). This creates a positive feedback cycle where older comments gain more votes, hence are displayed more often, hence gain more votes, and so on. This pushes any new, potentially better comments towards the bottom. J. Neufeld proposes a system that remedies this using a Bayesian Bandit solution.

His proposal is to consider each comment as a bandit, with the number of pulls equal to the number of votes cast and the number of rewards equal to the number of upvotes, hence creating a $\text{Beta}(1+U,1+D)$ posterior. As visitors visit the page, samples are drawn from each bandit/comment, but instead of displaying only the comment with the $\max$ sample, all comments are ranked according to the ranking of their respective samples. From J. Neufeld's blog [7]:

[The] resulting ranking algorithm is quite straightforward, each time the comments page is loaded, the score for each comment is sampled from a $\text{Beta}(1+U,1+D)$, comments are then ranked by this score in descending order... This randomization has a unique benefit in that even untouched comments $(U=1,D=0)$ have some chance of being seen even in threads with 5000+ comments (something that is not happening now), but, at the same time, the user is not likely to be inundated with rating these new comments.
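A sketch of that ranking rule (the `rank_comments` function and the toy vote counts are hypothetical):

```python
import numpy as np

def rank_comments(upvotes, downvotes, rng=None):
    """Rank comments best-first by one Beta(1+U, 1+D) sample per comment.

    Re-called on every page load, so the ranking is re-randomized each time."""
    rng = rng if rng is not None else np.random.default_rng()
    scores = rng.beta(1 + np.asarray(upvotes), 1 + np.asarray(downvotes))
    return np.argsort(-scores)  # comment indices, descending by sampled score

# Three comments: a well-liked old one, a disliked one, and a brand-new one.
print(rank_comments([40, 5, 0], [3, 20, 0], rng=np.random.default_rng(0)))
```

The brand-new comment's score is drawn from a uniform posterior, so it occasionally samples high enough to be shown near the top, exactly the exposure the lower-bound ranking denies it.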

Conclusion

The Bayesian Bandit solution is a novel approach to a classic AI problem. It also scales really well as the number of possibilities increases. There has been some interesting work on using Bayesian Bandits with independent variables in online advertising, funded mostly by Google and Yahoo (papers cited below).

Continue the discussion at Hacker News and Reddit.

References

  1. Kuleshov, Volodymyr, and Doina Precup. "Algorithms for the multi-armed bandit problem." Journal of Machine Learning Research (2000): 1-49. Print.
  2. Davidson-Pilon, Cameron, et al. "Probabilistic Programming and Bayesian Methods for Hackers: Using Python and PyMC." (2013).
  3. Scott, Steven L. "A modern Bayesian look at the multi-armed bandit." Applied Stochastic Models in Business and Industry 26 (2010): 639–658. Print.
  4. Dunning, Ted. "Bayesian Bandits." Surprise and Coincidence - Musings from the Long Tail. Blogger, 18 Feb 2012. Web. 6 Apr. 2013.
  5. Agrawal, Shipra, and Navin Goyal. "Analysis of Thompson Sampling for the Multi-Armed Bandit Problem." (2012). Web. 15 Apr. 2013.


    Other articles to enjoy:

    Follow me on Twitter at cmrn_dp


    All Blog Articles

    Generating exponential survival data

    March 02th, 2014

    TLDR: Suppose we interested in generating exponential survival times with scale parameter $\lambda$, and having $\alpha$ probability of censorship ( $0 < \alpha < 1$. This is actually, at least from what I tried, a non-trivial problem. Here's the algorithm, and below I'll go through what doesn't work to:

    continue reading...


    Deriving formulas for the expected sample size needed in A/B tests

    December 27th, 2013

    Often an estimate of the number of samples need in an A/B test is asked. Now I've sat down and tried to work out a formula (being disatisfied with other formulas' missing derivations). The below derivation starts off with Bayesian A/B, but uses frequentist methods to derive a single estimate (God help an individual interested in a posterior sample size distribution!)

    continue reading...


    lifelines: survival analysis in Python

    December 19th, 2013

    The lifelines library provides a powerful tool to data analysts and statisticians looking for methods to solve a common problem:

    How do I predict durations?

    This question seems very vague and abstract, but thats only because we can be so general in this space. Some more specific questions lifelines will help you solve are:

    continue reading...


    Evolutionary Group Theory

    October 03th, 2013

    We construct a dynamical population whose individuals are assigned elements from an algebraic group \(G\) and subject them to sexual reproduction. We investigate the relationship between the dynamical system and the underlying group and present three dynamical properties equivalent to the standard group properties.

    continue reading...


    Videos about the Bayesian Methods for Hackers project

    August 25th, 2013

    1. New York Tech Meetup, July 2013: This one is about 2/3 the way through, under the header "Hack of the month"

      Available via MLB Media player
    2. PyData Boston, July 2013: Slides available here

      Video available here.
    continue reading...


    Warrior Dash 2013

    August 03th, 2013

    Warrior dash data, just like last year: continue reading...


    The Next Steps

    June 16th, 2013

    June has been an exciting month. The opensource book Bayesian Methods for Hackers I am working on blew up earlier this month, propelling it into Github's stratosphere. This is both a good and bad thing: good as it exposes more people to the project, hence more collaborators; bad because it is showing off an incomplete project -- a large fear is that advanced data specialists disparage in favour of more mature works the work to beginner dataists.

    continue reading...


    NSA Honeypot

    June 08th, 2013

    Let's perform an experiment.

    continue reading...


    21st Century Problems

    May 16th, 2013

    The technological challenges, and achievements, of the 20th century brought society enormous progress. Technologies like nuclear power, airplanes & automobiles, the digital computer, radio, internet and imaging technologies to name only a handful. Each of these technologies had disrupted the system, and each can be argued to be Black Swans (à la Nassim Taleb). In fact, for each technology, one could find a company killed by it, and a company that made its billions from it.

    continue reading...


    ML Counterexamples Pt.2 - Regression Post-PCA

    April 26th, 2013

    Principle Component Analysis (PCA), also known as Singular Value Decomposition, is one of the most popular tools in the data scientist's toolbox, and it deserves to be there. The following are just a handful of the uses of PCA:

    • data visualization
    • remove noise
    • find noise (useful in finance)
    • clustering
    • reduce dataset dimension before regression/classification, with minimal negative effect
    continue reading...


    Machine Learning Counterexamples Pt.1

    April 24th, 2013

    This will the first of a series of articles on some useful counterexamples in machine learning. What is a machine learning counterexample? I am perhaps using the term counterexample loosely, but in this context a counterexample is a hidden gotcha or otherwise a deviation from intuition.

    Suppose you have a data matrix $X$, which has been normalized and demeaned (as appropriate for linear models). A response vector $Y$, also standardized, is regressed on $X$ using your favourite library and the following coefficients, $\beta$, are returned:

    continue reading...


    Multi-Armed Bandits

    April 06th, 2013

    Suppose you are faced with $N$ slot machines (colourfully called multi-armed bandits). Each bandit has an unknown probability of distributing a prize (assume for now the prizes are the same for each bandit, only the probabilities differ). Some bandits are very generous, others not so much. Of course, you don't know what these probabilities are. By only choosing one bandit per round, our task is devise a strategy to maximize our winnings.

    continue reading...


    Cover for Bayesian Methods for Hackers

    March 25th, 2013

    The very kind Stef Gibson created an amazing cover for my open source book Bayesian Methods for Hackers. View it below:

    continue reading...


    An algorithm to sort "Top" Comments

    March 10th, 2013

    Consider ratings on online products: how often do you trust an average 5-star rating if there is only 1 reviewer? 2 reviewers? 3 reviewers? We implicitly understand that with such few reviewers that the average rating is not a good reflection of the true value of the product.

    This has created flaws in how we sort items. Many people have realized that sorting online search results by their rating, whether the objects be books, videos, or online comments, return poor results.

    continue reading...


    How to solve the Price is Right's Showdown

    February 05th, 2013

    Preface: This example is a (greatly modified) excerpt from the book Probabilistic Programming and Bayesian Methods for Hackers in Python, currently being developed on Github ;)

    How to solve* the Showdown on the Price is Right

    *I use the term loosely and irresponsibly.

    It is incredibly surprising how wild some bids can be on The Price is Right's final game, The Showcase. If you are unfamiliar with how it is played (really?), here's a quick synopsis:

    continue reading...


    My favourite part of The Extended Phenotype

    February 02th, 2013

    To quote directly from the book, by Richard Dawkins:

    continue reading...


    The awesome power of Bayesian Methods - Part II - Optimizing Loss Functions

    January 10th, 2013

    Hi again, this article will really show off the flexibility of Bayesian analysis. Recall, Bayesian inference is basically being interested in the new random variables, $\Theta$, distributed by $$ P( \Theta | X ) \propto L( X | \Theta )P(\Theta )$$ where $X$ is observed data, $L(X | \Theta )$ is the likelihood function and P(\Theta) is the prior distribution for $\Theta$. Normally, computing the closed-form formula for the left-hand side of the above equation is difficult, so I say screw closed-forms. If we can sample from $P( \Theta | X )$ accurately, then we can do as much, possibly more, than if we just had the closed-form. For example, by drawing samples from $P( \Theta | X )$, we can estimate the distribution to arbitrary accuracy. Or find expected values for easily using Monte Carlo. Or maximize functions. Or...well I'll get into it.

    continue reading...


    Interior Design with Machine Learning

    January 04th, 2013

    While designing my new apartment, I found a very cool use of machine learning. Yes, that's right, you can use machine learning in interior design. As crazy as it sounds, it is completely legitimate.

    continue reading...


    The awesome power of Bayesian Methods - What they didn't teach you in grad school. Part I

    December 27th, 2012

    For all the things we learned in grad school, Bayesian methods was something that was skimmed over. Strange too, as we learned all the computationally machinery necessary, but we were never actually shown the power of these methods. Let's start our explanation with an example where the Bayesian analysis clearly simply is more correct (in the sense of getting the right answer).

    continue reading...


    How to bootstrap your way out of biased estimates

    December 06th, 2012

    Bootstrapping is like getting a free lunch, low variance and low bias, by exploiting the Law of Large numbers. Here's how to do it:

    continue reading...


    High-dimensional outlier detection using statistics

    November 27th, 2012

    I stumbled upon a really cool idea of detecting outliers. Classically, one can plot the data and visually find outliers. but this is not possible in higher-dimensions. A better approach to finding outliers is to consider the distance of each point to some central location. Data points that are unreasonably far away are considered outliers and are dealt with.

    continue reading...


    A more sensible omnivore.

    November 17th, 2012

    My girlfriend, who is a vegetarian, and I often discuss the merits and dismerits of being a vegetarian. Though I am not a vegetarian (though I did experiment with veganism and holistic diets during some One Week Ofs), very much agree that eating as much meat as we do is not optimal.

    Producing an ounce of meat requires a surprising amount of energy, whereas it's return energy is very small. We really only eat meat for its taste. It is strange how often we, the human omnivores, require meat in a meal, less it's not a real meal (and we do this three times a day). And unfortunately, a whole culture eating this way is not sustainable.

    I have often thought about a life without meat,

    continue reading...


    Kaggle Data Science Solution: Predicting US Census Return Rates

    November 01th, 2012

    The past month two classmates and I have been attacking a new Kaggle contest, Predicting US Census mail return rates. Basically, we were given large amounts of data about block groups, the second smallest unit of division in the US census, and asked to predict what fraction of individuals from the block group would mail back their 2010 census form.

    continue reading...


    Visualizing clusters of stocks

    October 14th, 2012

    One troubling aspect of an estimated covariance matrix is that it always overestimates the true covariance. For example, if two random variables are independent the covariance estimate for the two variables is always non-zero. It will converge to 0, yes, but it may take a really long time.

    What's worse is that the covariance matrix does not understand causality. Consider the certainly common situation below:

    continue reading...


    Twitter Pyschopathy

    October 08th, 2012

    Released earlier was the research paper about predicting psychopathy using Twitter behaviour. Read the paper here.

    continue reading...


    UWaterloo Subway Map

    September 22th, 2012

    I think my thing with subway maps is getting weird. I just created a fictional University of Waterloo subway map using my subway.js library.

    continue reading...


    Sampling from a Conditional Markov Chain

    September 15th, 2012

    My last project involving the artificial creation of "human-generated" passwords required me to sample from a Markov Chain. This is not very difficult, and I'll outline the sampling algorithm below. For the setup, suppose you have a transition probability matrix $M$ and an initial probability vector $\mathbf{v}$. The element $(i,j)$ of $M$ is the probability of the next state being $j$ given that the current state is $i$. The initial probability vector element $i$ is the probability that the first state is $i$. If you have these quantities, then to sample from a realized Markov process is simple:

    continue reading...


    Modeling password creation

    September 14th, 2012

    Creating a password is an embarrassingly difficult task. A password needs to be both memorable and unique enough not to be guessed. The former criterion prevents using randomly generated passwords (try remembering 9st6Uqfe4Z for Gmail, rAEOZmePfT for Facebook, etc.), and the latter is the reason why passwords exist in the first place. So the task falls on humans to create their own passwords and carefully balance these two criteria. This has been, and still is, a bad idea.

    continue reading...


    Eurotrip & Python

    August 13th, 2012

    Later this month, my lovely girlfriend and I are travelling to Amsterdam, Berlin and Kiel. The first half of the trip we will be exploring the tourist and nontourist areas of Amsterdam and Berlin. I'm very excited as I get to spend time drinking and relaxing with my girlfriend. But then...

    continue reading...


    Turn your Android phone into a SMS-based Command Line

    August 11th, 2012

    One of my biggest pet peeves is not having my phone with me. This often occurs if the phone is charging and I need to leave, or I have forgotten it somewhere, or it is lost, or etc. I've created a partial solution.

    continue reading...


    Least Squares Regression with L1 Penalty

    July 31th, 2012

    I want to discuss, and exhibit, a really cool idea in machine learning, optimization and statistics. It's a simple idea: adding a constraint to an optimization problem, specifically a constraint on the sum, can have huge impacts on your interpretation, robustness and sanity. I must first introduce the family of functions we will be discussing.

    The family of L-norm penalty functions, $L_p:R^d \rightarrow R$, is defined: $$ L_p( x ) = || x ||_p = \left( \sum_{i=1}^d |x_i|^p \right) ^{1/p} \;\: p>0 $$ For $p=2$, this is the familar Euclidean distance. The most often used in machine learning literature are the

    continue reading...


    Warrior Dash Data

    July 25th, 2012

    Last Sunday I competed in a pretty epic competition: The Warrior Dash. It's 5k of, well honestly, it's 5k of mostly hills and trail running. Plus spread throughout are some pretty fun obstacles. With only five training workouts un...

    continue reading...


    Subway.js

    July 17th, 2012

    The javascript code that creates and controls the subway map above is available on GitHub. You can build your own using the pretty self-explanatory code + README document. Imagine using the code in a school project or advertising...

    continue reading...


    Kernelized and Supervised Principle Component Analysis

    July 13th, 2012

    Sorry the title is a bit of a mouthful. Everyone in statistics has heard of Principle Components Analysis ( PCA ). The idea is so simple, and a personal favourite of mine, so I'll detail it here.

    continue reading...


    Python Android Scripts

    July 05th, 2012

    I am having a blast messing around with my new Android phone. It has Python! Currently I am playing with the sensors on the phone. Built-in is a light sensor, accelerometer, and an

    continue reading...


    Predicting Psychopathy using Twitter Data

    July 03th, 2012

    The goal of this Kaggle contest was to predict an individuals psychopathic rating using information from their Twitter profile. I was given the already processed data and psychopathic scores. This was the first Kaggle competition I entered, and certainly not the last! If you'll excuse me, I must begin my technical remarks on my solution:

    continue reading...


    CamDP++

    July 03th, 2012

    Camdp.com is my latest attempt to digitize myself. I tried to map the subway lines to mimic my life and work, with each subway line representing a train of thought. I hope you enjoy the continue reading...


    Data Science FAQ

    July 02th, 2012

    What is data science? What is an example of a data set? What are some of the goals of data science? What are some examples of data science in action? continue reading...


    (All Blog Articles).filter( Science )

    Generating exponential survival data

    March 02th, 2014

    TLDR: Suppose we interested in generating exponential survival times with scale parameter $\lambda$, and having $\alpha$ probability of censorship ( $0 < \alpha < 1$. This is actually, at least from what I tried, a non-trivial problem. Here's the algorithm, and below I'll go through what doesn't work to:

    continue...


    Deriving formulas for the expected sample size needed in A/B tests

    December 27th, 2013

    Often an estimate of the number of samples need in an A/B test is asked. Now I've sat down and tried to work out a formula (being disatisfied with other formulas' missing derivations). The below derivation starts off with Bayesian A/B, but uses frequentist methods to derive a single estimate (God help an individual interested in a posterior sample size distribution!)

    continue...


    lifelines: survival analysis in Python

    December 19th, 2013

    The lifelines library provides a powerful tool to data analysts and statisticians looking for methods to solve a common problem:

    How do I predict durations?

    This question seems very vague and abstract, but that's only because we can be so general in this space. Some more specific questions lifelines will help you solve are:

    continue...


    Evolutionary Group Theory

    October 3rd, 2013

    We construct a dynamical population whose individuals are assigned elements from an algebraic group \(G\) and subject them to sexual reproduction. We investigate the relationship between the dynamical system and the underlying group and present three dynamical properties equivalent to the standard group properties.

    continue...


    21st Century Problems

    May 16th, 2013

    The technological challenges, and achievements, of the 20th century brought society enormous progress: technologies like nuclear power, airplanes and automobiles, the digital computer, radio, the internet and imaging, to name only a handful. Each of these technologies disrupted the system, and each can be argued to be a Black Swan (à la Nassim Taleb). In fact, for each technology one could find a company killed by it, and a company that made its billions from it.

    continue...


    ML Counterexamples Pt.2 - Regression Post-PCA

    April 26th, 2013

    Principal Component Analysis (PCA), closely related to the Singular Value Decomposition, is one of the most popular tools in the data scientist's toolbox, and it deserves to be there. The following are just a handful of the uses of PCA:

    • data visualization
    • remove noise
    • find noise (useful in finance)
    • clustering
    • reduce dataset dimension before regression/classification, with minimal negative effect
    continue...
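The gotcha the title hints at can be sketched like this: PCA keeps directions of high variance, which need not be the directions that predict the response. A toy illustration (my construction, not necessarily the post's example):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1000

# x1 has large variance but is pure noise with respect to y;
# x2 has tiny variance but completely determines y.
x1 = rng.normal(scale=10.0, size=n)
x2 = rng.normal(scale=0.1, size=n)
y = 5.0 * x2
X = np.column_stack([x1, x2])

# Keep only the top principal component -- which is the x1 direction...
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Xc @ Vt[0]

# ...and y is now essentially unpredictable from the reduced data.
print(np.corrcoef(x2, y)[0, 1])   # ~1.0: x2 predicts y perfectly
print(np.corrcoef(pc1, y)[0, 1])  # ~0.0: the top PC does not
```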


    Machine Learning Counterexamples Pt.1

    April 24th, 2013

    This will be the first of a series of articles on some useful counterexamples in machine learning. What is a machine learning counterexample? I am perhaps using the term counterexample loosely, but in this context a counterexample is a hidden gotcha or otherwise a deviation from intuition.

    Suppose you have a data matrix $X$, which has been normalized and demeaned (as appropriate for linear models). A response vector $Y$, also standardized, is regressed on $X$ using your favourite library and the following coefficients, $\beta$, are returned:

    continue...


    Multi-Armed Bandits

    April 6th, 2013

    Suppose you are faced with $N$ slot machines (colourfully called multi-armed bandits). Each bandit has an unknown probability of distributing a prize (assume for now the prizes are the same for each bandit, only the probabilities differ). Some bandits are very generous, others not so much. Of course, you don't know what these probabilities are. By only choosing one bandit per round, our task is to devise a strategy to maximize our winnings.

    continue...
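One Bayesian strategy for this problem is Thompson sampling: keep a Beta posterior on each bandit's payout probability, sample from each posterior, and play the arm with the highest draw. A minimal sketch (the full post may use a different variant):

```python
import numpy as np

rng = np.random.default_rng(42)
true_probs = np.array([0.15, 0.20, 0.30])  # hidden payout probabilities
wins = np.zeros(3)
trials = np.zeros(3)

for _ in range(2000):
    # Sample a payout probability from each arm's Beta posterior
    # (flat Beta(1, 1) priors), then play the most promising arm.
    draws = rng.beta(1 + wins, 1 + trials - wins)
    k = int(np.argmax(draws))
    reward = rng.random() < true_probs[k]
    wins[k] += reward
    trials[k] += 1

print(trials)  # pulls concentrate on the best arm over time
```

Exploration happens automatically: arms with few pulls have wide posteriors, so they occasionally produce the highest draw and get tried again.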


    An algorithm to sort "Top" Comments

    March 10th, 2013

    Consider ratings on online products: how often do you trust an average 5-star rating if there is only 1 reviewer? 2 reviewers? 3 reviewers? We implicitly understand that with so few reviewers, the average rating is not a good reflection of the true value of the product.

    This has created flaws in how we sort items. Many people have realized that sorting online search results by their rating, whether the objects are books, videos, or online comments, returns poor results.

    continue...
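A classical fix for the few-reviewers problem is to sort by a lower confidence bound on the true rating rather than by the raw average. Here's the well-known Wilson score lower bound as a sketch; the post itself develops a Bayesian approach, so treat this as the frequentist analogue:

```python
import math

def wilson_lower_bound(ups, n, z=1.96):
    """Lower bound of the Wilson score interval for a Bernoulli
    proportion: a pessimistic estimate of the 'true' positive rate
    that shrinks toward 0 when there are few ratings."""
    if n == 0:
        return 0.0
    p = ups / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - spread) / denom

# A single perfect review ranks below 48 positive ratings out of 50:
print(wilson_lower_bound(1, 1))    # ~0.21
print(wilson_lower_bound(48, 50))  # ~0.87
```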


    My favourite part of The Extended Phenotype

    February 2nd, 2013

    To quote directly from the book, by Richard Dawkins:

    continue...


    N is never large.

    January 15th, 2013

    continue...


    The awesome power of Bayesian Methods - Part II - Optimizing Loss Functions

    January 10th, 2013

    Hi again, this article will really show off the flexibility of Bayesian analysis. Recall, Bayesian inference is basically being interested in the new random variables, $\Theta$, distributed by $$ P( \Theta | X ) \propto L( X | \Theta )P(\Theta )$$ where $X$ is observed data, $L(X | \Theta )$ is the likelihood function and $P(\Theta)$ is the prior distribution for $\Theta$. Normally, computing the closed-form formula for the left-hand side of the above equation is difficult, so I say screw closed forms. If we can sample from $P( \Theta | X )$ accurately, then we can do as much, possibly more, than if we just had the closed form. For example, by drawing samples from $P( \Theta | X )$, we can estimate the distribution to arbitrary accuracy. Or easily compute expected values using Monte Carlo. Or maximize functions. Or... well, I'll get into it.

    continue...
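A tiny illustration of the sampling idea, using a conjugate Beta-Bernoulli model where the posterior can be sampled directly (no MCMC needed for this toy case):

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed data: 7 heads in 10 coin flips.
heads, n = 7, 10

# With a flat Beta(1, 1) prior on the head-probability theta, the
# posterior P(theta | X) is Beta(1 + heads, 1 + n - heads), which we
# can sample directly -- no closed-form manipulation required.
samples = rng.beta(1 + heads, 1 + n - heads, size=100_000)

# Monte Carlo estimates, straight from the posterior samples:
print(samples.mean())                        # E[theta | X], about 8/12
print((samples > 0.5).mean())                # P(theta > 0.5 | X)
print(np.percentile(samples, [2.5, 97.5]))   # 95% credible interval
```

Any posterior quantity -- tail probabilities, intervals, expectations of arbitrary functions -- reduces to an average over the samples.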


    The awesome power of Bayesian Methods - What they didn't teach you in grad school. Part I

    December 27th, 2012

    For all the things we learned in grad school, Bayesian methods were something that was skimmed over. Strange too, as we learned all the computational machinery necessary, but we were never actually shown the power of these methods. Let's start our explanation with an example where the Bayesian analysis is simply more correct (in the sense of getting the right answer).

    continue...


    How to bootstrap your way out of biased estimates

    December 6th, 2012

    Bootstrapping is like getting a free lunch, low variance and low bias, by exploiting the Law of Large Numbers. Here's how to do it:

    continue...
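A minimal sketch of the percentile bootstrap (my illustration of the general technique, not necessarily the post's exact recipe):

```python
import numpy as np

def bootstrap_ci(data, stat, n_boot=5000, alpha=0.05, seed=1):
    """Percentile bootstrap: resample the data with replacement, recompute
    the statistic each time, and read a confidence interval off the
    empirical quantiles of those recomputed statistics."""
    rng = np.random.default_rng(seed)
    n = len(data)
    stats = np.array([stat(rng.choice(data, size=n, replace=True))
                      for _ in range(n_boot)])
    return tuple(np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)]))

data = np.random.default_rng(2).exponential(scale=3.0, size=200)
lo, hi = bootstrap_ci(data, np.median)
print(lo, hi)  # a 95% interval for the median, no formulas required
```

The appeal is generality: swap `np.median` for any statistic, however awkward its sampling distribution, and the procedure is unchanged.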


    High-dimensional outlier detection using statistics

    November 27th, 2012

    I stumbled upon a really cool idea for detecting outliers. Classically, one can plot the data and visually find outliers, but this is not possible in higher dimensions. A better approach to finding outliers is to consider the distance of each point to some central location. Data points that are unreasonably far away are considered outliers and are dealt with.

    continue...
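One standard instance of the "distance to a central location" idea is the Mahalanobis distance, which measures distance in units that respect the data's covariance. A sketch (the post's particular method may differ):

```python
import numpy as np

def mahalanobis_outliers(X, threshold=4.0):
    """Flag rows of X whose Mahalanobis distance to the column means
    exceeds `threshold`. Unlike raw Euclidean distance, this accounts
    for the scale and correlation of each dimension."""
    mu = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mu
    d = np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))
    return d > threshold

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 5))
X[0] = 10.0  # plant an obvious outlier
flags = mahalanobis_outliers(X)
print(flags[0], flags.sum())  # the planted point is flagged
```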


    Visualizing clusters of stocks

    October 14th, 2012

    One troubling aspect of an estimated covariance matrix is that it always overestimates the true covariance. For example, if two random variables are independent, the covariance estimate for the two variables is essentially always non-zero. It will converge to 0, yes, but it may take a really long time.

    What's worse is that the covariance matrix does not understand causality. Consider the common situation below:

    continue...
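The first point is easy to check numerically: even genuinely independent variables show a non-zero estimated covariance at finite sample sizes.

```python
import numpy as np

rng = np.random.default_rng(4)

# Two genuinely independent samples...
x = rng.normal(size=250)
y = rng.normal(size=250)

# ...still have a non-zero estimated covariance: the estimate only
# approaches 0 as the sample size grows, and slowly at that.
print(np.cov(x, y)[0, 1])
```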


    Twitter Psychopathy

    October 8th, 2012

    Released earlier was the research paper about predicting psychopathy using Twitter behaviour. Read the paper here.

    continue...


    Sampling from a Conditional Markov Chain

    September 15th, 2012

    My last project involving the artificial creation of "human-generated" passwords required me to sample from a Markov Chain. This is not very difficult, and I'll outline the sampling algorithm below. For the setup, suppose you have a transition probability matrix $M$ and an initial probability vector $\mathbf{v}$. The element $(i,j)$ of $M$ is the probability of the next state being $j$ given that the current state is $i$. The initial probability vector element $i$ is the probability that the first state is $i$. If you have these quantities, then sampling a realized Markov process is simple:

    continue...
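The sampling procedure described above, as a short sketch:

```python
import numpy as np

def sample_markov_chain(M, v, n_steps, seed=5):
    """Sample a realized Markov process: draw the first state from the
    initial probability vector v, then repeatedly draw the next state
    from the current state's row of the transition matrix M."""
    rng = np.random.default_rng(seed)
    states = [rng.choice(len(v), p=v)]
    for _ in range(n_steps - 1):
        states.append(rng.choice(M.shape[1], p=M[states[-1]]))
    return states

M = np.array([[0.9, 0.1],    # row i: transition probabilities out of state i
              [0.5, 0.5]])
v = np.array([0.5, 0.5])     # initial state probabilities
print(sample_markov_chain(M, v, 10))
```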


    Least Squares Regression with L1 Penalty

    July 31st, 2012

    I want to discuss, and exhibit, a really cool idea in machine learning, optimization and statistics. It's a simple idea: adding a constraint to an optimization problem, specifically a constraint on the sum, can have huge impacts on your interpretation, robustness and sanity. I must first introduce the family of functions we will be discussing.

    The family of L-norm penalty functions, $L_p:R^d \rightarrow R$, is defined: $$ L_p( x ) = || x ||_p = \left( \sum_{i=1}^d |x_i|^p \right) ^{1/p} \;\: p>0 $$ For $p=2$, this is the familiar Euclidean distance. The most often used in machine learning literature are the

    continue...
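The definition above, as code:

```python
import numpy as np

def lp_norm(x, p):
    """The L_p penalty: ||x||_p = (sum_i |x_i|^p)^(1/p), for p > 0."""
    return float(np.sum(np.abs(x) ** p) ** (1.0 / p))

x = np.array([3.0, -4.0])
print(lp_norm(x, 2))  # 5.0, the Euclidean length
print(lp_norm(x, 1))  # 7.0, the L1 norm
```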


    Kernelized and Supervised Principal Component Analysis

    July 13th, 2012

    Sorry, the title is a bit of a mouthful. Everyone in statistics has heard of Principal Components Analysis (PCA). The idea is so simple, and a personal favourite of mine, so I'll detail it here.

    continue...


    Predicting Psychopathy using Twitter Data

    July 3rd, 2012

    The goal of this Kaggle contest was to predict an individual's psychopathy rating using information from their Twitter profile. I was given the already-processed data and psychopathy scores. This was the first Kaggle competition I entered, and certainly not the last! If you'll excuse me, I must begin my technical remarks on my solution:

    continue...


    Data Science FAQ

    July 2nd, 2012

    What is data science? What is an example of a data set? What are some of the goals of data science? What are some examples of data science in action? continue...


    (All Blog Articles).filter( Coding )

    lifelines: survival analysis in Python

    December 19th, 2013

    The lifelines library provides a powerful tool to data analysts and statisticians looking for methods to solve a common problem:

    How do I predict durations?

    This question seems very vague and abstract, but that's only because we can be so general in this space. Some more specific questions lifelines will help you solve are:

    continue...


    An algorithm to sort "Top" Comments

    March 10th, 2013

    Consider ratings on online products: how often do you trust an average 5-star rating if there is only 1 reviewer? 2 reviewers? 3 reviewers? We implicitly understand that with so few reviewers, the average rating is not a good reflection of the true value of the product.

    This has created flaws in how we sort items. Many people have realized that sorting online search results by their rating, whether the objects are books, videos, or online comments, returns poor results.

    continue...


    How to solve the Price is Right's Showdown

    February 5th, 2013

    Preface: This example is a (greatly modified) excerpt from the book Probabilistic Programming and Bayesian Methods for Hackers in Python, currently being developed on Github ;)

    How to solve* the Showdown on the Price is Right

    *I use the term loosely and irresponsibly.

    It is incredibly surprising how wild some bids can be on The Price is Right's final game, The Showcase. If you are unfamiliar with how it is played (really?), here's a quick synopsis:

    continue...


    Kaggle Data Science Solution: Predicting US Census Return Rates

    November 1st, 2012

    Over the past month, two classmates and I have been attacking a new Kaggle contest, Predicting US Census mail return rates. Basically, we were given large amounts of data about block groups, the second-smallest unit of division in the US census, and asked to predict what fraction of individuals from each block group would mail back their 2010 census form.

    continue...


    Visualizing clusters of stocks

    October 14th, 2012

    One troubling aspect of an estimated covariance matrix is that it always overestimates the true covariance. For example, if two random variables are independent, the covariance estimate for the two variables is essentially always non-zero. It will converge to 0, yes, but it may take a really long time.

    What's worse is that the covariance matrix does not understand causality. Consider the common situation below:

    continue...


    UWaterloo Subway Map

    September 22nd, 2012

    I think my thing with subway maps is getting weird. I just created a fictional University of Waterloo subway map using my subway.js library.

    continue...


    Modeling password creation

    September 14th, 2012

    Creating a password is an embarrassingly difficult task. A password needs to be both memorable and unique enough not to be guessed. The former criterion prevents using randomly generated passwords (try remembering 9st6Uqfe4Z for Gmail, rAEOZmePfT for Facebook, etc.), and the latter is the reason why passwords exist in the first place. So the task falls on humans to create their own passwords and carefully balance these two criteria. This has been, and still is, a bad idea.

    continue...


    Eurotrip & Python

    August 13th, 2012

    Later this month, my lovely girlfriend and I are travelling to Amsterdam, Berlin and Kiel. The first half of the trip we will be exploring the tourist and nontourist areas of Amsterdam and Berlin. I'm very excited as I get to spend time drinking and relaxing with my girlfriend. But then...

    continue...


    Turn your Android phone into a SMS-based Command Line

    August 11th, 2012

    One of my biggest pet peeves is not having my phone with me. This often occurs if the phone is charging and I need to leave, or I have forgotten it somewhere, or it is lost, etc. I've created a partial solution.

    continue...


    Subway.js

    July 17th, 2012

    The javascript code that creates and controls the subway map above is available on GitHub. You can build your own using the pretty self-explanatory code + README document. Imagine using the code in a school project or advertising...

    continue...


    Python Android Scripts

    July 5th, 2012

    I am having a blast messing around with my new Android phone. It has Python! Currently I am playing with the sensors on the phone. Built-in is a light sensor, accelerometer, and an

    continue...


    Predicting Psychopathy using Twitter Data

    July 3rd, 2012

    The goal of this Kaggle contest was to predict an individual's psychopathy rating using information from their Twitter profile. I was given the already-processed data and psychopathy scores. This was the first Kaggle competition I entered, and certainly not the last! If you'll excuse me, I must begin my technical remarks on my solution:

    continue...


    (All Blog Articles).filter( Awesome Stuff )

    Videos about the Bayesian Methods for Hackers project

    August 25th, 2013

    1. New York Tech Meetup, July 2013: This one is about 2/3 the way through, under the header "Hack of the month"

      Available via MLB Media player
    2. PyData Boston, July 2013: Slides available here

      Video available here.
    continue...


    Warrior Dash 2013

    August 3rd, 2013

    Warrior dash data, just like last year: continue...


    The Next Steps

    June 16th, 2013

    June has been an exciting month. The open-source book Bayesian Methods for Hackers I am working on blew up earlier this month, propelling it into Github's stratosphere. This is both a good and a bad thing: good, as it exposes more people to the project, and hence more collaborators; bad, because it is showing off an incomplete project -- a large fear is that advanced data specialists will disparage the work to beginner dataists in favour of more mature texts.

    continue...


    NSA Honeypot

    June 8th, 2013

    Let's perform an experiment.

    continue...


    21st Century Problems

    May 16th, 2013

    The technological challenges, and achievements, of the 20th century brought society enormous progress: technologies like nuclear power, airplanes and automobiles, the digital computer, radio, the internet and imaging, to name only a handful. Each of these technologies disrupted the system, and each can be argued to be a Black Swan (à la Nassim Taleb). In fact, for each technology one could find a company killed by it, and a company that made its billions from it.

    continue...


    Cover for Bayesian Methods for Hackers

    March 25th, 2013

    The very kind Stef Gibson created an amazing cover for my open source book Bayesian Methods for Hackers. View it below:

    continue...


    My favourite part of The Extended Phenotype

    February 2nd, 2013

    To quote directly from the book, by Richard Dawkins:

    continue...


    Interior Design with Machine Learning

    January 4th, 2013

    While designing my new apartment, I found a very cool use of machine learning. Yes, that's right, you can use machine learning in interior design. As crazy as it sounds, it is completely legitimate.

    continue...


    A more sensible omnivore.

    November 17th, 2012

    My girlfriend, who is a vegetarian, and I often discuss the merits and demerits of being a vegetarian. Though I am not a vegetarian (I did experiment with veganism and holistic diets during some One Week Ofs), I very much agree that eating as much meat as we do is not optimal.

    Producing an ounce of meat requires a surprising amount of energy, whereas its return energy is very small. We really only eat meat for its taste. It is strange how often we, the human omnivores, require meat in a meal, lest it not be a real meal (and we do this three times a day). Unfortunately, a whole culture eating this way is not sustainable.

    I have often thought about a life without meat,

    continue...


    UWaterloo Subway Map

    September 22nd, 2012

    I think my thing with subway maps is getting weird. I just created a fictional University of Waterloo subway map using my subway.js library.

    continue...


    Modeling password creation

    September 14th, 2012

    Creating a password is an embarrassingly difficult task. A password needs to be both memorable and unique enough not to be guessed. The former criterion prevents using randomly generated passwords (try remembering 9st6Uqfe4Z for Gmail, rAEOZmePfT for Facebook, etc.), and the latter is the reason why passwords exist in the first place. So the task falls on humans to create their own passwords and carefully balance these two criteria. This has been, and still is, a bad idea.

    continue...


    Eurotrip & Python

    August 13th, 2012

    Later this month, my lovely girlfriend and I are travelling to Amsterdam, Berlin and Kiel. The first half of the trip we will be exploring the tourist and nontourist areas of Amsterdam and Berlin. I'm very excited as I get to spend time drinking and relaxing with my girlfriend. But then...

    continue...


    Warrior Dash Data

    July 25th, 2012

    Last Sunday I competed in a pretty epic competition: The Warrior Dash. It's 5k of, well honestly, it's 5k of mostly hills and trail running. Plus spread throughout are some pretty fun obstacles. With only five training workouts un...

    continue...


    Subway.js

    July 17th, 2012

    The javascript code that creates and controls the subway map above is available on GitHub. You can build your own using the pretty self-explanatory code + README document. Imagine using the code in a school project or advertising...

    continue...


    CamDP++

    July 3rd, 2012

    Camdp.com is my latest attempt to digitize myself. I tried to map the subway lines to mimic my life and work, with each subway line representing a train of thought. I hope you enjoy the continue...