• Applied reinforcement learning to the multi-armed bandit problem, predicting which arm is most profitable to bet on using approaches such as ε-greedy, UCB, and Thompson sampling.
- ε-greedy is the simplest method: with probability 1−ε it exploits the arm with the best empirical mean so far, and with probability ε it explores an arm chosen uniformly at random. A consequence is that exploitation and exploration are sometimes poorly balanced (the agent can fixate on a local maximum), but if it happens upon the globally best arm early on it is willing to trust it and can produce early wins.
- UCB: This typically works better than ε-greedy because it replaces inefficient uniform-random exploration with directed exploration: each arm is scored by its empirical mean plus an upper-confidence-bound bonus that shrinks the more often the arm is pulled, so under-explored arms are tried preferentially.
- Thompson Sampling: At each time t, we sample an expected reward Q̃(a) from the current Beta(α_a, β_a) distribution for every action. The best action among the samples is selected: a(t) = argmax_a Q̃(a). After the true reward is observed, we update that arm's Beta distribution accordingly. This is essentially Bayesian inference: computing the posterior from the known prior and the likelihood of the observed data.
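The three strategies above can be sketched on a simple Bernoulli bandit. This is a minimal illustration, not the project's actual code; the arm success probabilities, step count, and function names are hypothetical.

```python
import math
import random

def run_bandit(select, true_probs, steps=5000, seed=0):
    """Run one selection strategy on a Bernoulli bandit; return total reward."""
    rng = random.Random(seed)
    k = len(true_probs)
    counts = [0] * k  # pulls per arm
    wins = [0] * k    # observed successes per arm
    total = 0
    for t in range(1, steps + 1):
        a = select(t, counts, wins, rng)
        r = 1 if rng.random() < true_probs[a] else 0
        counts[a] += 1
        wins[a] += r
        total += r
    return total

def eps_greedy(eps=0.1):
    def select(t, counts, wins, rng):
        # explore with probability eps (or until every arm has been tried once)
        if rng.random() < eps or 0 in counts:
            return rng.randrange(len(counts))
        # exploit: arm with the best empirical mean so far
        return max(range(len(counts)), key=lambda a: wins[a] / counts[a])
    return select

def ucb1(t, counts, wins, rng):
    # pull each arm once, then maximise empirical mean + confidence bonus
    for a, n in enumerate(counts):
        if n == 0:
            return a
    return max(range(len(counts)),
               key=lambda a: wins[a] / counts[a]
                             + math.sqrt(2 * math.log(t) / counts[a]))

def thompson(t, counts, wins, rng):
    # sample Q~(a) from the posterior Beta(1 + wins, 1 + losses), pick argmax
    return max(range(len(counts)),
               key=lambda a: rng.betavariate(1 + wins[a],
                                             1 + counts[a] - wins[a]))

if __name__ == "__main__":
    probs = [0.2, 0.5, 0.7]  # hypothetical arm success rates
    for name, s in [("eps-greedy", eps_greedy()), ("UCB1", ucb1),
                    ("Thompson", thompson)]:
        print(name, run_bandit(s, probs))
```

All three should accumulate far more reward than pulling arms uniformly at random (expected mean ≈ 0.47 per step here), since each concentrates its pulls on the 0.7 arm as evidence accumulates.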