• Applied reinforcement learning to the multi-armed bandit problem, predicting which arm is most profitable to bet on using approaches such as ε-greedy, UCB, and Thompson sampling.
- ε-greedy is the simplest method: with probability 1−ε it exploits the arm with the best empirical mean so far, and with probability ε it explores an arm chosen uniformly at random. A consequence is that exploitation and exploration are sometimes poorly balanced (the agent can fixate on a local maximum), but if it happens upon the globally best arm early on it is willing to trust it and can produce early wins.
- UCB: This typically works better than ε-greedy because it replaces inefficient uniform-random exploration with directed exploration: each arm is scored by its empirical mean plus an upper-confidence-bound bonus that shrinks the more often the arm is pulled, so under-explored arms are tried preferentially.
- Thompson Sampling: At each time t, we sample an expected reward Q̃(a) from the current Beta(α_a, β_a) distribution for every action. The best action among the samples is selected: a(t) = argmax_a Q̃(a). After the true reward is observed, we update that arm's Beta distribution accordingly. This is essentially Bayesian inference: computing the posterior from the known prior and the likelihood of the observed data.
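The three strategies above can be sketched on a simple Bernoulli bandit. This is a minimal illustration, not the project's actual code; the arm success probabilities, step count, and function names are hypothetical.

```python
import math
import random

def run_bandit(select, true_probs, steps=5000, seed=0):
    """Run one selection strategy on a Bernoulli bandit; return total reward."""
    rng = random.Random(seed)
    k = len(true_probs)
    counts = [0] * k  # pulls per arm
    wins = [0] * k    # observed successes per arm
    total = 0
    for t in range(1, steps + 1):
        a = select(t, counts, wins, rng)
        r = 1 if rng.random() < true_probs[a] else 0
        counts[a] += 1
        wins[a] += r
        total += r
    return total

def eps_greedy(eps=0.1):
    def select(t, counts, wins, rng):
        # explore with probability eps (or until every arm has been tried once)
        if rng.random() < eps or 0 in counts:
            return rng.randrange(len(counts))
        # exploit: arm with the best empirical mean so far
        return max(range(len(counts)), key=lambda a: wins[a] / counts[a])
    return select

def ucb1(t, counts, wins, rng):
    # pull each arm once, then maximise empirical mean + confidence bonus
    for a, n in enumerate(counts):
        if n == 0:
            return a
    return max(range(len(counts)),
               key=lambda a: wins[a] / counts[a]
                             + math.sqrt(2 * math.log(t) / counts[a]))

def thompson(t, counts, wins, rng):
    # sample Q~(a) from the posterior Beta(1 + wins, 1 + losses), pick argmax
    return max(range(len(counts)),
               key=lambda a: rng.betavariate(1 + wins[a],
                                             1 + counts[a] - wins[a]))

if __name__ == "__main__":
    probs = [0.2, 0.5, 0.7]  # hypothetical arm success rates
    for name, s in [("eps-greedy", eps_greedy()), ("UCB1", ucb1),
                    ("Thompson", thompson)]:
        print(name, run_bandit(s, probs))
```

All three should accumulate far more reward than pulling arms uniformly at random (expected mean ≈ 0.47 per step here), since each concentrates its pulls on the 0.7 arm as evidence accumulates.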