
BTech Exploratory Project

Under the guidance of Dr. Lakshmanan Kailasam, IIT BHU



Submission Date : 10th May 2021

Team Members

Satya Prakash : Roll No - 19075066
Nikita : Roll No - 19075048

Presentation Link - PPT for Exploratory Project

Analysing Multi Armed Bandit Algorithms on different Probability Distributions

  • Algorithms analysed : ε-greedy, UCB1 and UCBV
  • Probability Distributions : Normal and Heavy Tailed distributions

The multi-armed bandit problem is a classical reinforcement learning problem. Consider a situation where, at each time step, we must select one arm out of k possibilities; each arm yields a random reward drawn from a probability distribution that is unknown to us. Our goal is to maximize the expected total reward obtained over a period of time.
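To make the setup concrete, here is a minimal sketch of such a k-armed testbed in Python. The class name `GaussianBandit` and its interface are illustrative assumptions, not code from this repository; the arm means are drawn once and hidden from the learner, which only observes noisy rewards.

```python
import numpy as np

class GaussianBandit:
    """A k-armed bandit whose arm means are fixed but unknown to the agent."""

    def __init__(self, k=10, rng=None):
        self.rng = rng or np.random.default_rng()
        self.means = self.rng.normal(0.0, 1.0, size=k)   # true mean reward of each arm
        self.k = k

    def pull(self, arm):
        # Observed reward: the arm's true mean plus unit-variance Gaussian noise.
        return self.rng.normal(self.means[arm], 1.0)

    def optimal_arm(self):
        # Best arm, used only for evaluation (the agent never sees this).
        return int(np.argmax(self.means))
```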

Motivation

There are many real-life applications of the multi-armed bandit problem today, such as A/B testing of online advertisements, sequential portfolio selection, recommendation systems used by digital platforms to offer personalized products to their users, and testing the effects of various medical treatments in clinical trials.
As these applications show, the problem comes down to balancing exploiting the knowledge already obtained and exploring new actions so as to increase that knowledge.


Algorithms Implemented

1. ε-Greedy Algorithm :

It selects the arm with the highest estimated mean reward with probability (1 - ε), i.e. most of the time, and with probability ε selects one of the arms uniformly at random. It is the most basic and easiest algorithm to implement.
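A minimal sketch of an ε-greedy agent is shown below, assuming incremental sample-mean updates for the reward estimates. The class name `EpsilonGreedy` and its interface are illustrative, not taken from the repository.

```python
import numpy as np

class EpsilonGreedy:
    def __init__(self, k, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = np.zeros(k)      # number of pulls per arm
        self.estimates = np.zeros(k)   # sample-mean reward per arm

    def select(self, rng):
        if rng.random() < self.epsilon:
            return int(rng.integers(len(self.counts)))   # explore: uniform random arm
        return int(np.argmax(self.estimates))            # exploit: greedy arm

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental sample-mean update of the arm's estimated reward.
        self.estimates[arm] += (reward - self.estimates[arm]) / self.counts[arm]
```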

2. UCB Family of Algorithms:

In the ε-greedy algorithm, all arms are treated equally during exploration, so we may keep exploring an arm that we have already confirmed to be bad. We would prefer a strategy that concentrates exploration on the arms we are more uncertain about. The UCB algorithms define an upper confidence bound on each arm's mean reward, chosen so that the actual mean reward stays below this bound with high probability, and always play the arm with the largest bound.

3. UCB1 Algorithm :

The exploration term is defined as

$\sqrt{\dfrac{2 \ln t}{n_i}}$

The exploration bonus is not random: it depends on the number of times an arm has been selected ($n_i$) and the current time step ($t$).

In this case, the expected cumulative regret grows logarithmically with the number of time steps. Most of the regret is incurred during the initial exploration phase, after which the regret curve becomes nearly flat.
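Below is a sketch of the UCB1 selection rule in the standard formulation (play each arm once, then pick the arm maximizing estimate plus exploration bonus); the class name and interface are illustrative, not from the repository.

```python
import numpy as np

class UCB1:
    def __init__(self, k):
        self.counts = np.zeros(k)
        self.estimates = np.zeros(k)
        self.t = 0

    def select(self, rng=None):
        self.t += 1
        # Play every arm once before applying the confidence bound.
        untried = np.where(self.counts == 0)[0]
        if len(untried) > 0:
            return int(untried[0])
        # Exploration bonus sqrt(2 ln t / n_i) added to the sample mean.
        bonus = np.sqrt(2.0 * np.log(self.t) / self.counts)
        return int(np.argmax(self.estimates + bonus))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.estimates[arm] += (reward - self.estimates[arm]) / self.counts[arm]
```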

4. UCB3 Algorithm :

It is a modified version of the ε-greedy algorithm.
It selects the arm with the highest estimated mean reward with probability (1 - ε) and selects a random arm with probability ε.
Unlike the ε-greedy algorithm, in this case the exploration factor depends on the arm.

5. UCBV Algorithm :

The uncertainty in an arm's estimate is also driven by the variance of its reward distribution.
The UCBV algorithm takes the sample variance of the rewards into account in the exploration term, which in its standard form is

$\sqrt{\dfrac{2 V_i \ln t}{n_i}} + \dfrac{3 b \ln t}{n_i}$

where $V_i$ is the sample variance of the rewards from arm $i$, $n_i$ is the number of times arm $i$ has been played, and $b$ bounds the reward range.
An advantage of UCBV over algorithms that do not use the sample variance is that the regret bound involves $\sigma_i^2 / \Delta_i$ instead of $b^2 / \Delta_i$, i.e. it scales with the true variance of each suboptimal arm rather than with the squared reward range.
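A sketch of UCBV with the same interface as the agents above, assuming rewards bounded by a range `b` and using the empirical variance in the bonus; the class name is illustrative and the constants follow the commonly stated form of the exploration term.

```python
import numpy as np

class UCBV:
    def __init__(self, k, b=1.0):
        self.b = b                       # assumed bound on the reward range
        self.counts = np.zeros(k)
        self.sums = np.zeros(k)          # sum of rewards per arm
        self.sq_sums = np.zeros(k)       # sum of squared rewards per arm
        self.t = 0

    def select(self, rng=None):
        self.t += 1
        untried = np.where(self.counts == 0)[0]
        if len(untried) > 0:
            return int(untried[0])
        means = self.sums / self.counts
        # Empirical variance, clipped to avoid tiny negative values from rounding.
        variances = np.maximum(self.sq_sums / self.counts - means ** 2, 0.0)
        log_t = np.log(self.t)
        bonus = np.sqrt(2.0 * variances * log_t / self.counts) + 3.0 * self.b * log_t / self.counts
        return int(np.argmax(means + bonus))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward
        self.sq_sums[arm] += reward ** 2
```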


Experiment 1 :

Dataset : Normal distribution with and
K : 10
Total time steps : 1000
Number of independent runs : 2000
We calculated the average reward and the optimal action percentage at each time step.
Refer to the graphs below :

Graphs : average reward and optimal action percentage
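For reference, a hypothetical experiment harness along the lines described above (2000 independent runs of 1000 steps on a 10-armed testbed), reusing the illustrative classes sketched earlier; it returns the per-step average reward and optimal-action percentage that the graphs plot.

```python
import numpy as np

def run_experiment(agent_factory, n_runs=2000, n_steps=1000, k=10, seed=0):
    """Average reward and optimal-action percentage at each time step,
    averaged over independent randomly generated bandit problems."""
    rng = np.random.default_rng(seed)
    rewards = np.zeros(n_steps)
    optimal = np.zeros(n_steps)
    for _ in range(n_runs):
        bandit = GaussianBandit(k, rng)   # hypothetical environment from the earlier sketch
        agent = agent_factory(k)
        best = bandit.optimal_arm()
        for t in range(n_steps):
            arm = agent.select(rng)
            r = bandit.pull(arm)
            agent.update(arm, r)
            rewards[t] += r
            optimal[t] += (arm == best)
    return rewards / n_runs, 100.0 * optimal / n_runs

# Example usage: avg_reward, pct_optimal = run_experiment(lambda k: UCB1(k))
```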

Experiment 2 :

Dataset : Heavy-tailed distribution with
K : 10
Total time steps : 1000
Number of independent runs : 2000
We calculated the average reward and the optimal action percentage at each time step.
Refer to the graphs below :

Graphs : average reward and optimal action percentage

Result :

From both experiments we found that UCBV performed better than UCB1, which in turn performed better than the ε-greedy algorithm.


Conclusion :

The report shows that the relative performance of these algorithms is similar on the normal and heavy-tailed distributions and does not depend on the true mean reward values of the arms. We can therefore say that the relative behaviour of these algorithms does not change when the reward distribution of the data changes.

