MDP Agents on Bandit Tasks #101

Open
abrahamnunes opened this issue Jul 4, 2018 · 2 comments

Comments

@abrahamnunes
Owner

It is difficult to use an MDP agent on a Bandit task, mainly because of the eligibility trace update.

On a contextual 2-armed bandit task, the final action is $\mathbf u' = (0.5, 0.5)^\top$. The 0.5's are necessary to facilitate computation of the target $y_t = r_t - \mathbf u'^\top \mathbf Q \mathbf x'$ such that

$$\mathbf u'^\top \mathbf Q \mathbf x' = \tfrac{1}{2}\left[(\mathbf Q \mathbf x')_1 + (\mathbf Q \mathbf x')_2\right],$$

i.e. the mean of the two action values at the next state $\mathbf x'$.
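A minimal NumPy sketch of why the averaging action works for the target (the `Q` weights here are made-up illustrative values, not fitr's):

```python
import numpy as np

# With u' = (0.5, 0.5), the bilinear form u'^T Q x' picks out the
# column of Q for the next state x' and averages the two action values.
Q = np.array([[2., 0., 4., 0.],
              [0., 0., 6., 0.]])      # hypothetical action-value weights
x_next = np.array([0., 0., 1., 0.])   # next (outcome) state
u_next = np.array([0.5, 0.5])         # averaging "final action"

v = u_next @ Q @ x_next
print(v)   # 0.5 * (4 + 6) = 5.0
```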

However, the eligibility trace is updated as

$$\mathbf z \leftarrow \mathbf u \mathbf x^\top + \lambda \gamma\, \mathbf z,$$

which, in a 4-state (2 contexts, 2 outcomes) task with $\lambda = \gamma = 1$, and where $\mathbf x = (1, 0, 0, 0)^\top$, $\mathbf u = (1, 0)^\top$ and $\mathbf x' = (0, 0, 1, 0)^\top$, should result in a trace that looks like

$$\mathbf z = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix},$$

with credit assigned only to the action actually taken.
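The trace update above can be sketched in NumPy (a minimal illustration of the formula, not fitr's implementation):

```python
import numpy as np

# Trace update z <- u x^T + lam*gam*z on the 4-state
# (2-context, 2-outcome) example from this issue.
lam, gam = 1.0, 1.0
x = np.array([1., 0., 0., 0.])   # current state (context 1)
u = np.array([1., 0.])           # action actually taken
z = np.zeros((2, 4))             # eligibility trace (actions x states)

z = np.outer(u, x) + lam * gam * z
print(z)
# Using the true action u keeps the trace concentrated on (action 1, state 1);
# substituting u' = (0.5, 0.5) would instead smear credit across both actions.
```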

The current setup allows either the correct trace or the correct target calculation, but not both.

I think the solution may be to separate the trace-update function from the value-function update.
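A rough sketch of what that separation could look like (hypothetical function names and a standard linear $Q(\mathbf x, \mathbf u) = \mathbf u^\top \mathbf Q \mathbf x$ agent; the learning rate `alpha` is assumed):

```python
import numpy as np

def update_trace(z, u, x, lam=1.0, gam=1.0):
    # Trace uses the action actually taken, so credit stays on (u, x).
    return np.outer(u, x) + lam * gam * z

def update_value(Q, delta, z, alpha=0.1):
    # Value update consumes whatever TD error delta the target step
    # produced (e.g. via the averaging action u'), independently of z.
    return Q + alpha * delta * z

# One trial of the 2-context bandit example:
Q = np.zeros((2, 4))
z = np.zeros((2, 4))
x = np.array([1., 0., 0., 0.])
u = np.array([1., 0.])

z = update_trace(z, u, x)
delta = 1.0 - u @ Q @ x        # reward 1, current estimate u^T Q x
Q = update_value(Q, delta, z)
print(Q)                       # credit lands only on the taken action/state
```

Because the target calculation no longer flows through the trace, the averaging action $\mathbf u'$ can be used in the target without corrupting $\mathbf z$.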

@ARudiuk
Collaborator

ARudiuk commented Jul 4, 2018

Some of the math doesn't seem to be rendering @abrahamnunes

@hardik44fg

@abrahamnunes Try highlighting the important words; it will help others understand more easily.
