
Varying number of arm for single-player problems? #123

Open
pratayaa opened this issue Apr 25, 2018 · 2 comments
Labels: enhancement (I have to improve something which already works not too badly) · non-stationary (For non-stationary bandits simulations) · question (Things I'm not sure how to solve)

Comments


pratayaa commented Apr 25, 2018

I want to know the initialization steps for a varying number of arms in MEGA.

@Naereen changed the title from "Want to know the initializing steps for varying number of arm in MEGA, and apologies from my side of asking it in other repository." to "Varying number of arm for single-player problems?" Apr 25, 2018
@Naereen self-assigned this Apr 25, 2018
@Naereen added the "enhancement" and "question" labels Apr 25, 2018

Naereen commented Apr 25, 2018

First, let K(t) denote the number of arms at time t. Classic problems are for K(t)=K for all t, and generic problems concern a K(t) that varies and can possibly be different at each time step t.

First, a few specifications and restrictions regarding the model:

  • It will be almost impossible to handle the case where K(t) is unbounded or has an unknown bound. I suggest restricting to problems where there are bounds K_min (= 1?) and K_max such that, for all t, K(t) is in [K_min, …, K_max]. This way, you can store the history of rewards, pulls, etc. in one big np.array, with no need for anything fancy.
  • SMPyBandits only supports problems with static distributions (nu_1, …, nu_K) from which the reward r(t) ~ nu_{A(t)} is sampled; it does not support dynamic distributions nu_k(t), whether they change occasionally or at every step. So it would be easier to start by adding support for problems with an underlying (static) list of arm distributions nu_{K_min}, …, nu_{K_max}, and simply a list of "activated" arms act(t), a subset of [K_min, …, K_max], possibly changing at each time step. This way, at time t, there is a list act(t) of arms and a list [ nu_k for k in act(t) ] of distributions from which to sample the rewards r(t) (see the sketch below).
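
To make the second point concrete, here is a minimal sketch of what such a model could look like. This is not the SMPyBandits API: the names activated_arms and draw_reward, the Bernoulli means, and the activation rule are all illustrative assumptions of mine.

```python
import numpy as np

# A minimal sketch (not the SMPyBandits API): K_max static Bernoulli arms
# nu_1, ..., nu_{K_max}, plus a rule act(t) giving the arms activated at time t.
K_max = 9
means = np.linspace(0.1, 0.9, K_max)   # static means, never changing over time
rng = np.random.default_rng(42)

def activated_arms(t):
    """Hypothetical activation rule: one new arm appears every 100 steps, up to K_max."""
    return list(range(min(K_max, 1 + t // 100)))

def draw_reward(k):
    """Sample r(t) ~ nu_k from the static Bernoulli distribution of arm k."""
    return float(rng.random() < means[k])

# At time t, the learner may only pull an arm from act(t):
t = 250
act_t = activated_arms(t)     # here act_t == [0, 1, 2]
chosen_arm = act_t[-1]        # a real policy would use choiceFromSubSet(availableArms=act_t)
reward = draw_reward(chosen_arm)
```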

Directions towards implementing this:

  1. I would write a new MAB class, child of the MAB.DynamicMAB class (maybe called VaryingNBArmsMAB or something like this), that adds a method activated_arms(t) giving the list of "activated" arms at each time t. Note that DynamicMAB keeps the size of the problem fixed but changes the means of arms 1, …, K at each call of newRandomArms(), so it's similar but different.
  2. Then you should write a new Evaluator class, child of Environment.Evaluator (maybe called EvaluatorVaryingNBArms), that uses act_t = env.activated_arms(t) at each time step to get this list of activated arms, and passes it to the policy.choiceFromSubSet(availableArms=act_t) method of each policy, instead of policy.choice() (yes, this is already implemented, and it's available for most policies, including all index-based policies, see here in IndexPolicy.py). A rough sketch of steps 1–3 and 7 follows this list.
  3. You probably need to write a new Result class, ResultVaryingNBArms, which keeps in memory the set act_t at each time step (hint: use a np.array of bools, of size (horizon, K_max)).
  4. Write another main_varyingnbarms simulation script, inspired by main.py, that imports everything and uses an EvaluatorVaryingNBArms instead of Evaluator.
  5. Then in Policies/YourPolicyOfchoice.py, make the necessary changes to add a choiceFromSubSet() method that makes sense.
  6. First try the simulations without plotting anything or computing any regret; if it works, continue.
  7. You need to think about what mathematical definition to give the regret in these non-static problems, and implement the getCumulatedRegret method in EvaluatorVaryingNBArms in order to get regret plots.
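
Here is the rough, self-contained sketch announced above. The class and function names other than choiceFromSubSet(availableArms=...) are placeholders of mine, startGame() and getReward() are assumed to follow the usual policy interface, and the regret definition used here (best mean among the *activated* arms at each step) is only one possibility.

```python
import numpy as np

class VaryingNBArmsMAB:
    """Rough sketch of the proposed MAB class (name and internals are placeholders):
    static Bernoulli means, plus a rule giving the activated arms at each time t."""

    def __init__(self, means, activation_rule):
        self.means = np.asarray(means)
        self.nbArms = len(self.means)          # this plays the role of K_max
        self._activation_rule = activation_rule
        self._rng = np.random.default_rng()

    def activated_arms(self, t):
        """Return the list act(t) of arms available at time t."""
        return self._activation_rule(t)

    def draw(self, k):
        """Bernoulli reward r(t) ~ nu_k, from the static distribution of arm k."""
        return float(self._rng.random() < self.means[k])


def evaluate_one_policy(env, policy, horizon):
    """Rough sketch of the inner loop of an EvaluatorVaryingNBArms: restrict the
    policy to act(t) via choiceFromSubSet, store act(t) as a boolean array (as a
    ResultVaryingNBArms could do), and compare, at each step, the mean of the
    pulled arm to the best mean among the activated arms (one possible regret
    definition for this non-static problem)."""
    activated = np.zeros((horizon, env.nbArms), dtype=bool)
    instant_regret = np.zeros(horizon)
    policy.startGame()
    for t in range(horizon):
        act_t = env.activated_arms(t)
        activated[t, act_t] = True
        k = policy.choiceFromSubSet(availableArms=act_t)   # instead of policy.choice()
        reward = env.draw(k)
        policy.getReward(k, reward)
        instant_regret[t] = env.means[act_t].max() - env.means[k]
    return np.cumsum(instant_regret), activated
```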

These steps cover adding support for single-player simulations. I would suggest starting by trying this.

For multi-player

Then, if you want to add support for multi-player simulations as well, more work is needed, but it will be similar:

  1. Write a ResultMultiPlayersVaryingNBArms class and an EvaluatorMultiPlayersVaryingNBArms class, which implement the changes explained above.
  2. Then in PoliciesMultiPlayers/YourPolicyOfchoice.py, make the necessary changes to add a choiceFromSubSet() method that makes sense.
  3. Do the same work for the regret (see the sketch after this list for one possible per-step definition).
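
For the multi-player regret, one possible per-step definition, by analogy with the classical centralized regret, compares the sum of the M best activated means to the means actually collected. The following sketch is only an illustration under my own assumptions (in particular, that colliding players receive no reward, which is one common collision model); the function name is hypothetical.

```python
import numpy as np

def multiplayer_instant_regret(means, act_t, chosen_arms):
    """One possible per-step regret for M players restricted to the activated
    arms act(t): sum of the M best activated means, minus the means of the arms
    pulled without collision (assuming colliding players receive no reward)."""
    means = np.asarray(means)
    M = len(chosen_arms)
    best = np.sort(means[act_t])[-M:].sum()                  # M best activated arms
    counts = {k: chosen_arms.count(k) for k in set(chosen_arms)}
    collected = sum(means[k] for k, c in counts.items() if c == 1)
    return best - collected
```

For instance, with means = [0.1, 0.5, 0.9], act_t = [0, 1, 2] and chosen_arms = [2, 2], both players collide on arm 2, nothing is collected, and the instant regret is 0.9 + 0.5 = 1.4.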

Final remarks

It shouldn't be impossible, but all this is not trivial either.

I would say I'm interested in doing it… but I don't have time right now (and won't for a few weeks or maybe months!).


Naereen commented Apr 25, 2018

Needless to say, please do your best in trying all this, and I will consider pull requests only when working simulations can be run.
(I don't and won't have time to debug or write code with you, sorry.)
