In this challenge, your goal is to train an RL agent to solve a personalization task that is simulated as a synthetic contextual bandit.

In the given environment, the state space is continuous and is represented by a 100-dimensional hypercube, and the action space is discrete and consists of 100 fixed 100-dimensional vectors. The reward signal is a bit convoluted in its nature, but it's intentionally made this way to mimic human behavioral preferences (and we can discuss what it actually means or how it's designed in more detail).

The rationale behind such an environment is the following: the set of available actions represents the possible recommendations, and the observed states are the parameterized representations of the persons to whom the recommendations are provided. For each state-action pair, the received reward value represents the suitability of the provided recommendation for the given person, with 1 being the best recommendation and -1 being the worst.
Install the dependencies with `pip install -r requirements.txt`, then run with `python main.py`.
Essentially, this simple repository consists of the following files:

- `environment.py` --- contains the class `SampleEnv` that creates an OpenAI Gym contextual bandit environment
- `main.py` --- trains a policy gradient agent, serving as a basic baseline; modify this file to implement and train your agent
The generated `SampleEnv` environment inherits from `gym.Env` and, as such, has the following methods:

- `reset()` --- observe a new state
- `step(action)` --- take an action and return the result
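For instance, a minimal interaction loop might look like the following sketch. It assumes `SampleEnv` takes no required constructor arguments and follows the classic Gym API, where `step` returns an `(observation, reward, done, info)` tuple; check `environment.py` and `main.py` for the actual signatures.

```python
import numpy as np
from environment import SampleEnv

env = SampleEnv()  # constructor arguments, if any, are an assumption here

state = env.reset()               # observe a new 100-dimensional state
action = np.random.randint(100)   # pick one of the 100 fixed actions by index
_, reward, done, info = env.step(action)  # reward is normalized to [-1, 1]
print(f"state shape: {np.shape(state)}, reward: {reward:.3f}")
```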
The above methods are technically sufficient to solve the environment. Other useful methods include:

- `evaluate_agent(agent)` --- compute the deterministic performance of the agent's policy on the environment
- `restart()` --- fully recreate the environment; should be called between the training of different agents for reproducibility
- `observe(num=1)` --- observe new states; identical to `reset` but can sample multiple states (`num`) simultaneously
- `compute_reward(s, a_ind)` --- compute the normalized reward for a state `s` and an action index `a_ind`
- `compute_reward_raw(s, a)` --- compute the un-normalized reward value of a state-action pair `(s, a)`
- `print_action_histogram()` --- print the histogram of the optimal actions; ideally an agent should provide a similar histogram
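As an illustration, `observe` and `compute_reward` can be combined to collect a batch of labeled data. The sketch below assumes `observe` returns an array of shape `(num, 100)` and that `compute_reward` accepts a single state and a single action index:

```python
import numpy as np
from environment import SampleEnv

env = SampleEnv()
states = env.observe(num=256)  # assumed shape: (256, 100)

# Query the reward of every action on every observed state to build a
# reward table. This brute-force scan is only for illustration; with 100
# actions it is still cheap, but a real agent should learn from fewer queries.
rewards = np.array([[env.compute_reward(s, a_ind) for a_ind in range(100)]
                    for s in states])
print("best achievable mean reward on this batch:", rewards.max(axis=1).mean())
```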
By default, the reward values returned by the environment are normalized, i.e. the optimal reward for any state `s` is 1 and the average reward is 0. Hence any sensible agent should achieve a positive return, and the optimal agent has a return of 1.
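A quick sanity check of this normalization, under the same `observe`/`compute_reward` signature assumptions as above: a uniformly random policy should score close to 0.

```python
import numpy as np
from environment import SampleEnv

env = SampleEnv()
states = env.observe(num=1000)  # assumed shape: (1000, 100)

# A uniformly random policy; its mean normalized reward should be near 0.
random_rewards = [env.compute_reward(s, np.random.randint(100)) for s in states]
print("random policy mean reward:", np.mean(random_rewards))  # expected ~0.0
```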
For example, the current baseline agent achieves a performance score of 0.2318. The intended outcome is to train an agent that demonstrates good performance, e.g. > 0.8 or so.
If you manage to obtain such an agent, please let me know!