Meta-Learned RL Objective Functions in JAX

GROOVE is the official implementation of the following publications:

  1. Discovering General Reinforcement Learning Algorithms with Adversarial Environment Design, NeurIPS 2023 [ArXiv | NeurIPS | Twitter]
    • Learned Policy Gradient (LPG),
    • Prioritized Level Replay (PLR),
    • General RL Algorithms Obtained Via Environment Design (GROOVE),
    • Grid-World environment from the LPG paper.
  2. Discovering Temporally-Aware Reinforcement Learning Algorithms, ICLR 2024 [ArXiv]
    • Temporally-Aware LPG (TA-LPG),
    • Evolutionary Strategies (ES) with antithetic task sampling (sketched below).
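
As a rough illustration of antithetic task sampling (not the repository's actual API; `evaluate`, `num_levels`, and all other names below are hypothetical): each perturbation is paired with its mirror, and both members of the pair are scored on the same sampled tasks, which reduces the variance of the ES gradient estimate.

```python
import jax
import jax.numpy as jnp

def antithetic_es_grad(rng, params, sigma, pop_size, num_tasks, evaluate, num_levels=1000):
    """Sketch of an ES step with antithetic task sampling (illustrative only).

    `evaluate(perturbed_params, task_ids) -> fitness` is a hypothetical scoring
    function; `num_levels` is a hypothetical number of available levels.
    """
    rng_eps, rng_tasks = jax.random.split(rng)
    half = pop_size // 2
    # Sample half of the perturbations and mirror them (antithetic pairs).
    eps = jax.random.normal(rng_eps, (half,) + params.shape)
    eps = jnp.concatenate([eps, -eps], axis=0)
    # Sample one set of tasks per pair and share it between +eps and -eps.
    task_ids = jax.random.randint(rng_tasks, (half, num_tasks), 0, num_levels)
    task_ids = jnp.concatenate([task_ids, task_ids], axis=0)
    # Score every population member in parallel.
    fitness = jax.vmap(evaluate)(params + sigma * eps, task_ids)
    # Vanilla ES gradient estimate from the antithetic population.
    grad = (fitness[:, None] * eps.reshape(pop_size, -1)).mean(axis=0) / sigma
    return grad.reshape(params.shape)
```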

All scripts are JIT-compiled end-to-end and make extensive use of JAX-based parallelization, enabling meta-training in under 3 hours on a single GPU!
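
The pattern this relies on is roughly the following (an illustrative sketch, not the repository's code; agent_update and its arguments are hypothetical): the whole meta-update is wrapped in jax.jit and the per-agent inner-loop update is vectorized across the meta-training batch with jax.vmap.

```python
import jax
import jax.numpy as jnp

def agent_update(agent_state, env_params):
    # Hypothetical placeholder for one inner-loop RL update under the
    # current meta-learned objective; returns the new state and a metric.
    return agent_state, jnp.zeros(())

@jax.jit
def meta_update(agent_states, env_params_batch):
    # Run every agent's update in parallel on-device, then aggregate metrics.
    agent_states, metrics = jax.vmap(agent_update)(agent_states, env_params_batch)
    return agent_states, metrics.mean()
```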

Update (April 2023): The LPG ES hyperparameters were misreported in this repository and the paper, specifically the initial learning rate (1e-4 -> 1e-2) and sigma (3e-3 -> 1e-1). Both have now been corrected.

Setup | Running experiments | Citation

Setup

Requirements

All requirements are found in setup/: requirements-base.txt contains the majority of packages, while requirements-cpu.txt and requirements-gpu.txt contain the CPU- and GPU-specific packages respectively.

Some key packages include:

  • RL Environments: gymnax
  • Neural Networks: flax
  • Optimization: optax, evosax
  • Logging: wandb

Local installation (CPU)

pip install $(cat setup/requirements-base.txt setup/requirements-cpu.txt)

Docker installation (GPU)

  1. Build the Docker image:
cd setup/docker && ./build_gpu.sh && cd ../..
  2. (To enable WandB logging) Add your account key to setup/wandb_key:
echo [KEY] > setup/wandb_key

Running experiments

Meta-training is executed with python3.8 train.py, with all arguments found in experiments/parse_args.py.

| Argument | Description |
| --- | --- |
| --env_mode [env_mode] | Sets the environment mode (see below). |
| --num_agents [agents] | Sets the meta-training batch size. |
| --num_mini_batches [mini_batches] | Computes each update in sequential mini-batches, allowing large batch sizes with limited memory (see the sketch after this table). Recommended: lower this to the smallest value that fits in memory. |
| --debug | Disables JIT compilation. |
| --log --wandb_entity [entity] --wandb_project [project] | Enables logging to WandB. |
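
To picture the --num_mini_batches trade-off, here is a minimal sketch (not the repository's implementation; loss_fn and the batch layout are hypothetical) of accumulating gradients over sequential mini-batches, so that peak memory scales with the mini-batch size rather than the full batch:

```python
import jax
import jax.numpy as jnp

def minibatched_grad(loss_fn, params, batch, num_mini_batches):
    """Average gradients over sequential mini-batches (illustrative only)."""
    # Split the leading batch axis into (num_mini_batches, batch_size // num_mini_batches).
    mini_batches = jax.tree_util.tree_map(
        lambda x: x.reshape((num_mini_batches, -1) + x.shape[1:]), batch)

    def scan_step(grad_sum, mini_batch):
        grads = jax.grad(loss_fn)(params, mini_batch)
        return jax.tree_util.tree_map(jnp.add, grad_sum, grads), None

    init = jax.tree_util.tree_map(jnp.zeros_like, params)
    grad_sum, _ = jax.lax.scan(scan_step, init, mini_batches)
    return jax.tree_util.tree_map(lambda g: g / num_mini_batches, grad_sum)
```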

Grid-World environments

| Environment mode | Description | Lifetime (# of updates) |
| --- | --- | --- |
| tabular | Five tabular levels from LPG | Variable |
| mazes | Maze levels from MiniMax | 2500 |
| all_shortlife | Uniformly sampled levels | 250 |
| all_vrandlife | Uniformly sampled levels | 10-250 (log-sampled) |
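
For the variable-lifetime mode, log-sampling a lifetime in [10, 250] could look like the following (an illustrative sketch, not the repository's code):

```python
import jax
import jax.numpy as jnp

def sample_lifetime(rng, min_updates=10, max_updates=250):
    # Illustrative: draw the number of agent updates log-uniformly from [10, 250].
    log_lifetime = jax.random.uniform(
        rng, minval=jnp.log(min_updates), maxval=jnp.log(max_updates))
    return jnp.exp(log_lifetime).astype(jnp.int32)
```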

Examples

| Experiment | Command | Example run (WandB) |
| --- | --- | --- |
| LPG (meta-gradient) | python3.8 train.py --num_agents 512 --num_mini_batches 16 --train_steps 5000 --log --wandb_entity [entity] --wandb_project [project] | Link |
| GROOVE | LPG with --score_function alg_regret (algorithmic regret is computed every step due to end-to-end compilation, so currently very inefficient) | TBC |
| TA-LPG | LPG with --num_mini_batches 8 --train_steps 2500 --use_es --lifetime_conditioning --lpg_learning_rate 0.01 --env_mode all_vrandlife | TBC |

Docker

To execute CPU or GPU docker containers, run the relevant script (with the GPU index as the first argument for the GPU script).

./run_gpu.sh [GPU id] python3.8 train.py [args]
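
For example, to launch the LPG run from the Examples table above on GPU 0 (arguments illustrative):

./run_gpu.sh 0 python3.8 train.py --num_agents 512 --num_mini_batches 16 --train_steps 5000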

Citation

If you use this implementation in your work, please cite us with the following:

@inproceedings{jackson2023discovering,
    author={Jackson, Matthew Thomas and Jiang, Minqi and Parker-Holder, Jack and Vuorio, Risto and Lu, Chris and Farquhar, Gregory and Whiteson, Shimon and Foerster, Jakob Nicolaus},
    booktitle = {Advances in Neural Information Processing Systems},
    title = {Discovering General Reinforcement Learning Algorithms with Adversarial Environment Design},
    volume = {36},
    year = {2023}
}
@inproceedings{jackson2024discovering,
    author={Jackson, Matthew Thomas and Lu, Chris and Kirsch, Louis and Lange, Robert Tjarko and Whiteson, Shimon and Foerster, Jakob Nicolaus},
    booktitle = {International Conference on Learning Representations},
    title = {Discovering Temporally-Aware Reinforcement Learning Algorithms},
    volume = {12},
    year = {2024}
}

Coming soon

  • Speed up GROOVE by removing recomputation of algorithmic regret every step.
  • Meta-testing script for checkpointed models.
  • Alternative UED metrics (PVL, MaxMC).
