# Hindsight Experience Replay

Some notes on the implementation of Hindsight Experience Replay.

## Expected Results

If you run the Fetch example, then you should get results like this:

*(plot: Fetch HER results)*

If you run the GridWorld example, then you should get results like this:

*(plot: HER Gridworld results)*

Note that these examples use HER combined with DQN and SAC, and not DDPG.

These plots are generated using viskit.

## Goal-based environments and `ObsDictRelabelingBuffer`

See here.
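For reference, goal-based environments expose dictionary observations and a reward that can be recomputed for any goal, which is what makes relabeling possible. Below is a minimal sketch in the Gym `GoalEnv` style; the environment name `FetchReach-v1` is just an example and is not specific to this codebase:

```python
import gym

# Goal-based environments return dictionary observations with three keys:
#   'observation'    - the raw state features
#   'achieved_goal'  - the goal actually reached in this state
#   'desired_goal'   - the goal the agent was asked to reach
env = gym.make('FetchReach-v1')   # example env; any GoalEnv-style env works
obs = env.reset()                 # older gym API: reset() returns the observation dict directly
print(sorted(obs.keys()))         # ['achieved_goal', 'desired_goal', 'observation']

# Because the reward is a function of (achieved_goal, desired_goal) only,
# it can be recomputed after the fact for any relabeled goal:
r = env.compute_reward(obs['achieved_goal'], obs['desired_goal'], info={})
```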

## Implementation Difference

This HER implementation is slightly different from the one presented in the paper: rather than relabeling goals when saving data to the replay buffer, goals are relabeled when sampling from the replay buffer.

In other words, HER in the paper does this:

Data collection
    1. Sample $(s, a, r, s', g) \sim \text{ENV}$.
    2. Save $(s, a, r, s', g)$ into replay buffer $\mathcal B$.
    3. For i = 1, ..., K:
        Sample $g_i$ using the future strategy.
        Recompute reward $r_i = f(s', g_i)$.
        Save $(s, a, r_i, s', g_i)$ into replay buffer $\mathcal B$.
Train time
    1. Sample $(s, a, r, s', g)$ from replay buffer $\mathcal B$.
    2. Train Q function on $(s, a, r, s', g)$.
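A minimal Python sketch of this insert-time relabeling; `buffer`, `sample_future_goal`, and `compute_reward` are hypothetical helpers, not names from this codebase:

```python
def store_episode_with_hindsight(buffer, episode, K, sample_future_goal, compute_reward):
    """Paper-style HER: relabel goals when *saving* to the replay buffer."""
    for t, (s, a, r, s_next, g) in enumerate(episode):
        # Save the original transition with the goal used during the rollout.
        buffer.add(s, a, r, s_next, g)
        # For i = 1, ..., K: relabel with a goal achieved later in the same
        # episode (the "future" strategy) and recompute the reward.
        for _ in range(K):
            g_i = sample_future_goal(episode, t)
            r_i = compute_reward(s_next, g_i)
            buffer.add(s, a, r_i, s_next, g_i)
```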

The implementation here does:

Data collection
    1. Sample $(s, a, r, s', g) \sim \text{ENV}$.
    2. Save $(s, a, r, s', g)$ into replay buffer $\mathcal B$.
Train time
    1. Sample $(s, a, r, s', g)$ from replay buffer $\mathcal B$.
    2a. With probability 1/(K+1):
        Train Q function on $(s, a, r, s', g)$.
    2b. With probability 1 - 1/(K+1):
        Sample $g'$ using the future strategy.
        Recompute reward $r' = f(s', g')$.
        Train Q function on $(s, a, r', s', g')$.
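The same hypothetical helpers can be used to sketch this sample-time relabeling (in the actual code, this is the job of `ObsDictRelabelingBuffer`; the buffer interface below is illustrative only):

```python
import random

def sample_with_hindsight(buffer, K, sample_future_goal, compute_reward):
    """Sample-time HER: relabel goals when *reading from* the replay buffer."""
    # Hypothetical buffer interface: each stored transition remembers its
    # episode and time step so that future goals can be drawn at train time.
    s, a, r, s_next, g, episode, t = buffer.sample()
    if random.random() < 1.0 / (K + 1):
        # 2a. Keep the goal that was used during the rollout.
        return s, a, r, s_next, g
    # 2b. Relabel with a goal achieved later in the same episode
    #     and recompute the reward for the new goal.
    g_new = sample_future_goal(episode, t)
    r_new = compute_reward(s_next, g_new)
    return s, a, r_new, s_next, g_new
```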

Both implementations effectively do the same thing: with probability 1/(K+1), you train on the goal used during the rollout; otherwise, you train on a goal resampled with the future strategy. For example, with K = 4, 20% of the training tuples keep the original goal and 80% use a relabeled one.