
Lunar Lander Example is Seriously Regressed #256

Open
ntraft opened this issue Nov 21, 2022 · 1 comment

ntraft commented Nov 21, 2022

Description

I just found out that the original version of the Lunar Lander example was able to land successfully sometimes. In the current code, it never even gets remotely close. It can't even get a positive score.

The original code says:

# This is a work in progress, and currently takes ~100 generations to
# find a network that can land with a score >= 200 at least a couple of
# times.  It has yet to solve the environment

In the current code, I can run it for 500+ generations without the reward ever cresting above 0, so something has seriously regressed. In reading the code, I now realize that the compute_fitness function makes no sense to me, so I believe rewards are being confused with network outputs somewhere. Also, the actual scores obtained when running the networks afterward are nowhere near the "fitness" being plotted, which points to a complete disconnect between "fitness" and actual score.

I will be debugging this in the next couple of days, but wanted to report the issue ahead of time.

To Reproduce

Steps to reproduce the behavior:

  1. cd examples/openai-lander
  2. python evolve.py
  3. See a fitness.svg plot like the one below. We can't achieve a positive reward (solving the task would be a reward of +200).

[fitness.svg plot]

ntraft commented Nov 27, 2022

Just thought I should leave an update on this issue...

Things I've learned:

  1. The Lander example was never really working well.
    • I'm actually not sure if I understand the author's comment which I quoted above. But when I run that version of the code, I see that the networks in general perform terribly, and in 100 generations there will be one or two times when the thing accidentally and miraculously gets 200+ points. However, the behavior is extremely random, and does not indicate to me that the problem was "learned" at all.
  2. The example is trying to learn by doing reward prediction instead of using the episode reward directly as fitness (see the sketch after this list for the direct-reward alternative).
    • I believe this has serious pitfalls. For example, if the lander doesn't fire its engine, it (often) doesn't receive any penalty, so a great way to score well at "reward prediction" is to predict a reward of 0 for doing nothing. I suspect there are other weird feedback loops and kinds of mode collapse like this.
    • I think the plot I posted above makes this obvious. We quickly converge to 0 reward-prediction error, but that doesn't help us actually solve the environment at all. When we look at the actual simulation scores, we're doing just as poorly as at the start of evolution.
  3. This commit is the one which regressed the example even further.
    • Before, fitness was a combination of overall score and reward prediction error. In this commit, it was changed to be only reward prediction. The comment describing the "composite fitness" was not changed, so it's not clear whether the change was accidental.
    • The new evaluation format prevents using the actual score, so fitness can only be derived from reward prediction.
  4. In general, Lunar Lander is probably going to be a very hard problem for NEAT. We receive basically no rewards until we land. It's extremely hard to discover this landing action by accident.
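
For contrast, here is a minimal sketch of what I mean by using the reward directly as fitness. This is not the repo's evolve.py; it assumes neat-python plus the classic Gym API ("LunarLander-v2" with a 4-tuple step() and reset() returning only the observation; newer gym/gymnasium versions differ):

import gym
import neat
import numpy as np

env = gym.make("LunarLander-v2")

def eval_genomes(genomes, config):
    for genome_id, genome in genomes:
        net = neat.nn.FeedForwardNetwork.create(genome, config)
        scores = []
        for _ in range(5):  # average a few episodes to reduce noise
            observation = env.reset()
            total_reward, done = 0.0, False
            while not done:
                # take the action whose output node has the highest activation
                action = int(np.argmax(net.activate(observation)))
                observation, reward, done, _ = env.step(action)
                total_reward += reward
            scores.append(total_reward)
        # fitness is just the mean episode score; no reward prediction involved
        genome.fitness = float(np.mean(scores))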

Actions I'm taking in response to this:

  1. Refactoring the example so it can run on a wider range of Gym environments, so we can try it on something easier (but not as easy as cart-pole, which it seems to crush with almost no effort).
  2. Refactoring to restore the original composite fitness formula, and making how fitness is computed configurable (a rough sketch of what I mean follows below).
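
A rough sketch of the kind of configurable composite fitness I have in mind (the name, weights, and inputs here are placeholders of mine, not the exact pre-regression formula, which lives in the commit history):

def composite_fitness(episode_score, prediction_error,
                      score_weight=1.0, error_weight=1.0):
    """Combine the actual simulation score with the reward-prediction error.

    episode_score    -- mean reward actually obtained over the test episodes
    prediction_error -- mean error of the network's reward estimates
    """
    return score_weight * episode_score - error_weight * prediction_error

Setting error_weight=0.0 recovers plain score-based fitness, and score_weight=0.0 recovers the current prediction-only behavior.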
