Require HEAD version of universe #22

Open · wants to merge 1 commit into master

Conversation

@AdamStelmaszczyk (Contributor)

Without it, visualizations will not work properly, see issue: openai/universe-starter-agent#133
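
For reference, pinning universe to its repository HEAD would look roughly like this in requirements.txt (a sketch; the exact line and commit hash used in this PR may differ):

```
# Install universe straight from GitHub instead of the last PyPI release,
# which predates the March 2017 visualization fix. A specific commit can be
# pinned by appending @<commit-hash> before #egg.
git+https://github.com/openai/universe.git#egg=universe
```
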
@pathak22 (Owner) commented Feb 17, 2018

Oh, it has been updated -- thanks for spotting it. Are you sure that changing the version of universe doesn't break other things?

@AdamStelmaszczyk (Contributor, Author)

> Oh, it has been updated -- thanks for spotting it.

I'm not 100% sure what you are referring to; my feeling is that it refers to universe.

The universe commit which fixes visualizations for Python 2.7 is from March 2017 :)

However, it seems that the universe and universe-starter-agent projects are deprecated and unsupported. The last commits to universe and universe-starter-agent were in May 2017, and nobody answers PRs or issues.

> Are you sure that changing the version of universe doesn't break other things?

Unfortunately, I'm not sure about that. To be sure I'd need to reproduce your Mario and Doom results - that would take some effort.

Let me tell you why I came to noreward-rl in the first place. I wanted to run it on Montezuma and see how it works (I was hoping it would help with exploration). So I ran it and the scores were close to 0 (for Montezuma that's not surprising :)). But I thought this isn't a fair test: the hyperparams are probably wrong (I took them from Mario), so perhaps the ICM didn't perform as well as it could.

So I tried to run noreward-rl on Breakout and Seaquest from Atari, where the score is > 0 (yes, I saw this question). The scores were way too low. I thought: ok, noreward-rl is based on A3C from universe-starter-agent, so let's first reproduce that and find correct hyperparams for vanilla A3C. I tried, but I couldn't reproduce the A3C results. I searched, and other people also couldn't reproduce it on Atari games other than Pong. It seems to me there are 2 options:

  1. universe-starter-agent correctly implements A3C; it's just that nobody used the correct hyperparameters. I searched for them, like many others, but nobody found working ones.

  2. universe-starter-agent incorrectly implements A3C. There are one or more bugs that make the results on Atari games (other than Pong) much lower than the original A3C results. That case would be unfortunate for noreward-rl.

Let me ask you a possibly "difficult" question, sorry for that :) But it would shed a lot of light:

Did you make sure universe-starter-agent was working correctly before using it, i.e. have you or somebody else reproduced Atari results with it? If yes, what hyperparams are needed? Or do you know which hyperparams I need to tune? I could run a hyperparameter grid search, roughly along the lines sketched below.
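
Something like this, assuming a blocking training entry point (the --learning-rate and --entropy-cost flags below are hypothetical; in universe-starter-agent those values are hard-coded and would need to be exposed first):

```python
# Sketch of a small hyper-parameter grid search over A3C runs.
import itertools
import subprocess

learning_rates = [1e-4, 2.5e-4, 1e-3]
entropy_costs = [0.001, 0.01, 0.1]

for lr, ent in itertools.product(learning_rates, entropy_costs):
    run_id = "breakout_lr{}_ent{}".format(lr, ent)
    # One training run per combination; sequential here, in practice spread
    # across machines or GPUs.
    subprocess.check_call([
        "python", "train.py",
        "--env-id", "BreakoutDeterministic-v3",
        "--log-dir", "/tmp/{}".format(run_id),
        "--learning-rate", str(lr),  # hypothetical flag
        "--entropy-cost", str(ent),  # hypothetical flag
    ])
```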

@pathak22 (Owner)

Alright, let me answer your "difficult" question, since I myself went through these steps 1.5 years back.

I guess universe-starter-agent has a correct implementation of A3C, but definitely with quite a few design changes, e.g., an unshared optimizer across workers, different hyper-parameters (input size, learning rate, etc.), and a different network architecture. I first "tuned" it to make sure I could reproduce ATARI results to some extent (note: it's quite hard to replicate the original paper's results because they use Torch and the initialization was different -- training is sensitive to this). I could get close to the results for "breakout" and a few other games in the "non-shared optimizer" scenario (see the original A3C paper's supplementary material), but did not get exactly the same numbers because of differences in initialization, Tensorflow vs. Torch, etc. By the word "tuning" above I meant: changing the architecture, changing the loss equation to the mean loss rather than the total loss, changing hyper-parameters, etc.
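
To make the mean-vs-total-loss point concrete, here is a rough TF1-style sketch (tensor names are illustrative; this is not the exact universe-starter-agent or noreward-rl code):

```python
import tensorflow as tf  # TF 1.x, matching the era of the codebase

# Per-timestep quantities for one rollout (placeholders just for illustration).
log_prob_a = tf.placeholder(tf.float32, [None])  # log pi(a_t | s_t) of taken actions
adv = tf.placeholder(tf.float32, [None])         # advantage estimates
v_pred = tf.placeholder(tf.float32, [None])      # value predictions
v_target = tf.placeholder(tf.float32, [None])    # bootstrapped returns

# "Total" loss: sums over the rollout, so the gradient scale grows with rollout length.
pi_loss_total = -tf.reduce_sum(log_prob_a * adv)
vf_loss_total = 0.5 * tf.reduce_sum(tf.square(v_pred - v_target))

# "Mean" loss: averages over the rollout, decoupling the effective step size
# from the rollout length.
pi_loss_mean = -tf.reduce_mean(log_prob_a * adv)
vf_loss_mean = 0.5 * tf.reduce_mean(tf.square(v_pred - v_target))
```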

But the main goal of our curiosity-driven work was to explore policy generalization across environments and levels, e.g., novel levels in Mario and novel mazes in VizDoom. Since generalization was not possible to test in ATARI games, we decided to use VizDoom, and I then "tuned" (see the definition of tuning above) the baseline A3C on VizDoom to get the best reward on the DoomMyWayHome dense-reward game (it could get the maximum reward -- see the results in the curiosity paper). All the hyper-parameters you see in noreward-rl are basically based on that tuning. I never tuned my curiosity model (except the coefficient between the forward and inverse losses), and all the hyper-parameters were taken from the best-performing baseline A3C on the DoomMyWayHome dense-reward game. Hence, all the comparisons were fair, and rather to the advantage of the baseline A3C.
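
For readers skimming this thread, the coefficient in question weights the forward (prediction) loss against the inverse (action-prediction) loss, roughly in the spirit of the curiosity paper's objective (a sketch with illustrative names and default values, not the actual noreward-rl code):

```python
def curiosity_total_loss(policy_loss, inverse_loss, forward_loss, beta=0.2, lam=0.1):
    """Combine the A3C policy loss with the curiosity (ICM) losses.

    beta trades off the forward-model loss against the inverse-model loss
    (the one curiosity coefficient mentioned above as tuned); lam scales the
    policy-gradient term against the curiosity terms.
    """
    return lam * policy_loss + (1.0 - beta) * inverse_loss + beta * forward_loss
```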

Hope it answers your question! :-)

@AdamStelmaszczyk (Contributor, Author)

It does, thanks a lot :)

Because you mentioned that the original A3C work used Torch (which I didn't know), I googled more and found the hyperparams they used. Also good to know about the universe-starter-agent design changes (which make everything harder).

Would you expect noreward-rl to improve the results on Montezuma over universe-starter-agent or not?

@pathak22 (Owner)

You can try tuning the hyper-parameters of the state-predictor version of curiosity from the noreward-rl codebase on the Montezuma game. That definitely has a higher chance than vanilla universe-starter-agent, as the latter does not contain any exploration incentive.
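
The exploration incentive referred to here is a prediction-error bonus, roughly along these lines (a sketch in the spirit of the curiosity paper, with illustrative names and scaling, not the exact noreward-rl code):

```python
import numpy as np

def intrinsic_reward(pred_next_feat, next_feat, eta=0.01):
    """Prediction-error bonus used by the state-predictor style of curiosity:
    the worse the learned model predicts the next state (or its features),
    the larger the bonus added to the extrinsic reward. eta is a scaling
    coefficient; the value here is illustrative.
    """
    return eta * 0.5 * float(np.sum(np.square(pred_next_feat - next_feat)))
```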
