DistributedRL training - Loss value is so high and not coming down #87

kalum84 opened this issue Jan 23, 2019 · 1 comment

kalum84 commented Jan 23, 2019

Problem description

The loss values are very high and are not decreasing over time.

Problem details

We are trying to create a racing environment and use reinforcement learning to train a model to race in it, so we started from this example. We wanted to test how much time it takes to train a model and how fast it can get.
I used the same parameters as in the example, except for the following one:

   max_epoch_runtime_sec = 30

I also didn't change the code.
I have attached the output file from one agent. Please help me troubleshoot the issue.
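
For context, here is a minimal sketch of what a per-epoch runtime cap like max_epoch_runtime_sec typically controls during data gathering. This is not the example's actual code; the env/agent objects and their method names are placeholders.

    import time

    def run_epoch(env, agent, max_epoch_runtime_sec=30):
        # Gather experiences for one epoch, stopping when the episode ends
        # or when the wall-clock budget (30 s in our runs) is exhausted.
        start = time.time()
        state = env.reset()
        experiences = []
        done = False
        while not done and (time.time() - start) < max_epoch_runtime_sec:
            action = agent.act(state)                    # pick an action from the current policy
            next_state, reward, done = env.step(action)  # advance the simulation one step
            experiences.append((state, action, reward, next_state, done))
            state = next_state
        return experiences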

Experiment/Environment details

We started from the existing pretrained weights.
Training ran on Azure with 6 NV6 machines: 5 agents and the trainer.
While running the job I restarted the agents after some time (after 12 h),
then ran the training for another 20 h.
agent1.txt

@mitchellspryn (Contributor) commented

We discussed a bit offline, but this paper might be of interest to you.

The algorithm as written does not scale indefinitely. Try 3 or 4 machines.

Also, the model will overfit - there is no concept of early stopping. Try checking back on it after an hour or an hour and a half.
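
Since the example has no early stopping, here is a rough sketch of the kind of wall-clock checkpointing the trainer could add so the best weights are kept rather than the final, possibly overfit, ones. The model/train_step/evaluate names are placeholders, not the example's API.

    import time

    CHECKPOINT_EVERY_SEC = 15 * 60   # evaluate and maybe save every 15 minutes
    TRAIN_BUDGET_SEC = 90 * 60       # stop after ~1.5 hours, per the advice above

    def train_with_checkpoints(model, train_step, evaluate):
        start = time.time()
        last_checkpoint = start
        best_loss = float('inf')
        while time.time() - start < TRAIN_BUDGET_SEC:
            train_step(model)                        # one minibatch / gradient update
            if time.time() - last_checkpoint >= CHECKPOINT_EVERY_SEC:
                val_loss = evaluate(model)           # hold-out or rollout-based metric
                if val_loss < best_loss:
                    best_loss = val_loss
                    model.save('best_model.h5')      # keep only the best snapshot
                last_checkpoint = time.time()
        return best_loss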
