DistributedRL training - Loss value is so high and not coming down #87

kalum84 opened this issue Jan 23, 2019 · 1 comment

kalum84 commented Jan 23, 2019

Problem description

The loss values are very high and are not decreasing over time.

Problem details

We are trying to create a racing environment and use reinforcement learning to train a model to race in it, so we started from this example. We wanted to test how much time it takes to train a model and how fast it can get.
I used the same parameters as in the example, except for the following one:

   max_epoch_runtime_sec = 30

I also didn't change the code.
I have attached the output file from one agent. Please help me troubleshoot the issue.
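
For context, here is a minimal sketch of what a per-epoch runtime cap like max_epoch_runtime_sec typically controls during data gathering. This is not the example's actual code; the env/agent objects and their method names are placeholders.

    import time

    def run_epoch(env, agent, max_epoch_runtime_sec=30):
        # Gather experiences for one epoch, stopping when the episode ends
        # or when the wall-clock budget (30 s in our runs) is exhausted.
        start = time.time()
        state = env.reset()
        experiences = []
        done = False
        while not done and (time.time() - start) < max_epoch_runtime_sec:
            action = agent.act(state)                    # pick an action from the current policy
            next_state, reward, done = env.step(action)  # advance the simulation one step
            experiences.append((state, action, reward, next_state, done))
            state = next_state
        return experiences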

Experiment/Environment details

We started from the existing pretrained weights.
Training ran on Azure with 6 NV6 machines: 5 agents and the trainer.
While running the job I restarted the agents after some time (after 12 h),
then ran the training for another 20 h.
agent1.txt

@mitchellspryn (Contributor) commented

We discussed a bit offline, but this paper might be of interest to you.

The algorithm as written does not scale indefinitely. Try 3 or 4 machines.

Also, the model will overfit - there is no concept of early stopping. Try checking back on it after an hour or an hour and a half.
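
Since the example has no early stopping, here is a rough sketch of the kind of wall-clock checkpointing the trainer could add so the best weights are kept rather than the final, possibly overfit, ones. The model/train_step/evaluate names are placeholders, not the example's API.

    import time

    CHECKPOINT_EVERY_SEC = 15 * 60   # evaluate and maybe save every 15 minutes
    TRAIN_BUDGET_SEC = 90 * 60       # stop after ~1.5 hours, per the advice above

    def train_with_checkpoints(model, train_step, evaluate):
        start = time.time()
        last_checkpoint = start
        best_loss = float('inf')
        while time.time() - start < TRAIN_BUDGET_SEC:
            train_step(model)                        # one minibatch / gradient update
            if time.time() - last_checkpoint >= CHECKPOINT_EVERY_SEC:
                val_loss = evaluate(model)           # hold-out or rollout-based metric
                if val_loss < best_loss:
                    best_loss = val_loss
                    model.save('best_model.h5')      # keep only the best snapshot
                last_checkpoint = time.time()
        return best_loss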
