ETA calculation is inaccurate #55

Open
3 tasks
felker opened this issue Jan 7, 2020 · 0 comments

Example of the current per-step (iteration) diagnostic output provided by FRNN around epoch 22 of the D3D 0D model (run on 4 V100 GPUs of Traverse):

[0] step: 0 [ETA: 468568011.02s] [0.00/1789], loss: 1.05701 [1.05701] | walltime: 5.7374 | 8.47E+02 Examples/sec | 6.04E-01 sec/batch [92.3% calc., 7.7% sync.][batch = 512 = 128*4] [lr = 7.30E-05 = 1.83E-05*4]

The ETA provided in this example is clearly inaccurate (each epoch takes around 60s). Specifically, there are two types of issues:

  1. The ETA computed in the first step of any epoch is always inaccurate.
  2. For later epochs within a session, the ETA increases nearly monotonically for many steps before starting to decrease nearly monotonically.

First step

For the first epoch in a given session, the first step reports a huge ETA because MPI_Model.num_so_far is zero, so a work_so_far of 0 is passed to:

def estimate_remaining_time(self, time_so_far, work_so_far, work_total):
    eps = 1e-6
    total_time = 1.0*time_so_far*work_total/(work_so_far + eps)
    return total_time - time_so_far

With work_so_far equal to 0, the denominator reduces to eps = 1e-6, causing total_time to explode.
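To get a feel for the scale of the blow-up, here is a minimal standalone sketch of the same formula (written as a plain function rather than a method of MPI_Model). The elapsed time of 0.25 s is a made-up value for illustration, and 1789 is assumed to be work_total because it is the denominator shown in the log line above.

```python
def estimate_remaining_time(time_so_far, work_so_far, work_total):
    # Same arithmetic as the method quoted above, without `self`.
    eps = 1e-6
    total_time = 1.0 * time_so_far * work_total / (work_so_far + eps)
    return total_time - time_so_far


# Hypothetical inputs: 0.25 s elapsed, no work recorded yet,
# 1789 units of work per epoch (assumed from the log line).
eta = estimate_remaining_time(0.25, 0.0, 1789)
print(f"ETA: {eta:.2e}s")  # on the order of 1e8 seconds
```

Any sub-second time_so_far divided by eps lands in the hundreds of millions of seconds, which is the same order of magnitude as the ETA in the log.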

  • Probably should just refuse to give an ETA for the first step (or steps) of the first epoch
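One way the suggestion above could look is sketched below. This is not the FRNN implementation: the min_work threshold, the clamping of negative values to zero, and the format_eta helper are all hypothetical additions for illustration.

```python
from typing import Optional


def estimate_remaining_time(time_so_far: float,
                            work_so_far: float,
                            work_total: float,
                            min_work: float = 1.0) -> Optional[float]:
    """Return the estimated seconds remaining, or None if too little work is done."""
    # Refuse to extrapolate from (almost) no progress; min_work is a
    # hypothetical threshold, not an existing FRNN parameter.
    if work_so_far < min_work:
        return None
    total_time = time_so_far * work_total / work_so_far
    # Clamp so that overshooting work_total never yields a negative ETA.
    return max(total_time - time_so_far, 0.0)


def format_eta(eta: Optional[float]) -> str:
    """Hypothetical formatter for the per-step diagnostic line."""
    return "ETA: n/a" if eta is None else f"ETA: {eta:.2f}s"


print(format_eta(estimate_remaining_time(0.25, 0.0, 1789)))    # first step: no estimate
print(format_eta(estimate_remaining_time(10.0, 500.0, 1000)))  # halfway through
```

Returning None instead of a number forces the caller to handle the "no estimate yet" case explicitly, and dropping eps is safe once the guard rules out a zero denominator.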

For the first step of later epochs within a session, it instead gives a minuscule ETA:

step: 0 [ETA: 0.55s] [1819.00/1789], loss: 0.98688 [0.98688] | walltime: 174.4240 | 8.93E+02 Examples/sec | 5.73E-01 sec/batch [96.1% calc., 3.9% sync.][batch = 512 = 128*4] [lr = 7.08E-05 = 1.77E-05*4]
  • I think an error was introduced when I changed the 0-based indexing of the epochs 1-2 months ago.

Later steps in later epochs

For example, here are the successive per-step ETAs reported during one later epoch:

ETA: 0.55s
ETA: 22.14
ETA: 27.98
ETA: 31.63
ETA: 35.88
ETA: 38.45
ETA: 34.89
ETA: 36.21
ETA: 35.35
ETA: 35.56
ETA: 36.04
ETA: 35.88
ETA: 35.33
ETA: 34.49
ETA: 34.73
ETA: 34.29
ETA: 34.13
ETA: 33.51
ETA: 33.16
…
ETA: 1.35s
ETA: 1.06s
ETA: 0.67s
ETA: 0.11s
ETA: -0.45
  • Consider using the measured runtimes of the previous epochs within this session to inform the ETA in later epochs.
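A minimal sketch of this idea follows, assuming epochs within a session take roughly the same wall time. The EpochETA class and its method names are hypothetical and do not exist in FRNN. It falls back to the in-epoch extrapolation only when no epoch has been timed yet.

```python
from statistics import mean


class EpochETA:
    """Hypothetical helper: ETA from past epoch durations, with an in-epoch fallback."""

    def __init__(self):
        self.epoch_durations = []  # measured wall time of completed epochs, in seconds

    def record_epoch(self, duration):
        """Store the measured duration of an epoch that just finished."""
        self.epoch_durations.append(duration)

    def estimate(self, time_so_far, work_so_far, work_total):
        """Return the estimated seconds left in the current epoch, or None."""
        if self.epoch_durations:
            # Prefer the average of the epochs already timed in this session.
            expected_total = mean(self.epoch_durations)
        elif work_so_far > 0:
            # First epoch: extrapolate from progress within the epoch.
            expected_total = time_so_far * work_total / work_so_far
        else:
            return None  # nothing to base an estimate on yet
        return max(expected_total - time_so_far, 0.0)


tracker = EpochETA()
first = tracker.estimate(0.5, 0.0, 1789)   # no history, no progress: None
tracker.record_epoch(60.0)
tracker.record_epoch(62.0)
later = tracker.estimate(1.0, 0.0, 1789)   # mean of 61.0 s minus 1.0 s elapsed
print(first, later)
```

With this approach the estimate in later epochs starts near the measured epoch duration and decreases steadily, rather than rising for many steps first. A weighted blend of the historical mean and the in-epoch extrapolation would be a natural refinement if epoch durations drift.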