
training slower and slower #33

Closed
Sander-houqi opened this issue Nov 4, 2021 · 1 comment

Comments


Sander-houqi commented Nov 4, 2021

Hi,
thanks for your great work!
I have a problem with training. When I train on two 32 GB V100s, the step time becomes much slower after a number of steps. At normal speed, multiprocessing spawn has two processes, but after a while one of them gets killed, and I can't find out why. Its state in ps -ef is Sl+, and both CPU and GPU memory are sufficient. I tried decreasing batch_size to 256, but that didn't solve it.
The slow call is the model forward pass:
output, x_norm = model(input, target)
How should I deal with this problem?

(screenshot attached)

Sander-houqi (Author) commented:

I solved it by following pytorch/pytorch#1355:

ulimit -n 500000

and setting num_workers=8 instead of 1.
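For anyone hitting the same slowdown, here is a minimal sketch of those two fixes in code form. The dataset shapes and the use of the resource module are my own stand-ins, not from this repo; the limit value, batch size, and worker count are the ones quoted above.

import resource

import torch
from torch.utils.data import DataLoader, TensorDataset

# Raise the soft open-file limit for this process (the in-process
# equivalent of `ulimit -n 500000`); the hard limit caps what an
# unprivileged process may request.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
target = 500000 if hard == resource.RLIM_INFINITY else min(500000, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))

# Stand-in dataset; the real training data comes from the project.
dataset = TensorDataset(torch.randn(1024, 3, 112, 112),
                        torch.randint(0, 1000, (1024,)))

# num_workers=8 instead of 1 keeps the GPUs fed; each worker holds open
# file descriptors (pipes, shared-memory files), which is why the raised
# limit above matters.
loader = DataLoader(dataset, batch_size=256, num_workers=8, pin_memory=True)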
