Model saving failed after the first epoch #485

wolfshow · 2018-01-04T10:57:14Z

I trained a seq2seq model with 45 million pairs on 4 GPUs. The model trained successfully for one epoch but crashed during model saving. I would like to know why.

guillaumekln · 2018-01-04T13:12:11Z

Hello,

Can you share the training logs or at least the error message?

jsenellart · 2018-01-27T09:35:27Z

Hello @wolfshow - did this happen again? Can you share the error message? Were you training async or sync on the 4 GPUs?

jsenellart · 2018-01-27T09:37:05Z

In general, you should consider sampling when dealing with such large input data: one "epoch" will then be a subset of your complete dataset (http://opennmt.net/OpenNMT/training/sampling/), so it will have a smaller memory footprint and you won't risk losing days of computation.

(Of course, it should never crash, but we need more information here to help.)

wolfshow · 2018-01-30T06:40:35Z

Thanks @jsenellart! Which sampling method do you recommend?

jsenellart · 2018-01-30T06:42:36Z

Use file sampling, -gsample N, and don't even bother putting all your files together. As a second step, check the -sample_dist option to define sampling rules over your collection of training files.
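
To make the idea concrete, here is a minimal Python sketch of what sampling an "epoch" from several files could look like. It is only an illustration of the concept, not OpenNMT's implementation: the function name, its arguments, the file names, and the choice to sample with replacement are all assumptions made for this example.

```python
import random


def sample_epoch(file_sizes, weights, n, seed=0):
    """Draw n (file_name, line_index) pairs for one sampled "epoch".

    Hypothetical helper, not part of OpenNMT.
    file_sizes: dict mapping file name -> number of sentence pairs in it
    weights:    dict mapping file name -> relative sampling weight
    n:          number of sentence pairs to draw for this epoch
    """
    rng = random.Random(seed)
    names = list(file_sizes)
    # Choose which file each example comes from, in proportion to its weight.
    picks = rng.choices(names, weights=[weights[k] for k in names], k=n)
    # Choose a random line inside the selected file (with replacement).
    return [(name, rng.randrange(file_sizes[name])) for name in picks]


# Hypothetical corpus: two training files of very different sizes.
sizes = {"news.txt": 40_000_000, "subtitles.txt": 5_000_000}
rules = {"news.txt": 0.7, "subtitles.txt": 0.3}
epoch = sample_epoch(sizes, rules, n=1000)
```

Each epoch then touches only n sentence pairs instead of the whole corpus, and a different seed per epoch gives a different subset each time.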
