This repository has been archived by the owner on Jun 10, 2021. It is now read-only.

Model saving failed after the first epoch #485

Open
wolfshow opened this issue Jan 4, 2018 · 5 comments


@wolfshow

wolfshow commented Jan 4, 2018

I trained a seq2seq model with 45 million pairs on 4 GPUs. The model trained successfully for one epoch but crashed while saving the model. I would like to know why.

@guillaumekln
Collaborator

Hello,

Can you share the training logs or at least the error message?

@jsenellart
Contributor

Hello @wolfshow, did this happen again? Can you share the error message? Were you training in async or sync mode on the 4 GPUs?

@jsenellart
Contributor

In general, you should consider sampling when dealing with such a large dataset: one "epoch" will then be a subset of your complete dataset (http://opennmt.net/OpenNMT/training/sampling/), so it will have a smaller memory footprint and you won't risk losing days of computation.

(Of course, it should never crash, but we need more information here to help.)
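To illustrate the idea of an "epoch" being a subset of the full data, here is a minimal Python sketch. It is not OpenNMT's implementation; the function name `sample_epoch_indices` and the sample size are hypothetical, and only the 45 million figure comes from this thread.

```python
import random


def sample_epoch_indices(dataset_size, sample_size, seed):
    """Return the indices of the sentence pairs used for one sampled epoch.

    Indices are drawn uniformly without replacement. Passing a different
    seed per epoch gives each epoch a different subset.
    """
    rng = random.Random(seed)
    # Never ask for more examples than the dataset contains.
    sample_size = min(sample_size, dataset_size)
    return rng.sample(range(dataset_size), sample_size)


# Hypothetical sizes: 45 million pairs in total, 200,000 pairs per epoch.
dataset_size = 45_000_000
sample_size = 200_000

for epoch in range(1, 4):
    indices = sample_epoch_indices(dataset_size, sample_size, seed=epoch)
    print(f"epoch {epoch}: {len(indices)} of {dataset_size} pairs")
```

Because each sampled epoch is much shorter than a pass over everything, a model is saved more often and a crash at save time costs far less work.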

@wolfshow
Author

Thanks @jsenellart! Which sampling method do you recommend?

@jsenellart
Contributor

Use file sampling, -gsample N, and don't even bother to put all your files together. As a second step, check the -sample_dist option to give sampling rules for your collection of training files.
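As a rough picture of what sampling across a collection of files with per-file rules can look like, here is a Python sketch that splits a per-epoch budget across files in proportion to weights. This is an assumption for illustration only: it does not reproduce OpenNMT's rule format or algorithm, and the file names, sizes, weights, and function names are all hypothetical.

```python
import random


def allocate_sample(file_sizes, weights, total):
    """Split a budget of `total` examples across files by relative weight.

    Each file's share is rounded down and capped at the file's size, so the
    quotas may add up to slightly less than `total`.
    """
    weight_sum = sum(weights[name] for name in file_sizes)
    quota = {}
    for name, size in file_sizes.items():
        share = int(total * weights[name] / weight_sum)
        quota[name] = min(share, size)
    return quota


def draw_epoch(file_sizes, weights, total, seed):
    """Pick, for each file, which line numbers to read for this epoch."""
    rng = random.Random(seed)
    quota = allocate_sample(file_sizes, weights, total)
    return {
        name: sorted(rng.sample(range(file_sizes[name]), count))
        for name, count in quota.items()
    }


# Hypothetical corpus: three separate files, never concatenated.
file_sizes = {
    "news.txt": 30_000_000,
    "subtitles.txt": 14_000_000,
    "in_domain.txt": 1_000_000,
}
# Hypothetical weights: news gets twice the share of each other file.
weights = {"news.txt": 2, "subtitles.txt": 1, "in_domain.txt": 1}

quota = allocate_sample(file_sizes, weights, total=2_000_000)
print(quota)
```

Keeping the files separate means a small in-domain file can be given a larger share than its raw size would give it in a merged corpus.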
