Hi @n-Guard! We managed to reproduce this, but it is not yet clear why it happens. In the meantime, you could try setting Python's multiprocessing start method to spawn at the very beginning of your script.
What it does: it makes Python use spawn instead of fork when creating a new process, so the state of locks, queues, etc. is not copied into the child processes.
We are not 100% sure it will help.
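For anyone landing here without the original snippet: a minimal sketch of the workaround described above, assuming it boils down to the standard library's `multiprocessing.set_start_method`. The helper name `use_spawn_start_method` is made up for illustration and is not part of ClearML or TensorFlow.

```python
import multiprocessing as mp


def use_spawn_start_method() -> str:
    """Switch multiprocessing to 'spawn' and return the active start method.

    Hypothetical helper for illustration; only the stdlib calls are real.
    """
    # force=True avoids a RuntimeError if a start method was already set.
    mp.set_start_method("spawn", force=True)
    return mp.get_start_method()


if __name__ == "__main__":
    # Call this first, before initialising ClearML, building the model,
    # or starting any worker processes.
    use_spawn_start_method()
    # ... Task.init(...), model definition and model.fit(...) would follow here
```

Note that with spawn the child processes re-import the main module, so the training code needs to sit under the `if __name__ == "__main__":` guard as shown.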
Describe the bug
I'm using Keras/TensorFlow, and the training stalls indefinitely after some epochs when I enable multiprocessing.
It happened only when I used LSTM or TimeDistributed layers. Dense and Conv layers alone don't seem to have this problem.
Without ClearML everything works fine.
To reproduce
Start a training run with TensorFlow and multiprocessing enabled.
Choose a model with LSTM and/or TimeDistributed layers.
I provided a script; the bug mostly occurs within the first 100 epochs:
https://gist.github.com/n-Guard/0f5d568cfedb3a22bfa56785e82961ad
Expected behaviour
The training should continue without getting stuck.
Environment
clearml-agent==1.7.0
clearml==1.14.4
1.14.1
2.15.0
3.11