Training gets stuck after some epochs when using Tensorflow with multiprocessing #1230

Open
n-Guard opened this issue Mar 13, 2024 · 1 comment
Labels
bug Something isn't working

Comments

n-Guard commented Mar 13, 2024

Describe the bug

I'm using Keras/TensorFlow, and training stalls indefinitely after some epochs when I enable multiprocessing.
It happened only when I used LSTM or TimeDistributed layers. Dense and Conv layers alone don't seem to have this problem.
Without ClearML, everything works fine.

To reproduce

Start a training with Tensorflow and multiprocessing enabled.
Choose a model with LSTM and/or TimeDistributed layers.

I have provided a script. The bug mostly happens within the first 100 epochs:
https://gist.github.com/n-Guard/0f5d568cfedb3a22bfa56785e82961ad

Expected behaviour

The training should continue without getting stuck.

Environment

  • Server type: self-hosted
  • ClearML SDK Version: clearml-agent==1.7.0 clearml==1.14.4
  • ClearML Server Version: 1.14.1
  • TensorFlow Version: 2.15.0
  • Python Version: 3.11
  • OS: Linux
n-Guard added the bug label Mar 13, 2024
eugen-ajechiloae-clearml (Collaborator) commented

Hi @n-Guard! We managed to reproduce this, but it is not yet clear why it happens. In the meantime, you could try running the following snippet at the very beginning of your script:

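try:
    import multiprocessing
    # Start child processes with "spawn" instead of the default "fork".
    multiprocessing.set_start_method("spawn")
except Exception:
    # Best effort: ignore the error (e.g. the start method was already set).
    pass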

What it does: it makes Python use spawn instead of fork when creating a new process, so the state of locks, queues, etc. is not copied into the child processes.
I'm not 100% sure it will help.
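For anyone trying this workaround, here is a minimal sketch of how the call might sit in a training script. Two things are worth knowing: `set_start_method` raises `RuntimeError` if a start method was already fixed earlier in the process, and with spawn the child processes re-import the main module, so the entry point should be guarded with `if __name__ == "__main__":`. The names `use_spawn` and `main` are hypothetical, and the body of `main` is only a placeholder for the actual ClearML and Keras setup.

```python
import multiprocessing as mp


def use_spawn() -> str:
    """Ask for the "spawn" start method; return the method actually in effect."""
    try:
        mp.set_start_method("spawn")
    except RuntimeError:
        # A start method was already fixed earlier in this process.
        pass
    return mp.get_start_method()


def main() -> None:
    # Call this first, before any worker processes are created.
    method = use_spawn()
    print("start method in effect:", method)
    # Placeholder (assumption): the ClearML task setup, model building and
    # training call from the original script would follow here.


if __name__ == "__main__":
    # With "spawn", children re-import this module, so the entry point is
    # guarded to avoid re-running the training code in every child.
    main()
```

Checking the return value is a quick way to confirm whether spawn is in effect or whether something else set a different start method first.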
