Training error due to killed background workers #2180

Open
jonathanjlau-hku opened this issue May 13, 2024 · 2 comments

Hi Dev Team,

After running !CUDA_VISIBLE_DEVICES=0 nnUNetv2_train 1 2d 0 -p nnUNetResEncUNetPlans_40G --npz, we keep running into the following warning from os.fork() about an impending deadlock. This is followed by a runtime error citing a killed background worker.

Would you be able to advise on how we could fix this issue?

Thank you!

(Environment: Colab on a GCP VM; CPU memory usage never exceeds 25%)

*************************** Console output: ********************

Using device: cuda:0

#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################

2024-05-13 02:59:22.003947: do_dummy_2d_data_aug: False
2024-05-13 02:59:37.151964: Using splits from existing split file: /.../splits_final.json
2024-05-13 02:59:37.172669: The split file contains 5 splits.
2024-05-13 02:59:37.176580: Desired fold for training: 0
2024-05-13 02:59:37.178374: This split has 1291 training and 323 validation cases.
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
self.pid = os.fork()
using pin_memory on device 0

Exception in thread Thread-1 (results_loop):
Traceback (most recent call last):
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
raise e
File "/usr/local/lib/python3.10/dist-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

AliJ9 commented May 13, 2024

I'm encountering a similar issue when attempting to execute the new residual encoder with the following command:

nnUNetv2_train Dataset001 3d_fullres 0 -p nnUNetResEncUNetLPlans -num_gpus 4

I have four GPUs. Initially, I attempted:

nnUNetv2_train Dataset001 3d_fullres_bs8 0 -p nnUNetResEncUNetPlans_32G -num_gpus 4

Each GPU has 32 GB of memory. However, with both commands I encountered errors like:

RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
torch._dynamo.exc.BackendCompilerFailed: backend='compile_fn' raised:
RuntimeError: Cannot find a working triton installation. More information on installing Triton can be found at https://github.com/openai/triton

While executing %submod_0 : [num_users=4] = call_module[target=submod_0](args = (%l_x_,), kwargs = {})
Original traceback:
None

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
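
For reference, a minimal sketch of the eager-mode fallback that this error message itself suggests. It only sidesteps the missing-Triton/compile failure, it does not fix the killed background workers, and it assumes you launch training from Python in the same process rather than through the CLI entry point:

```python
# Sketch of the fallback suggested by the PyTorch error message above:
# suppress TorchDynamo backend errors so torch.compile silently falls back
# to eager execution. This must run in the same process that does the
# training, so it only helps when nnU-Net is started from Python code.
import torch._dynamo

torch._dynamo.config.suppress_errors = True
```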

Lars-Kraemer (Member) commented

Hey @jonathanjlau-hku,

Your error seems to be JAX related. Unfortunately, we cannot provide support for this here. Can you perhaps run nnUNet in a setup without JAX?

Best,
Lars
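
For completeness, a minimal sketch (assuming a Colab-style notebook and that nothing else in it needs JAX) of how one might check whether JAX is present and remove it before launching nnU-Net, since the os.fork() warning above is emitted because JAX's threads are already running:

```python
# Sketch: check whether JAX is importable and, if so, optionally uninstall it.
# "jax" and "jaxlib" are the standard PyPI distribution names; whether removing
# them is safe depends on what else the notebook relies on.
import importlib.util
import subprocess
import sys

if importlib.util.find_spec("jax") is not None:
    print("JAX is installed; its background threads make os.fork()-based workers risky.")
    subprocess.run(
        [sys.executable, "-m", "pip", "uninstall", "-y", "jax", "jaxlib"],
        check=False,
    )
else:
    print("JAX not found; the fork warning likely has another source.")
```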
