Training error due to killed background workers #2180

Open
jonathanjlau-hku opened this issue May 13, 2024 · 2 comments

Hi Dev Team,

After running !CUDA_VISIBLE_DEVICES=0 nnUNetv2_train 1 2d 0 -p nnUNetResEncUNetPlans_40G --npz, we keep running into the following warning from os.fork() about an impending deadlock. This is followed by a runtime error citing a killed background worker.

Would you be able to advise on how we could fix this issue?

Thank you!

(Environment: Colab on a GCP VM; CPU memory usage never exceeds 25%)

*************************** Console output: ********************

Using device: cuda:0

#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################

2024-05-13 02:59:22.003947: do_dummy_2d_data_aug: False
2024-05-13 02:59:37.151964: Using splits from existing split file: /.../splits_final.json
2024-05-13 02:59:37.172669: The split file contains 5 splits.
2024-05-13 02:59:37.176580: Desired fold for training: 0
2024-05-13 02:59:37.178374: This split has 1291 training and 323 validation cases.
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
self.pid = os.fork()
using pin_memory on device 0

Exception in thread Thread-1 (results_loop):
Traceback (most recent call last):
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
raise e
File "/usr/local/lib/python3.10/dist-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

AliJ9 commented May 13, 2024

I'm encountering a similar issue when attempting to execute the new residual encoder with the following command:

nnUNetv2_train Dataset001 3d_fullres 0 -p nnUNetResEncUNetLPlans -num_gpus 4

I have four GPUs. Initially, I attempted:

nnUNetv2_train Dataset001 3d_fullres_bs8 0 -p nnUNetResEncUNetPlans_32G -num_gpus 4

Each GPU has 32 GB of memory. However, with both commands I encountered errors like:

RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
torch._dynamo.exc.BackendCompilerFailed: backend='compile_fn' raised:
RuntimeError: Cannot find a working triton installation. More information on installing Triton can be found at https://github.com/openai/triton

While executing %submod_0 : [num_users=4] = call_module[target=submod_0](args = (%l_x_,), kwargs = {})
Original traceback:
None

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
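
For reference, a minimal sketch of the eager-mode fallback that this error message itself suggests. It only sidesteps the missing-Triton/compile failure, it does not fix the killed background workers, and it assumes you launch training from Python in the same process rather than through the CLI entry point:

```python
# Sketch of the fallback suggested by the PyTorch error message above:
# suppress TorchDynamo backend errors so torch.compile silently falls back
# to eager execution. This must run in the same process that does the
# training, so it only helps when nnU-Net is started from Python code.
import torch._dynamo

torch._dynamo.config.suppress_errors = True
```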

Lars-Kraemer (Member) commented

Hey @jonathanjlau-hku,

Your error seems to be JAX related. Unfortunately, we cannot provide support for this here. Can you perhaps run nnUNet in a setup without JAX?

Best,
Lars
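
For completeness, a minimal sketch (assuming a Colab-style notebook and that nothing else in it needs JAX) of how one might check whether JAX is present and remove it before launching nnU-Net, since the os.fork() warning above is emitted because JAX's threads are already running:

```python
# Sketch: check whether JAX is importable and, if so, optionally uninstall it.
# "jax" and "jaxlib" are the standard PyPI distribution names; whether removing
# them is safe depends on what else the notebook relies on.
import importlib.util
import subprocess
import sys

if importlib.util.find_spec("jax") is not None:
    print("JAX is installed; its background threads make os.fork()-based workers risky.")
    subprocess.run(
        [sys.executable, "-m", "pip", "uninstall", "-y", "jax", "jaxlib"],
        check=False,
    )
else:
    print("JAX not found; the fork warning likely has another source.")
```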
