
possible deadlock in dataloader #1355

Closed
zym1010 opened this issue Apr 25, 2017 · 213 comments

Comments

@zym1010 (Contributor) commented Apr 25, 2017

The bug is described at pytorch/examples#148. I just wonder if this is a bug in PyTorch itself, as the example code looks clean to me. Also, I wonder whether it is related to #1120.

@apaszke (Contributor) commented Apr 25, 2017

How much free memory do you have when the loader stops?

@zym1010 (Contributor, Author) commented Apr 25, 2017

@apaszke If I check top, the remaining memory (with cached memory counted as used) is usually around 2 GB. If you don't count cached memory as used, there is always plenty free, say 30 GB+.

@zym1010 (Contributor, Author) commented Apr 25, 2017

Also, I don't understand why it always stops at the beginning of validation, and nowhere else.

@ngimel (Collaborator) commented Apr 25, 2017

Possibly because for validation a separate loader is used that pushes the use of shared memory over the limit.

@zym1010 (Contributor, Author) commented Apr 25, 2017

@ngimel

I just ran the program again and it got stuck.

Output of top:

top - 17:51:18 up 2 days, 21:05,  2 users,  load average: 0.49, 3.00, 5.41
Tasks: 357 total,   2 running, 355 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.9 us,  0.1 sy,  0.7 ni, 97.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  65863816 total, 60115084 used,  5748732 free,  1372688 buffers
KiB Swap:  5917692 total,      620 used,  5917072 free. 51154784 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3067 aalreja   20   0  143332 101816  21300 R  46.1  0.2   1631:44 Xvnc
16613 aalreja   30  10   32836   4880   3912 S  16.9  0.0   1:06.92 fiberlamp
 3221 aalreja   20   0 8882348 1.017g 110120 S   1.3  1.6 579:06.87 MATLAB
 1285 root      20   0 1404848  48252  25580 S   0.3  0.1   6:00.12 dockerd
16597 yimengz+  20   0   25084   3252   2572 R   0.3  0.0   0:04.56 top
    1 root      20   0   33616   4008   2624 S   0.0  0.0   0:01.43 init

Output of free

yimengzh_everyday@yimengzh:~$ free
             total       used       free     shared    buffers     cached
Mem:      65863816   60122060    5741756    9954628    1372688   51154916
-/+ buffers/cache:    7594456   58269360
Swap:      5917692        620    5917072

Output of nvidia-smi

yimengzh_everyday@yimengzh:~$ nvidia-smi
Tue Apr 25 17:52:38 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 0000:03:00.0     Off |                  N/A |
| 30%   42C    P8    14W / 250W |   3986MiB /  6082MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40c          Off  | 0000:81:00.0     Off |                  Off |
|  0%   46C    P0    57W / 235W |      0MiB / 12205MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     16509    C   python                                        3970MiB |
+-----------------------------------------------------------------------------+

I don't think it's a memory issue.

@apaszke (Contributor) commented Apr 25, 2017

There are separate limits for shared memory. Can you try ipcs -lm or cat /proc/sys/kernel/shmall and cat /proc/sys/kernel/shmmax? Also, does it deadlock if you use fewer workers (e.g. test with the extreme case of 1 worker)?
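A quick way to read the same limits from Python (Linux only; this just mirrors the cat commands above and is purely a convenience):

# Print the System V shared-memory limits discussed above by reading /proc,
# the same values that `ipcs -lm` reports.
for name in ("shmall", "shmmax", "shmmni"):
    with open("/proc/sys/kernel/" + name) as f:
        print(name, "=", f.read().strip())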

@zym1010 (Contributor, Author) commented Apr 25, 2017

@apaszke

yimengzh_everyday@yimengzh:~$ ipcs -lm

------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 18014398509465599
max total shared memory (kbytes) = 18446744073642442748
min seg size (bytes) = 1

yimengzh_everyday@yimengzh:~$ cat /proc/sys/kernel/shmall
18446744073692774399
yimengzh_everyday@yimengzh:~$ cat /proc/sys/kernel/shmmax
18446744073692774399

How do they look to you?

As for fewer workers, I believe the problem won't happen as often (I can try now), but in practice I need that many workers.

@apaszke (Contributor) commented Apr 25, 2017

You are allowed a maximum of 4096 shared memory segments; maybe that's the issue. You can try increasing it by writing to /proc/sys/kernel/shmmni (try 8192, for example). You may need superuser privileges.

@zym1010 (Contributor, Author) commented Apr 25, 2017

@apaszke Well, these are the default values on both Ubuntu and CentOS 6... Is that really an issue?

@zym1010 (Contributor, Author) commented Apr 25, 2017

@apaszke When running the training program, ipcs -a actually shows no shared memory being used. Is that expected?

@zym1010 (Contributor, Author) commented Apr 26, 2017

@apaszke I tried running the program (still 22 workers) with the following shared memory settings, and it got stuck again.

yimengzh_everyday@yimengzh:~$ ipcs -lm

------ Shared Memory Limits --------
max number of segments = 8192
max seg size (kbytes) = 18014398509465599
max total shared memory (kbytes) = 18446744073642442748
min seg size (bytes) = 1

I didn't try one worker: first, that would be slow; second, if the problem is really a deadlock, then it would definitely disappear.

@apaszke (Contributor) commented Apr 26, 2017

@zym1010 Default settings aren't necessarily chosen with such workloads in mind, so yes, it might have been an issue. ipcs is for System V shared memory, which we aren't using, but I wanted to make sure the same limits don't apply to POSIX shared memory.

It wouldn't necessarily disappear: if the problem is really there, it's likely a deadlock between a worker and the main process, and one worker might be enough to trigger it. Anyway, I can't fix the issue until I can reproduce it. What parameters are you using to run the example, and did you modify the code in any way? Also, what is the value of torch.__version__? Are you running in Docker?

@zym1010 (Contributor, Author) commented Apr 26, 2017

@apaszke Thanks. I understand your analysis much better now.

All the results shown so far were obtained on an Ubuntu 14.04 machine with 64 GB RAM, dual Xeon CPUs, and a Titan Black (there's also a K40, but I didn't use it).

The command to generate the problem is CUDA_VISIBLE_DEVICES=0 PYTHONUNBUFFERED=1 python main.py -a alexnet --print-freq 20 --lr 0.01 --workers 22 --batch-size 256 /mnt/temp_drive_3/cv_datasets/ILSVRC2015/Data/CLS-LOC. I didn't modify code at all.

I installed pytorch through pip, on Python 3.5. pytorch version is 0.1.11_5. Not running in Docker.

BTW, I also tried using 1 worker. But I did it on another machine (128GB RAM, dual Xeon, 4 Pascal Titan X, CentOS 6). I ran it using CUDA_VISIBLE_DEVICES=0 PYTHONUNBUFFERED=1 python main.py -a alexnet --print-freq 1 --lr 0.01 --workers 1 --batch-size 256 /ssd/cv_datasets/ILSVRC2015/Data/CLS-LOC, and the error log is as follows.

Epoch: [0][5003/5005]   Time 2.463 (2.955)      Data 2.414 (2.903)      Loss 5.9677 (6.6311)    Prec@1 3.516 (0.545)    Prec@5 8.594 (2.262)
Epoch: [0][5004/5005]   Time 1.977 (2.955)      Data 1.303 (2.903)      Loss 5.9529 (6.6310)    Prec@1 1.399 (0.545)    Prec@5 7.692 (2.262)
^CTraceback (most recent call last):
  File "main.py", line 292, in <module>
    main()
  File "main.py", line 137, in main
    prec1 = validate(val_loader, model, criterion)
  File "main.py", line 210, in validate
    for i, (input, target) in enumerate(val_loader):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 168, in __next__
    idx, batch = self.data_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/queue.py", line 164, in get
    self.not_empty.wait()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/threading.py", line 293, in wait
    waiter.acquire()

top showed the following when it was stuck with 1 worker.

top - 08:34:33 up 15 days, 20:03,  0 users,  load average: 0.37, 0.39, 0.36
Tasks: 894 total,   1 running, 892 sleeping,   0 stopped,   1 zombie
Cpu(s):  7.2%us,  2.8%sy,  0.0%ni, 89.7%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  132196824k total, 131461528k used,   735296k free,   347448k buffers
Swap:  2047996k total,    22656k used,  2025340k free, 125226796k cached

@zym1010 (Contributor, Author) commented Apr 26, 2017

Another thing I found: if I modify the training code so that it doesn't go through all batches, say only training 50 batches,

if i >= 50:
    break

then the deadlock seems to disappear.

@zym1010 (Contributor, Author) commented Apr 27, 2017

Further testing seems to suggest that the freeze happens much more frequently if I run the program just after rebooting the computer. Once the machine has built up some filesystem cache, the freeze seems to happen less often.

@apaszke (Contributor) commented May 3, 2017

I tried, but I can't reproduce this bug in any way.

@tiancheng-zhi

I met a similar issue: the data loader stops when it finishes an epoch and is about to start a new one.

@tiancheng-zhi

Setting num_workers = 0 works. But the program slows down.

@zym1010 (Contributor, Author) commented May 9, 2017

@apaszke Have you tried rebooting the computer first and then running the program? For me, that reliably triggers the freeze. I just tried version 0.12, and it's still the same.

One thing I'd like to point out is that I installed PyTorch using pip, as I have an OpenBLAS-linked numpy installed and the MKL build from @soumith's Anaconda cloud wouldn't play well with it.

So essentially PyTorch is using MKL and numpy is using OpenBLAS. This may not be ideal, but I think it should have nothing to do with the issue here.

@apaszke (Contributor) commented May 9, 2017

I looked into it, but I could never reproduce it. MKL/OpenBLAS should be unrelated to this problem. It's probably some problem with the system configuration.

@zym1010 (Contributor, Author) commented May 9, 2017

@apaszke Thanks. I just tried the Python from the official Anaconda repo and an MKL-based PyTorch. Still the same problem.

@zym1010 (Contributor, Author) commented May 10, 2017

Tried running the code in Docker. Still stuck.

@jsainio commented Jun 7, 2017

We have the same problem, running the pytorch/examples ImageNet training example (resnet18, 4 workers) inside nvidia-docker, using 1 GPU out of 4. I'll try to gather a gdb backtrace if I manage to attach to the process.

At least OpenBLAS is known to have a deadlock issue in matrix multiplication, which occurs relatively rarely: OpenMathLib/OpenBLAS#937. This bug was present at least in OpenBLAS packaged in numpy 1.12.0.

@zym1010 (Contributor, Author) commented Jun 7, 2017

@jsainio I also tried a pure MKL-based PyTorch (with numpy linked against MKL as well), and got the same problem.

Also, this problem is solved (at least for me) if I turn off pin_memory for the DataLoader.
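A minimal sketch of that workaround, with a stand-in dataset rather than the ImageNet loader used in this thread: keep several workers, but leave pinned memory off.

# Sketch only: the random tensors stand in for a real dataset.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(2048, 3, 32, 32),
                        torch.randint(0, 1000, (2048,)))
loader = DataLoader(dataset,
                    batch_size=256,
                    shuffle=True,
                    num_workers=8,
                    pin_memory=False)  # disabling pinned memory avoided the hang here

for images, targets in loader:
    pass  # training step would go here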

@jsainio commented Jun 9, 2017

It looks as if two of the workers die out.

During normal operation:

root@b06f896d5c1d:~/mnt# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
user+        1 33.2  4.7 91492324 3098288 ?    Ssl  10:51   1:10 python -m runne
user+       58 76.8  2.3 91079060 1547512 ?    Rl   10:54   1:03 python -m runne
user+       59 76.0  2.2 91006896 1484536 ?    Rl   10:54   1:02 python -m runne
user+       60 76.4  2.3 91099448 1559992 ?    Rl   10:54   1:02 python -m runne
user+       61 79.4  2.2 91008344 1465292 ?    Rl   10:54   1:05 python -m runne

after locking up:

root@b06f896d5c1d:~/mnt# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
user+        1 24.8  4.4 91509728 2919744 ?    Ssl  14:25  13:01 python -m runne
user+       58 51.7  0.0      0     0 ?        Z    14:27  26:20 [python] <defun
user+       59 52.1  0.0      0     0 ?        Z    14:27  26:34 [python] <defun
user+       60 52.0  2.4 91147008 1604628 ?    Sl   14:27  26:31 python -m runne
user+       61 52.0  2.3 91128424 1532088 ?    Sl   14:27  26:29 python -m runne

For one of the still-remaining workers, the beginning of the gdb stack trace looks like this:

root@b06f896d5c1d:~/mnt# gdb --pid 60
GNU gdb (GDB) 8.0
Attaching to process 60
[New LWP 65]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f36f52af827 in do_futex_wait.constprop ()
   from /lib/x86_64-linux-gnu/libpthread.so.0

(gdb) bt
#0  0x00007f36f52af827 in do_futex_wait.constprop ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f36f52af8d4 in __new_sem_wait_slow.constprop.0 ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f36f52af97a in sem_wait@@GLIBC_2.2.5 ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f36f157efb1 in semlock_acquire (self=0x7f3656296458,
    args=<optimized out>, kwds=<optimized out>)
    at /home/ilan/minonda/conda-bld/work/Python-3.5.2/Modules/_multiprocessing/semaphore.c:307
#4  0x00007f36f5579621 in PyCFunction_Call (func=
    <built-in method __enter__ of _multiprocessing.SemLock object at remote 0x7f3656296458>, args=(), kwds=<optimized out>) at Objects/methodobject.c:98
#5  0x00007f36f5600bd5 in call_function (oparg=<optimized out>,
    pp_stack=0x7f36c7ffbdb8) at Python/ceval.c:4705
#6  PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>)
    at Python/ceval.c:3236
#7  0x00007f36f5601b49 in _PyEval_EvalCodeWithName (_co=<optimized out>,
    globals=<optimized out>, locals=<optimized out>, args=<optimized out>,
    argcount=1, kws=0x0, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0,
    closure=0x0, name=0x0, qualname=0x0) at Python/ceval.c:4018
#8  0x00007f36f5601cd8 in PyEval_EvalCodeEx (_co=<optimized out>,
    globals=<optimized out>, locals=<optimized out>, args=<optimized out>,
    argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0,
    defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:4039
#9  0x00007f36f5557542 in function_call (
    func=<function at remote 0x7f36561c7d08>,
    arg=(<Lock(release=<built-in method release of _multiprocessing.SemLock object at remote 0x7f3656296458>, acquire=<built-in method acquire of _multiprocessing.SemLock object at remote 0x7f3656296458>, _semlock=<_multiprocessing.SemLock at remote 0x7f3656296458>) at remote 0x7f3656296438>,), kw=0x0)
    at Objects/funcobject.c:627
#10 0x00007f36f5524236 in PyObject_Call (
    func=<function at remote 0x7f36561c7d08>, arg=<optimized out>,
    kw=<optimized out>) at Objects/abstract.c:2165
#11 0x00007f36f554077c in method_call (
    func=<function at remote 0x7f36561c7d08>,
    arg=(<Lock(release=<built-in method release of _multiprocessing.SemLock object at remote 0x7f3656296458>, acquire=<built-in method acquire of _multiprocessing.SemLock object at remote 0x7f3656296458>, _semlock=<_multiprocessing.SemLock at remote 0x7f3656296458>) at remote 0x7f3656296438>,), kw=0x0)
    at Objects/classobject.c:330
#12 0x00007f36f5524236 in PyObject_Call (
    func=<method at remote 0x7f36556f9248>, arg=<optimized out>,
    kw=<optimized out>) at Objects/abstract.c:2165
#13 0x00007f36f55277d9 in PyObject_CallFunctionObjArgs (
    callable=<method at remote 0x7f36556f9248>) at Objects/abstract.c:2445
#14 0x00007f36f55fc3a9 in PyEval_EvalFrameEx (f=<optimized out>,
    throwflag=<optimized out>) at Python/ceval.c:3107
#15 0x00007f36f5601166 in fast_function (nk=<optimized out>, na=1,
    n=<optimized out>, pp_stack=0x7f36c7ffc418,
    func=<function at remote 0x7f36561c78c8>) at Python/ceval.c:4803
#16 call_function (oparg=<optimized out>, pp_stack=0x7f36c7ffc418)
    at Python/ceval.c:4730
#17 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>)
    at Python/ceval.c:3236
#18 0x00007f36f5601b49 in _PyEval_EvalCodeWithName (_co=<optimized out>,
    globals=<optimized out>, locals=<optimized out>, args=<optimized out>,
    argcount=4, kws=0x7f36f5b85060, kwcount=0, defs=0x0, defcount=0,
    kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at Python/ceval.c:4018
#19 0x00007f36f5601cd8 in PyEval_EvalCodeEx (_co=<optimized out>,
    globals=<optimized out>, locals=<optimized out>, args=<optimized out>,
    argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0,
    defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:4039
#20 0x00007f36f5557661 in function_call (
    func=<function at remote 0x7f36e14170d0>,
    arg=(<ImageFolder(class_to_idx={'n04153751': 783, 'n02051845': 144, 'n03461385': 582, 'n04350905': 834, 'n02105056': 224, 'n02112137': 260, 'n03938244': 721, 'n01739381': 59, 'n01797886': 82, 'n04286575': 818, 'n02113978': 268, 'n03998194': 741, 'n15075141': 999, 'n03594945': 609, 'n04099969': 765, 'n02002724': 128, 'n03131574': 520, 'n07697537': 934, 'n04380533': 846, 'n02114712': 271, 'n01631663': 27, 'n04259630': 808, 'n04326547': 825, 'n02480855': 366, 'n02099429': 206, 'n03590841': 607, 'n02497673': 383, 'n09332890': 975, 'n02643566': 396, 'n03658185': 623, 'n04090263': 764, 'n03404251': 568, 'n03627232': 616, 'n01534433': 13, 'n04476259': 868, 'n03495258': 594, 'n04579145': 901, 'n04266014': 812, 'n01665541': 34, 'n09472597': 980, 'n02095570': 189, 'n02089867': 166, 'n02009229': 131, 'n02094433': 187, 'n04154565': 784, 'n02107312': 237, 'n04372370': 844, 'n02489166': 376, 'n03482405': 588, 'n04040759': 753, 'n01774750': 76, 'n01614925': 22, 'n01855032': 98, 'n03903868': 708, 'n02422699': 352, 'n01560419': 1...(truncated), kw={}) at Objects/funcobject.c:627
#21 0x00007f36f5524236 in PyObject_Call (
    func=<function at remote 0x7f36e14170d0>, arg=<optimized out>,
    kw=<optimized out>) at Objects/abstract.c:2165
#22 0x00007f36f55fe234 in ext_do_call (nk=1444355432, na=0,
    flags=<optimized out>, pp_stack=0x7f36c7ffc768,
    func=<function at remote 0x7f36e14170d0>) at Python/ceval.c:5034
#23 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>)
    at Python/ceval.c:3275
--snip--

@M-Eng commented Jun 9, 2017

I had a similar error log, with the main process stuck on self.data_queue.get().
For me the problem was that I used OpenCV as the image loader, and cv2.imread was hanging indefinitely, without raising any error, on one particular ImageNet image ("n01630670/n01630670_1010.jpeg").

If it works for you with num_workers = 0, that is not your problem, but I thought it might help some people with a similar error trace.
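A minimal debugging sketch along these lines: iterate the dataset directly in the main process (equivalent to num_workers=0) so the sample that hangs, e.g. an image the decoder cannot handle, can be identified. The function name is made up for illustration.

def find_hanging_sample(dataset):
    """Walk a map-style dataset in the main process and print each index;
    the last index printed before the hang points at the offending sample."""
    for idx in range(len(dataset)):
        print(idx, flush=True)
        _ = dataset[idx]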

@jsainio commented Jun 9, 2017

I'm running a test with num_workers = 0 currently, no hangs yet. I'm running the example code from https://github.com/pytorch/examples/blob/master/imagenet/main.py. pytorch/vision ImageFolder seems to use PIL or pytorch/accimage internally to load the images, so there's no OpenCV involved.

With num_workers = 4, I can occasionally get the first epoch to train and validate fully, and then it locks up in the middle of the second epoch. So it is unlikely to be a problem in the dataset/loading function.

It looks like a race condition in the loader that might be triggered relatively rarely by a certain hardware/software combination.

@jsainio commented Jun 9, 2017

@zym1010 thanks for the pointer, I'll try setting pin_memory = False too for the DataLoader.

@RaymondJiangkw

For those who are still stuck even after applying all of the methods above: remember to call model.module.forward instead of model.forward when validating your model during training, since model is an instance of DistributedDataParallel, and calling the wrapper inside a torch.no_grad() context may cause problems.

@yianzhongguo commented Sep 27, 2021

I also encountered a similar problem. I simplified train.py so that it only contains:

data_loader = CreateDataLoader(opt)
dataset = data_loader.load_data()
for i, data in enumerate(dataset, start=epoch_iter):
    print(i)

But it still got stuck, with no other output; even the "i" could not be printed when I set num_workers > 0 (even num_workers = 1). So I think the issue is caused by torch.utils.data.DataLoader. The strange thing is that this train.py ran fine only two weeks ago, and I have not changed anything on my server or in my code since then. My OS is CentOS 7.9. PyTorch 1.8.0 is installed in a Python 3.8 virtual environment created with conda (anaconda3-2021.05-Linux-x86_64). I do not use cv2 in my code, and only a small part of the memory and shared memory is used. When I set num_workers = 0 it works, but it is too slow.

@csvance commented Nov 12, 2021

I have tried everything mentioned here:

  • Increase ulimit
  • Increase SHM size
  • cv2 thread counts (this doesn't matter though because I am using spawn for workers when I do use them)
  • 0 workers for DataLoaders
  • Sleep between epochs / before evaluation

Still, for an object detection problem I am working on, I get a training deadlock after the first epoch 100% of the time when using DDP: the training process gets stuck waiting on the DataLoader. If I don't use DDP there is no deadlock; that is the only thing that fixes the issue for me. This happens with both PyTorch 1.10.0 / CUDA 11.3 and PyTorch 1.8.1 / CUDA 10.2.

Essentially, at the start of training there are 3 processes when doing DDP with 0 workers and 1 GPU. When the hang happens, the main training process gets stuck iterating over the DataLoader and drops to 0% CPU usage, while the other two processes stay at 100% CPU. With two GPUs we start with 4 processes: one of the training processes hangs, the other keeps using 100% CPU/GPU, and the remaining two processes use 100% CPU. Unfortunately, I have not been able to get a stack trace for any of the other processes.

Non-DDP training works flawlessly.

@leeeizhang

Not fixed yet...

@diaoenmao commented Dec 23, 2021

I recently came across a situation where I need to load many small images. My workstation has a CPU with 22 cores and four GPUs, so I run four experiments with different random seeds, each experiment using one separate GPU. I found that the run time of four processes is almost four times the run time of a single process (no parallel benefit).

The model I train is relatively small, and the most time-consuming part actually comes from data loading. I have tried many different approaches, including:

  • pin_memory = False/True
  • num_workers = 0/1/8
  • Increasing ulimit
  • Staggering the start of each experiment

Thanks to the system-level diagnosis by @vjorlikowski, we found out that whether we set num_workers to 0, 1, or 8, each process will try to use all CPU cores and viciously compete with the others for them.

Solution:
Use export OMP_NUM_THREADS=N or call torch.set_num_threads(N).
We set num_workers = 0 and N = 5 in our case, as we have 22 cores. The estimated run time of my program dropped from 12 days to 1.5 days.
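A minimal sketch of that fix: cap the intra-op thread count before data loading or training starts. N = 5 is the value used in the setup above; exporting OMP_NUM_THREADS=5 in the environment has the same intent.

import torch

torch.set_num_threads(5)  # limit this process to 5 CPU threads
# ... build the DataLoaders and start training after this point ...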

@The1912 commented May 2, 2022

(quoting @diaoenmao's solution above)

Thank you, I fixed it with torch.set_num_threads(N).

@opeide commented May 25, 2022

For me the issue was apparently in my training augmentations. In albumentations there are some augmentations that can loop forever, like RandomFog. I was only able to see where the code froze when I set num_workers=0.

@namespace-Pt

Any progress? I came across the same issue when using DDP.

Everything works fine without DDP. However, when I create a DataLoader for validation only on rank 0, that loader freezes if num_workers > 0.

@pcicales

@namespace-Pt I have experienced a similar issue, except with pin_memory=True when accumulating evaluation results. I have not yet tried @dem123456789's solution, but it seems to work for others. It would be great to have some official guidance on this issue from the devs, though.

@namespace-Pt

@pcicales Thanks, but it does not work for me (torch==1.10.1+cu111).

@lifangda01

The solution that worked for me is

import multiprocessing as mp

if __name__ == "__main__":
    mp.set_start_method('spawn')
    main()

After this I can use dataloader with num_workers > 0 and pin_memory = True without any problem.

@czy97 commented Aug 7, 2022

What about this: go to maskrcnn_benchmark/data/build.py and, at line 161, add "torch.multiprocessing.set_sharing_strategy('file_system')"

So the code will end like this:

    data_loaders = []
    
    torch.multiprocessing.set_sharing_strategy('file_system')
    for dataset in datasets:

That fixed it for me

This setting can solve the hanging problem, but it may cause some other error:

File "/mnt/lustre/pytorch1_12/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 444, in __iter__
    self._start_thread()
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
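For a generic script (outside maskrcnn_benchmark), the same workaround is a single call placed before any DataLoader is created; keep in mind the caveat above about possible "can't start new thread" errors.

import torch.multiprocessing

# Switch worker-to-main tensor transfer to the 'file_system' sharing strategy.
torch.multiprocessing.set_sharing_strategy('file_system')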

@thistlillo

I have this problem as of August 2022. The DataLoader freezes, mostly at random, whenever I use num_workers > 0.

@HazelHik commented Sep 6, 2022

Same problem with PyTorch 1.8 in Anaconda. The training gets stuck after finishing the first epoch.
The output looks like:
Epoch 2
0%| | 0/118680 [00:00<?, ?it/s]
and it hangs there for hours until I kill it.
It seems to happen randomly. I was able to reproduce this on a small dataset, but when I look into it using debug mode in PyCharm, it runs smoothly.

@ker2xu commented Sep 30, 2022

I would like to offer another solution for people who compile Python from source themselves: remember NOT to configure it with the --enable-profiling option. That option will prevent you from using num_workers > 0, and none of the solutions above will help.
Hope this reply helps.

@Hiroshiba

In my case, specifying persistent_workers=True as a DataLoader argument was all I needed.

I will write down the detailed conditions; I hope they are helpful to someone else.

  • I was using the latest official PyTorch Docker image.
  • The DataLoader hung after a random number of epochs.
  • Nothing was shown in the error log.
  • It worked correctly with num_workers=0.
  • I don't use OpenCV, but I do use wandb and tensorboard.
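A minimal sketch of that setting, with a stand-in dataset: persistent_workers=True keeps the worker processes alive across epochs instead of tearing them down and respawning them at each epoch boundary.

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(1000).float())
loader = DataLoader(dataset,
                    batch_size=32,
                    num_workers=4,
                    persistent_workers=True)  # requires num_workers > 0

for epoch in range(3):
    for (batch,) in loader:
        pass  # training step would go here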
