
possible deadlock in dataloader #1355

Closed
zym1010 opened this issue Apr 25, 2017 · 213 comments

Comments

@zym1010 (Contributor) commented Apr 25, 2017

The bug is described at pytorch/examples#148. I just wonder if this is a bug in PyTorch itself, as the example code looks clean to me. Also, I wonder whether it is related to #1120.

@apaszke (Contributor) commented Apr 25, 2017

How much free memory do you have when the loader stops?

@zym1010 (Contributor, Author) commented Apr 25, 2017

@apaszke If I check top, the remaining memory (with cached memory counted as used) is usually around 2 GB. If you don't count cached memory as used, there is always plenty free, say 30 GB+.

@zym1010 (Contributor, Author) commented Apr 25, 2017

Also, I don't understand why it always stops at the beginning of validation, and nowhere else.

@ngimel (Collaborator) commented Apr 25, 2017

Possibly because for validation a separate loader is used that pushes the use of shared memory over the limit.

@zym1010 (Contributor, Author) commented Apr 25, 2017

@ngimel

I just ran the program again and it got stuck.

Output of top:

top - 17:51:18 up 2 days, 21:05,  2 users,  load average: 0.49, 3.00, 5.41
Tasks: 357 total,   2 running, 355 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.9 us,  0.1 sy,  0.7 ni, 97.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  65863816 total, 60115084 used,  5748732 free,  1372688 buffers
KiB Swap:  5917692 total,      620 used,  5917072 free. 51154784 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3067 aalreja   20   0  143332 101816  21300 R  46.1  0.2   1631:44 Xvnc
16613 aalreja   30  10   32836   4880   3912 S  16.9  0.0   1:06.92 fiberlamp
 3221 aalreja   20   0 8882348 1.017g 110120 S   1.3  1.6 579:06.87 MATLAB
 1285 root      20   0 1404848  48252  25580 S   0.3  0.1   6:00.12 dockerd
16597 yimengz+  20   0   25084   3252   2572 R   0.3  0.0   0:04.56 top
    1 root      20   0   33616   4008   2624 S   0.0  0.0   0:01.43 init

Output of free

yimengzh_everyday@yimengzh:~$ free
             total       used       free     shared    buffers     cached
Mem:      65863816   60122060    5741756    9954628    1372688   51154916
-/+ buffers/cache:    7594456   58269360
Swap:      5917692        620    5917072

Output of nvidia-smi

yimengzh_everyday@yimengzh:~$ nvidia-smi
Tue Apr 25 17:52:38 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 0000:03:00.0     Off |                  N/A |
| 30%   42C    P8    14W / 250W |   3986MiB /  6082MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40c          Off  | 0000:81:00.0     Off |                  Off |
|  0%   46C    P0    57W / 235W |      0MiB / 12205MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     16509    C   python                                        3970MiB |
+-----------------------------------------------------------------------------+

I don't think it's a memory issue.

@apaszke (Contributor) commented Apr 25, 2017

There are separate limits for shared memory. Can you try ipcs -lm or cat /proc/sys/kernel/shmall and cat /proc/sys/kernel/shmmax? Also, does it deadlock if you use fewer workers (e.g. test with the extreme case of 1 worker)?
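A quick way to read the same limits from Python (Linux only; this just mirrors the cat commands above and is purely a convenience):

# Print the System V shared-memory limits discussed above by reading /proc,
# the same values that `ipcs -lm` reports.
for name in ("shmall", "shmmax", "shmmni"):
    with open("/proc/sys/kernel/" + name) as f:
        print(name, "=", f.read().strip())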

@zym1010 (Contributor, Author) commented Apr 25, 2017

@apaszke

yimengzh_everyday@yimengzh:~$ ipcs -lm

------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 18014398509465599
max total shared memory (kbytes) = 18446744073642442748
min seg size (bytes) = 1

yimengzh_everyday@yimengzh:~$ cat /proc/sys/kernel/shmall
18446744073692774399
yimengzh_everyday@yimengzh:~$ cat /proc/sys/kernel/shmmax
18446744073692774399

How do they look to you?

As for fewer workers, I believe the problem won't happen as often (I can try now), but in practice I need that many workers.

@apaszke (Contributor) commented Apr 25, 2017

You are allowed a maximum of 4096 shared memory segments; maybe that's the issue. You can try increasing it by writing to /proc/sys/kernel/shmmni (try 8192, for example). You may need superuser privileges.

@zym1010 (Contributor, Author) commented Apr 25, 2017

@apaszke Well, these are the default values on both Ubuntu and CentOS 6... Is that really an issue?

@zym1010 (Contributor, Author) commented Apr 25, 2017

@apaszke When running the training program, ipcs -a actually shows no shared memory being used. Is that expected?

@zym1010 (Contributor, Author) commented Apr 26, 2017

@apaszke I tried running the program (still 22 workers) with the following shared memory settings, and it got stuck again.

yimengzh_everyday@yimengzh:~$ ipcs -lm

------ Shared Memory Limits --------
max number of segments = 8192
max seg size (kbytes) = 18014398509465599
max total shared memory (kbytes) = 18446744073642442748
min seg size (bytes) = 1

I didn't try one worker: first, that would be slow; second, if the problem is really a deadlock, then it would definitely disappear.

@apaszke (Contributor) commented Apr 26, 2017

@zym1010 Default settings aren't necessarily chosen with such workloads in mind, so yes, it might have been an issue. ipcs is for System V shared memory, which we aren't using, but I wanted to make sure the same limits don't apply to POSIX shared memory.

It wouldn't necessarily disappear: if the problem is really there, it's likely a deadlock between a worker and the main process, and one worker might be enough to trigger it. Anyway, I can't fix the issue until I can reproduce it. What parameters are you using to run the example, and did you modify the code in any way? Also, what is the value of torch.__version__? Are you running in Docker?

@zym1010 (Contributor, Author) commented Apr 26, 2017

@apaszke Thanks. I understand your analysis much better now.

All the results shown so far were obtained on an Ubuntu 14.04 machine with 64 GB RAM, dual Xeon CPUs, and a Titan Black (there's also a K40, but I didn't use it).

The command to generate the problem is CUDA_VISIBLE_DEVICES=0 PYTHONUNBUFFERED=1 python main.py -a alexnet --print-freq 20 --lr 0.01 --workers 22 --batch-size 256 /mnt/temp_drive_3/cv_datasets/ILSVRC2015/Data/CLS-LOC. I didn't modify code at all.

I installed pytorch through pip, on Python 3.5. pytorch version is 0.1.11_5. Not running in Docker.

BTW, I also tried using 1 worker. But I did it on another machine (128GB RAM, dual Xeon, 4 Pascal Titan X, CentOS 6). I ran it using CUDA_VISIBLE_DEVICES=0 PYTHONUNBUFFERED=1 python main.py -a alexnet --print-freq 1 --lr 0.01 --workers 1 --batch-size 256 /ssd/cv_datasets/ILSVRC2015/Data/CLS-LOC, and the error log is as follows.

Epoch: [0][5003/5005]   Time 2.463 (2.955)      Data 2.414 (2.903)      Loss 5.9677 (6.6311)    Prec@1 3.516 (0.545)    Prec@5 8.594 (2.262)
Epoch: [0][5004/5005]   Time 1.977 (2.955)      Data 1.303 (2.903)      Loss 5.9529 (6.6310)    Prec@1 1.399 (0.545)    Prec@5 7.692 (2.262)
^CTraceback (most recent call last):
  File "main.py", line 292, in <module>
    main()
  File "main.py", line 137, in main
    prec1 = validate(val_loader, model, criterion)
  File "main.py", line 210, in validate
    for i, (input, target) in enumerate(val_loader):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 168, in __next__
    idx, batch = self.data_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/queue.py", line 164, in get
    self.not_empty.wait()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/threading.py", line 293, in wait
    waiter.acquire()

top showed the following when it was stuck with 1 worker.

top - 08:34:33 up 15 days, 20:03,  0 users,  load average: 0.37, 0.39, 0.36
Tasks: 894 total,   1 running, 892 sleeping,   0 stopped,   1 zombie
Cpu(s):  7.2%us,  2.8%sy,  0.0%ni, 89.7%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  132196824k total, 131461528k used,   735296k free,   347448k buffers
Swap:  2047996k total,    22656k used,  2025340k free, 125226796k cached

@zym1010 (Contributor, Author) commented Apr 26, 2017

Another thing I found: if I modify the training code so that it doesn't go through all batches, say only training 50 batches,

if i >= 50:
    break

then the deadlock seems to disappear.

@zym1010 (Contributor, Author) commented Apr 27, 2017

Further testing seems to suggest that the freeze happens much more frequently if I run the program just after rebooting the computer. Once the machine has built up some filesystem cache, the freeze seems to happen less often.

@apaszke (Contributor) commented May 3, 2017

I tried, but I can't reproduce this bug in any way.

@tiancheng-zhi

I met a similar issue: the data loader stops when it finishes an epoch and is about to start a new one.

@tiancheng-zhi

Setting num_workers = 0 works. But the program slows down.

@zym1010 (Contributor, Author) commented May 9, 2017

@apaszke Have you tried rebooting the computer first and then running the program? For me, that reliably triggers the freeze. I just tried version 0.12, and it's still the same.

One thing I'd like to point out is that I installed PyTorch using pip, as I have an OpenBLAS-linked numpy installed and the MKL build from @soumith's Anaconda cloud wouldn't play well with it.

So essentially PyTorch is using MKL and numpy is using OpenBLAS. This may not be ideal, but I think it should have nothing to do with the issue here.

@apaszke (Contributor) commented May 9, 2017

I looked into it, but I could never reproduce it. MKL/OpenBLAS should be unrelated to this problem. It's probably some problem with the system configuration.

@zym1010 (Contributor, Author) commented May 9, 2017

@apaszke Thanks. I just tried the Python from the official Anaconda repo and an MKL-based PyTorch. Still the same problem.

@zym1010 (Contributor, Author) commented May 10, 2017

Tried running the code in Docker. Still stuck.

@jsainio commented Jun 7, 2017

We have the same problem, running the pytorch/examples ImageNet training example (resnet18, 4 workers) inside nvidia-docker, using 1 GPU out of 4. I'll try to gather a gdb backtrace if I manage to attach to the process.

At least OpenBLAS is known to have a deadlock issue in matrix multiplication, which occurs relatively rarely: OpenMathLib/OpenBLAS#937. This bug was present at least in OpenBLAS packaged in numpy 1.12.0.

@zym1010 (Contributor, Author) commented Jun 7, 2017

@jsainio I also tried a pure MKL-based PyTorch (with numpy linked against MKL as well), and got the same problem.

Also, this problem is solved (at least for me) if I turn off pin_memory for the DataLoader.
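A minimal sketch of that workaround, with a stand-in dataset rather than the ImageNet loader used in this thread: keep several workers, but leave pinned memory off.

# Sketch only: the random tensors stand in for a real dataset.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(2048, 3, 32, 32),
                        torch.randint(0, 1000, (2048,)))
loader = DataLoader(dataset,
                    batch_size=256,
                    shuffle=True,
                    num_workers=8,
                    pin_memory=False)  # disabling pinned memory avoided the hang here

for images, targets in loader:
    pass  # training step would go here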

@jsainio commented Jun 9, 2017

It looks as if two of the workers die out.

During normal operation:

root@b06f896d5c1d:~/mnt# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
user+        1 33.2  4.7 91492324 3098288 ?    Ssl  10:51   1:10 python -m runne
user+       58 76.8  2.3 91079060 1547512 ?    Rl   10:54   1:03 python -m runne
user+       59 76.0  2.2 91006896 1484536 ?    Rl   10:54   1:02 python -m runne
user+       60 76.4  2.3 91099448 1559992 ?    Rl   10:54   1:02 python -m runne
user+       61 79.4  2.2 91008344 1465292 ?    Rl   10:54   1:05 python -m runne

after locking up:

root@b06f896d5c1d:~/mnt# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
user+        1 24.8  4.4 91509728 2919744 ?    Ssl  14:25  13:01 python -m runne
user+       58 51.7  0.0      0     0 ?        Z    14:27  26:20 [python] <defun
user+       59 52.1  0.0      0     0 ?        Z    14:27  26:34 [python] <defun
user+       60 52.0  2.4 91147008 1604628 ?    Sl   14:27  26:31 python -m runne
user+       61 52.0  2.3 91128424 1532088 ?    Sl   14:27  26:29 python -m runne

For one of the still-remaining workers, the beginning of the gdb stack trace looks like this:

root@b06f896d5c1d:~/mnt# gdb --pid 60
GNU gdb (GDB) 8.0
Attaching to process 60
[New LWP 65]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f36f52af827 in do_futex_wait.constprop ()
   from /lib/x86_64-linux-gnu/libpthread.so.0

(gdb) bt
#0  0x00007f36f52af827 in do_futex_wait.constprop ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f36f52af8d4 in __new_sem_wait_slow.constprop.0 ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f36f52af97a in sem_wait@@GLIBC_2.2.5 ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f36f157efb1 in semlock_acquire (self=0x7f3656296458,
    args=<optimized out>, kwds=<optimized out>)
    at /home/ilan/minonda/conda-bld/work/Python-3.5.2/Modules/_multiprocessing/semaphore.c:307
#4  0x00007f36f5579621 in PyCFunction_Call (func=
    <built-in method __enter__ of _multiprocessing.SemLock object at remote 0x7f3656296458>, args=(), kwds=<optimized out>) at Objects/methodobject.c:98
#5  0x00007f36f5600bd5 in call_function (oparg=<optimized out>,
    pp_stack=0x7f36c7ffbdb8) at Python/ceval.c:4705
#6  PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>)
    at Python/ceval.c:3236
#7  0x00007f36f5601b49 in _PyEval_EvalCodeWithName (_co=<optimized out>,
    globals=<optimized out>, locals=<optimized out>, args=<optimized out>,
    argcount=1, kws=0x0, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0,
    closure=0x0, name=0x0, qualname=0x0) at Python/ceval.c:4018
#8  0x00007f36f5601cd8 in PyEval_EvalCodeEx (_co=<optimized out>,
    globals=<optimized out>, locals=<optimized out>, args=<optimized out>,
    argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0,
    defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:4039
#9  0x00007f36f5557542 in function_call (
    func=<function at remote 0x7f36561c7d08>,
    arg=(<Lock(release=<built-in method release of _multiprocessing.SemLock object at remote 0x7f3656296458>, acquire=<built-in method acquire of _multiprocessing.SemLock object at remote 0x7f3656296458>, _semlock=<_multiprocessing.SemLock at remote 0x7f3656296458>) at remote 0x7f3656296438>,), kw=0x0)
    at Objects/funcobject.c:627
#10 0x00007f36f5524236 in PyObject_Call (
    func=<function at remote 0x7f36561c7d08>, arg=<optimized out>,
    kw=<optimized out>) at Objects/abstract.c:2165
#11 0x00007f36f554077c in method_call (
    func=<function at remote 0x7f36561c7d08>,
    arg=(<Lock(release=<built-in method release of _multiprocessing.SemLock object at remote 0x7f3656296458>, acquire=<built-in method acquire of _multiprocessing.SemLock object at remote 0x7f3656296458>, _semlock=<_multiprocessing.SemLock at remote 0x7f3656296458>) at remote 0x7f3656296438>,), kw=0x0)
    at Objects/classobject.c:330
#12 0x00007f36f5524236 in PyObject_Call (
    func=<method at remote 0x7f36556f9248>, arg=<optimized out>,
    kw=<optimized out>) at Objects/abstract.c:2165
#13 0x00007f36f55277d9 in PyObject_CallFunctionObjArgs (
    callable=<method at remote 0x7f36556f9248>) at Objects/abstract.c:2445
#14 0x00007f36f55fc3a9 in PyEval_EvalFrameEx (f=<optimized out>,
    throwflag=<optimized out>) at Python/ceval.c:3107
#15 0x00007f36f5601166 in fast_function (nk=<optimized out>, na=1,
    n=<optimized out>, pp_stack=0x7f36c7ffc418,
    func=<function at remote 0x7f36561c78c8>) at Python/ceval.c:4803
#16 call_function (oparg=<optimized out>, pp_stack=0x7f36c7ffc418)
    at Python/ceval.c:4730
#17 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>)
    at Python/ceval.c:3236
#18 0x00007f36f5601b49 in _PyEval_EvalCodeWithName (_co=<optimized out>,
    globals=<optimized out>, locals=<optimized out>, args=<optimized out>,
    argcount=4, kws=0x7f36f5b85060, kwcount=0, defs=0x0, defcount=0,
    kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at Python/ceval.c:4018
#19 0x00007f36f5601cd8 in PyEval_EvalCodeEx (_co=<optimized out>,
    globals=<optimized out>, locals=<optimized out>, args=<optimized out>,
    argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0,
    defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:4039
#20 0x00007f36f5557661 in function_call (
    func=<function at remote 0x7f36e14170d0>,
    arg=(<ImageFolder(class_to_idx={'n04153751': 783, 'n02051845': 144, 'n03461385': 582, 'n04350905': 834, 'n02105056': 224, 'n02112137': 260, 'n03938244': 721, 'n01739381': 59, 'n01797886': 82, 'n04286575': 818, 'n02113978': 268, 'n03998194': 741, 'n15075141': 999, 'n03594945': 609, 'n04099969': 765, 'n02002724': 128, 'n03131574': 520, 'n07697537': 934, 'n04380533': 846, 'n02114712': 271, 'n01631663': 27, 'n04259630': 808, 'n04326547': 825, 'n02480855': 366, 'n02099429': 206, 'n03590841': 607, 'n02497673': 383, 'n09332890': 975, 'n02643566': 396, 'n03658185': 623, 'n04090263': 764, 'n03404251': 568, 'n03627232': 616, 'n01534433': 13, 'n04476259': 868, 'n03495258': 594, 'n04579145': 901, 'n04266014': 812, 'n01665541': 34, 'n09472597': 980, 'n02095570': 189, 'n02089867': 166, 'n02009229': 131, 'n02094433': 187, 'n04154565': 784, 'n02107312': 237, 'n04372370': 844, 'n02489166': 376, 'n03482405': 588, 'n04040759': 753, 'n01774750': 76, 'n01614925': 22, 'n01855032': 98, 'n03903868': 708, 'n02422699': 352, 'n01560419': 1...(truncated), kw={}) at Objects/funcobject.c:627
#21 0x00007f36f5524236 in PyObject_Call (
    func=<function at remote 0x7f36e14170d0>, arg=<optimized out>,
    kw=<optimized out>) at Objects/abstract.c:2165
#22 0x00007f36f55fe234 in ext_do_call (nk=1444355432, na=0,
    flags=<optimized out>, pp_stack=0x7f36c7ffc768,
    func=<function at remote 0x7f36e14170d0>) at Python/ceval.c:5034
#23 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>)
    at Python/ceval.c:3275
--snip--

@M-Eng commented Jun 9, 2017

I had a similar error log, with the main process stuck on self.data_queue.get().
For me the problem was that I used OpenCV as the image loader, and cv2.imread was hanging indefinitely, without raising any error, on one particular ImageNet image ("n01630670/n01630670_1010.jpeg").

If it works for you with num_workers = 0, that is not your problem, but I thought it might help some people with a similar error trace.
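A minimal debugging sketch along these lines: iterate the dataset directly in the main process (equivalent to num_workers=0) so the sample that hangs, e.g. an image the decoder cannot handle, can be identified. The function name is made up for illustration.

def find_hanging_sample(dataset):
    """Walk a map-style dataset in the main process and print each index;
    the last index printed before the hang points at the offending sample."""
    for idx in range(len(dataset)):
        print(idx, flush=True)
        _ = dataset[idx]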

@jsainio commented Jun 9, 2017

I'm running a test with num_workers = 0 currently, no hangs yet. I'm running the example code from https://github.com/pytorch/examples/blob/master/imagenet/main.py. pytorch/vision ImageFolder seems to use PIL or pytorch/accimage internally to load the images, so there's no OpenCV involved.

With num_workers = 4, I can occasionally get the first epoch to train and validate fully, and then it locks up in the middle of the second epoch. So it is unlikely to be a problem in the dataset/loading function.

It looks like a race condition in the loader that might be triggered relatively rarely by a certain hardware/software combination.

@jsainio commented Jun 9, 2017

@zym1010 thanks for the pointer, I'll try setting pin_memory = False too for the DataLoader.

@RaymondJiangkw

For those who are still stuck even after applying all of the methods above: remember to call model.module.forward instead of model.forward when validating your model during training, since model is an instance of DistributedDataParallel, and calling the wrapper inside a torch.no_grad() context may cause problems.

@yianzhongguo commented Sep 27, 2021

I also encountered a similar problem. I simplified train.py so that it only contains:

data_loader = CreateDataLoader(opt)
dataset = data_loader.load_data()
for i, data in enumerate(dataset, start=epoch_iter):
    print(i)

But it still got stuck, with no other output; even the "i" could not be printed when I set num_workers > 0 (even num_workers = 1). So I think the issue is caused by torch.utils.data.DataLoader. The strange thing is that this train.py ran fine only two weeks ago, and I have not changed anything on my server or in my code since then. My OS is CentOS 7.9. PyTorch 1.8.0 is installed in a Python 3.8 virtual environment created with conda (anaconda3-2021.05-Linux-x86_64). I do not use cv2 in my code, and only a small part of the memory and shared memory is used. When I set num_workers = 0 it works, but it is too slow.

@csvance commented Nov 12, 2021

I have tried everything mentioned here:

  • Increase ulimit
  • Increase SHM size
  • cv2 thread counts (this doesn't matter though because I am using spawn for workers when I do use them)
  • 0 workers for DataLoaders
  • Sleep between epochs / before evaluation

Still, for an object detection problem I am working on, I get a training deadlock after the first epoch 100% of the time when using DDP: the training process gets stuck waiting on the DataLoader. If I don't use DDP there is no deadlock; that is the only thing that fixes the issue for me. This happens with both PyTorch 1.10.0 / CUDA 11.3 and PyTorch 1.8.1 / CUDA 10.2.

Essentially, at the start of training there are 3 processes when doing DDP with 0 workers and 1 GPU. When the hang happens, the main training process gets stuck iterating over the DataLoader and drops to 0% CPU usage, while the other two processes stay at 100% CPU. With two GPUs we start with 4 processes: one of the training processes hangs, the other keeps using 100% CPU/GPU, and the remaining two processes use 100% CPU. Unfortunately, I have not been able to get a stack trace for any of the other processes.

Non-DDP training works flawlessly.

@leeeizhang

Not fixed yet...

@diaoenmao commented Dec 23, 2021

I recently came across a situation where I need to load many small images. My workstation has a CPU with 22 cores and four GPUs, so I run four experiments with different random seeds, each experiment using one separate GPU. I found that the run time of four processes is almost four times the run time of a single process (no parallel benefit).

The model I train is relatively small, and the most time-consuming part actually comes from data loading. I have tried many different approaches, including:

  • pin_memory = False/True
  • num_workers = 0/1/8
  • Increasing ulimit
  • Staggering the start of each experiment

Thanks to the system-level diagnosis by @vjorlikowski, we found out that whether we set num_workers to 0, 1, or 8, each process will try to use all CPU cores and viciously compete with the others for them.

Solution:
Use export OMP_NUM_THREADS=N or call torch.set_num_threads(N).
We set num_workers = 0 and N = 5 in our case, as we have 22 cores. The estimated run time of my program dropped from 12 days to 1.5 days.
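A minimal sketch of that fix: cap the intra-op thread count before data loading or training starts. N = 5 is the value used in the setup above; exporting OMP_NUM_THREADS=5 in the environment has the same intent.

import torch

torch.set_num_threads(5)  # limit this process to 5 CPU threads
# ... build the DataLoaders and start training after this point ...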

@The1912 commented May 2, 2022

(quoting @diaoenmao's solution above)

Thank you, I fixed it with torch.set_num_threads(N).

@opeide commented May 25, 2022

For me the issue was apparently in my training augmentations. In albumentations there are some augmentations that can loop forever, like RandomFog. I was only able to see where the code froze when I set num_workers=0.

@namespace-Pt

Any progress? I came across the same issue when using DDP.

Everything works fine without DDP. However, when I create a DataLoader for validation only on rank 0, that loader freezes if num_workers > 0.

@pcicales

@namespace-Pt I have experienced a similar issue, except with pin_memory=True when accumulating evaluation results. I have not yet tried @dem123456789's solution, but it seems to work for others. It would be great to have some official guidance on this issue from the devs, though.

@namespace-Pt

@pcicales Thanks, but it does not work for me (torch==1.10.1+cu111).

@lifangda01

The solution that worked for me is

import multiprocessing as mp

if __name__ == "__main__":
    mp.set_start_method('spawn')
    main()

After this I can use dataloader with num_workers > 0 and pin_memory = True without any problem.

@czy97 commented Aug 7, 2022

What about this: go to maskrcnn_benchmark/data/build.py and, at line 161, add "torch.multiprocessing.set_sharing_strategy('file_system')"

So the code will end like this:

    data_loaders = []
    
    torch.multiprocessing.set_sharing_strategy('file_system')
    for dataset in datasets:

That fixed it for me

This setting can solve the hanging problem, but it may cause some other error:

File "/mnt/lustre/pytorch1_12/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 444, in __iter__
    self._start_thread()
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
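For a generic script (outside maskrcnn_benchmark), the same workaround is a single call placed before any DataLoader is created; keep in mind the caveat above about possible "can't start new thread" errors.

import torch.multiprocessing

# Switch worker-to-main tensor transfer to the 'file_system' sharing strategy.
torch.multiprocessing.set_sharing_strategy('file_system')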

@thistlillo

I have this problem as of August 2022. The DataLoader freezes, mostly at random, whenever I use num_workers > 0.

@HazelHik commented Sep 6, 2022

Same problem with PyTorch 1.8 in Anaconda. The training gets stuck after finishing the first epoch.
The output looks like:
Epoch 2
0%| | 0/118680 [00:00<?, ?it/s]
and it hangs there for hours until I kill it.
It seems to happen randomly. I was able to reproduce this on a small dataset, but when I look into it using debug mode in PyCharm, it runs smoothly.

@ker2xu commented Sep 30, 2022

I would like to offer another solution for people who compile Python from source themselves: remember NOT to configure it with the --enable-profiling option. That option will prevent you from using num_workers > 0, and none of the solutions above will help.
Hope this reply helps.

@Hiroshiba

In my case, specifying persistent_workers=True as a DataLoader argument was all I needed.

I will write down the detailed conditions; I hope they are helpful to someone else.

  • I was using the latest official PyTorch Docker image.
  • The DataLoader hung after a random number of epochs.
  • Nothing was shown in the error log.
  • It worked correctly with num_workers=0.
  • I don't use OpenCV, but I do use wandb and tensorboard.
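A minimal sketch of that setting, with a stand-in dataset: persistent_workers=True keeps the worker processes alive across epochs instead of tearing them down and respawning them at each epoch boundary.

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(1000).float())
loader = DataLoader(dataset,
                    batch_size=32,
                    num_workers=4,
                    persistent_workers=True)  # requires num_workers > 0

for epoch in range(3):
    for (batch,) in loader:
        pass  # training step would go here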
