possible deadlock in dataloader #1355
The bug is described at pytorch/examples#148. I just wonder if this is a bug in PyTorch itself, as the example code looks clean to me. Also, I wonder if this is related to #1120.

Comments
How much free memory do you have when the loader stops?
@apaszke if I check
Also, I don't understand why it always stops at the beginning of validation, but never anywhere else.
Possibly because for validation a separate loader is used that pushes the use of shared memory over the limit.
I just ran the program again and got stuck. Output of
Output of
Output of
I don't think it's a memory issue.
There are separate limits for shared memory. Can you try
How do they look for you? As for fewer workers, I believe it won't happen that often (I can try now), but I think in practice I need that many workers.
You have a maximum of 4096 shared memory segments allowed; maybe that's the issue. You can try increasing that by writing to
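For reference, a minimal sketch (assuming a Linux host) of how to print the current System V shared-memory limits; these are standard kernel sysctl entries, nothing PyTorch-specific:

```python
# Minimal sketch, assuming a Linux host: print the System V shared-memory
# limits (max number of segments, max segment size, total pages) discussed
# above. These are standard kernel settings, not PyTorch settings.
from pathlib import Path

for name in ("shmmni", "shmmax", "shmall"):
    path = Path("/proc/sys/kernel") / name
    if path.exists():
        print(f"{name} = {path.read_text().strip()}")
    else:
        print(f"{name}: not available on this system")
```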
@apaszke well, these are the default values on both Ubuntu and CentOS 6... is that really an issue?
@apaszke when running the training program,
@apaszke I tried running the program (still 22 workers) with the following shared-memory settings, and it got stuck again.
I didn't try one worker. First, that would be slow; second, if the problem is really a deadlock, then it would definitely disappear.
@zym1010 default settings don't have to be chosen with such workloads in mind, so yes, it might have been an issue. It wouldn't definitely disappear, because if the problem is really there, then it's likely a deadlock between a worker and the main process, and one worker might be enough to trigger it. Anyway, I can't fix the issue until I can reproduce it. What parameters are you using to run the example, and did you modify the code in any way? Also, what's the value of
@apaszke Thanks, I understand your analysis much better now. All the results shown to you so far were produced on an Ubuntu 14.04 machine with 64GB RAM, dual Xeon, and a Titan Black (there's also a K40, but I didn't use it). The command to reproduce the problem is
I installed pytorch through pip, on Python 3.5. The pytorch version is
BTW, I also tried using 1 worker, but I did that on another machine (128GB RAM, dual Xeon, 4 Pascal Titan X, CentOS 6). I ran it using
the
Another thing I found: if I modify the training code so that it doesn't go through all the batches (say, training only 50 batches),
then the deadlock seems to disappear.
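For clarity, a self-contained sketch of that workaround; the dataset, batch size, and worker count below are placeholders, not the values from the original script:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Sketch of the workaround described above: stop the epoch after a fixed
# number of batches instead of draining the whole loader. The dataset,
# batch size, and worker count are illustrative placeholders.
dataset = TensorDataset(torch.randn(1000, 3, 32, 32), torch.randint(0, 10, (1000,)))
loader = DataLoader(dataset, batch_size=32, num_workers=22)

MAX_BATCHES = 50
for i, (inputs, targets) in enumerate(loader):
    if i >= MAX_BATCHES:
        break  # leave the remaining batches unread
    # ... forward / backward / optimizer step would go here ...
```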
Further testing suggests that this freeze happens much more often if I run the program just after rebooting the computer. Once the system has some cache built up, the freeze seems to occur less frequently.
I tried, but I can't reproduce this bug in any way.
I hit a similar issue: the data loader stops when it finishes an epoch and is about to start a new one.
Setting num_workers = 0 works, but the program slows down.
@apaszke have you tried first rebooting the computer and then running the programs? For me, this guarantees the freeze. I just tried version 0.12, and it's still the same. One thing I'd like to point out is that I installed pytorch using
So essentially pytorch is using MKL and numpy is using OpenBLAS. This may not be ideal, but I think this should have nothing to do with the issue here.
I looked into it, but I could never reproduce it. MKL/OpenBLAS should be unrelated to this problem. It's probably some problem with the system configuration.
@apaszke thanks. I just tried the Python from the official Anaconda repo and an MKL-based pytorch. Still the same problem.
Tried running the code in Docker. Still stuck.
We have the same problem, running the pytorch/examples imagenet training example (resnet18, 4 workers) inside nvidia-docker using 1 GPU out of 4. I'll try to gather a gdb backtrace if I manage to get to the process. At least OpenBLAS is known to have a deadlock issue in matrix multiplication, which occurs relatively rarely: OpenMathLib/OpenBLAS#937. This bug was present at least in the OpenBLAS packaged in numpy 1.12.0.
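Not a confirmed fix for this thread, but a common way to rule out that kind of BLAS-level deadlock is to cap the BLAS/OpenMP thread pools before numpy and torch are imported; a rough sketch:

```python
# Hedged sketch: cap the BLAS/OpenMP thread pools before numpy/torch are
# imported, to rule out OpenBLAS-style threading deadlocks. This is a
# diagnostic workaround, not a confirmed fix for the issue in this thread.
import os

os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")

import numpy as np  # noqa: E402  (imported after the env vars on purpose)
import torch        # noqa: E402
```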
@jsainio I also tried a pure MKL-based PyTorch (numpy linked with MKL as well), and I get the same problem. Also, this problem is solved (at least for me) if I turn off
It looks as if two of the workers die out. During normal operation:
after locking up:
For one of the still-remaining workers, the beginning of the gdb stack trace looks like:
I had a similar error log, with the main process stuck on self.data_queue.get(). If, as you said, it's working for you with num_workers = 0, then it's not that. But I thought it might help some people with a similar error trace.
I'm running a test with
With
It looks something like a race condition in ImageLoader which might be triggered relatively rarely by a certain hardware/software combination.
@zym1010 thanks for the pointer, I'll try setting
To those who are still stuck on this problem even after all of the above methods have been applied: remember to use
I also encountered a similar problem. I simplified train.py to contain only
I have tried everything mentioned here:
Still, for an object detection problem I am working on, I get a training deadlock after the first epoch 100% of the time when using DDP: the training process gets stuck waiting on the DataLoader. If I don't use DDP there is no deadlock; that is the only thing that fixes the issue for me. This happens with PyTorch 1.10.0 / CUDA 11.3 and with PyTorch 1.8.1 / CUDA 10.2. Essentially, at the start of training there are 3 processes when doing DDP with 0 workers and 1 GPU. When the hang happens, the main training process gets stuck iterating over the dataloader and goes to 0% CPU usage, while the other two processes are at 100% CPU. When using two GPUs, we start with 4 processes; one of the training processes hangs, the other uses 100% CPU/GPU, and the other two processes use 100% CPU. Unfortunately, I have not been able to get a stack trace for any of the other processes. Non-DDP training works flawlessly.
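In case it helps with the missing stack traces: the standard-library faulthandler module can dump every thread's stack after a timeout; a sketch (the 30-minute timeout is an arbitrary choice, not something from this thread):

```python
# Sketch: dump the stack of every thread if the process makes no progress
# for 30 minutes, using only the standard library. Put this near the top of
# the training script; the timeout value is arbitrary.
import faulthandler

faulthandler.dump_traceback_later(1800, repeat=True)

# ... training loop ...
# faulthandler.cancel_dump_traceback_later()  # call once training finishes
```

For processes that are already hung, an external sampler such as py-spy (py-spy dump --pid <pid>) can usually attach and print a stack without restarting the run.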
Not fixed yet...
I recently came across a situation where I need to load many small images. My workstation has a CPU with 22 cores and four GPUs, so I run four experiments with different random seeds, each experiment using its own GPU. I found that the run time of four processes is almost four times the run time of a single process (no parallel benefit). The model I train is relatively small, and the most time-consuming part actually comes from data loading. I have tried many different approaches, including:
Thanks to the system-level diagnosis by @vjorlikowski, we found out that if we set num_workers = 0/1/8, each process will try to use all CPU cores and viciously compete with the others for them. Solution:
Thank you, I fixed it with torch.set_num_threads(N).
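For anyone reading along, a sketch of what such a cap could look like; the divisor (number of concurrent experiments) is an assumption based on the 22-core / four-experiment setup described above, not a value taken from this thread:

```python
import torch

# Sketch: cap the intra-op thread pool so that several concurrent training
# processes on one machine do not all try to grab every CPU core. The
# divisor is an assumption (cores split across concurrent experiments).
NUM_EXPERIMENTS = 4
torch.set_num_threads(max(1, torch.get_num_threads() // NUM_EXPERIMENTS))
```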
For me the issue was apparently in my training augmentations. In albumentations there are some augmentations that can loop infinitely, like RandomFog. I was only able to see where the code froze when I set num_workers=0.
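A sketch of that debugging pattern; ToyDataset below is a stand-in for the real dataset with its albumentations pipeline:

```python
import torch
from torch.utils.data import DataLoader, Dataset

# Debugging sketch: run the Dataset (and its augmentations) in the main
# process with num_workers=0, so that a Ctrl-C traceback or a debugger
# points at the exact augmentation call that is looping. ToyDataset is a
# stand-in for the real dataset.
class ToyDataset(Dataset):
    def __len__(self):
        return 64

    def __getitem__(self, idx):
        image = torch.rand(3, 64, 64)
        # the real __getitem__ would apply the augmentation pipeline here
        return image, idx % 10

debug_loader = DataLoader(ToyDataset(), batch_size=8, num_workers=0)
for images, labels in debug_loader:
    pass  # if this hangs, interrupt it and read the traceback
```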
Any progress? I came across the same issue when using DDP. Everything works fine without DDP. However, when I create a dataloader for validation only on
@namespace-Pt I have experienced a similar issue, except with
@pcicales Thanks, but it does not work for me (torch==1.10.1+cu111).
The solution that worked for me is
After this I can use the dataloader with
This setting can solve the hanging problem, but it may cause some other error:
I have this problem as of August 2022. The dataloader freezes, mostly randomly, whenever I use num_workers > 0.
Same problem with pytorch 1.8 in Anaconda. The training gets stuck after finishing the first epoch.
I would like to offer another solution, for people who compile Python from source themselves. Remember NOT to
In my case, specifying
I will write the detailed conditions. I hope it will be helpful to someone else.