Severe memory leaks when num_workers != 0 #7786
Comments
I haven't encountered this problem yet. You can try the memory profiler hook added in #7560 to monitor your memory usage. |
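A minimal config sketch for trying that hook is below; it assumes the hook from #7560 is registered as MemoryProfilerHook and that the psutil and memory_profiler packages are installed (check the PR for the exact name and options):

```python
# Hypothetical config snippet: enable the memory profiler hook from #7560 to
# log CPU/virtual memory statistics during training.
custom_hooks = [
    dict(type='MemoryProfilerHook', interval=50),  # log memory usage every 50 iterations
]
```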
Thank you @RangiLyu, I'll try the memory profiler. If anyone else is experiencing this, please let me know. I'll post results here when I get around to testing for leaks. |
I am experiencing the same problem with almost the same environment. Severe memory leak growing at each epoch. |
@RangiLyu @hhaAndroid @hellock perhaps this issue should be elevated? It seems like many people are experiencing the same problem. This is not due to single-epoch train/eval, as the memory consumption continues to increase with each completed epoch (i.e. it is not related to #1956). There are big jumps in CPU memory consumption after each eval round, which makes me think that the largest leaks occur when loading the val/test annotations (the eval annotations seem to be re-copied in memory after each eval round). This is strange, because I believe the eval hook only loads the annotations once.

I have looked into this issue more with respect to my code, and it seems likely that the leak depends on how PyTorch instantiates the workers and loads the data. Let me know if I can help in any way with locating the leak; I won't have time in the next few weeks, but should be freer to look carefully in the second half of May.

pytorch/pytorch#13246 describes the issue with respect to PyTorch and the solutions implemented there. One cause is Python lists copying their data contents into every worker, which is a known source of leaks when num_workers > 0. The solution they most often propose (replacing all instantiated lists with either numpy arrays or tensors) is here: pytorch/pytorch#13246 (comment). Here is another solution if the leak is linked to multithreading with PyTorch and OpenCV (also a common issue; this solution may impact performance). I mention it because the leak seems more severe when I use pipelines that leverage OpenCV, which may also mean there are two leaks: pytorch/pytorch#1355 (comment).

Again, thank you for your help with this! |
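For reference, here is a minimal sketch of the first workaround (not MMDetection code; the class name PackedAnnotations and the use of pickle are illustrative assumptions). The idea from pytorch/pytorch#13246 is that a Python list of dicts held by the dataset gets its reference counts touched in every forked worker, so copy-on-write pages are duplicated and memory appears to grow each epoch; packing the records into flat numpy buffers avoids that:

```python
import pickle

import numpy as np
from torch.utils.data import Dataset


class PackedAnnotations(Dataset):
    """Stores annotation records in flat numpy buffers instead of a Python list,
    so forked dataloader workers do not trigger copy-on-write duplication."""

    def __init__(self, data_infos):
        buffers = [pickle.dumps(info) for info in data_infos]
        self._addr = np.cumsum([len(b) for b in buffers])               # end offset of each record
        self._data = np.frombuffer(b''.join(buffers), dtype=np.uint8)   # one contiguous byte buffer

    def __len__(self):
        return len(self._addr)

    def __getitem__(self, idx):
        start = 0 if idx == 0 else int(self._addr[idx - 1])
        end = int(self._addr[idx])
        return pickle.loads(self._data[start:end].tobytes())
```

For the OpenCV interaction from pytorch/pytorch#1355, the usual mitigation is to call cv2.setNumThreads(0) before the dataloader workers are created, at the possible cost of some CPU throughput.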
@choasup Was this during a single epoch? It seems that your graph shows a leak of ~100 MB? I may be reading it incorrectly. |
@UninaLabs-EO are you running eval once after each training epoch? |
After every 3 training epochs. |
Just to make sure I understand: in your config files, did you set workflow = ('train', 1)? |
(Train,1),(Val,1)
|
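For context, the two workflow settings being discussed would look roughly like this in an MMDetection config (a sketch; the rest of the config is omitted):

```python
# Train-only workflow; validation is run by the EvalHook instead.
workflow = [('train', 1)]

# Alternating workflow, i.e. the "(Train,1),(Val,1)" mentioned above.
workflow = [('train', 1), ('val', 1)]
```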
@UninaLabs-EO then it seems to be the same issue as mine, if you see a large memory jump after each epoch. It suggests there is a leak when loading the eval annotations, which is strange because, if I understand their scripts correctly, they keep the eval annotations in memory across eval rounds (only loading them once, when the eval hook is first called in the training pipeline). |
Correct, but it also increases after each training epoch.
|
@pcicales @UninaLabs-EO @choasup Thanks for your detailed report! One possible reason is that we use "fork" as the default multi-processing start method in PR #6974, in order to speed up the data loader workers' start-up time. You can set mp_start_method='spawn' in the config to use the "spawn" start method instead. |
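As a sketch of what that change looks like in a config file (assuming the keys introduced in #6974; check your default_runtime.py for the exact names and defaults):

```python
# Multi-processing settings from #6974; 'fork' is the current default.
mp_start_method = 'spawn'   # start dataloader workers with 'spawn' instead of 'fork'
opencv_num_threads = 0      # disable OpenCV multithreading inside workers
```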
@pcicales Thank you very much for your detailed feedback. It is true that a large part of mmdet's memory leaks come from the dataloader, which is a known issue. If you want to solve this problem, one way is to use spawn for the worker processes, and the other is to convert data_info into an np.array object. We also look forward to your feedback, thank you very much! |
Oh thanks, could you expand on the first solution?
|
@RangiLyu, I tried to set
in the config to avoid the memory leak; however, an error happened:
|
Is it just a matter of putting mp_start_method='spawn' in the config?
|
Do you have any guides/tutorials for solving this problem?
|
any update? thanks |
any updates? |
Thanks for your error report and we appreciate it a lot.
Checklist
Describe the bug
Any model trained on a single GPU or on multiple GPUs with train num_workers > 0 causes a severe CPU memory leak (no GPU memory leaks detected). The memory consumption follows a sawtooth pattern, gradually increasing until there is an out-of-memory error. I have now tested this with several datasets and models, with mmcv-full == 1.4.8 and mmdet == 2.23.0. Based on my investigation of the bug, it is unclear whether this is due to PyTorch or to MMDetection code. Does your team have a protocol for detecting memory leaks in your code? It may be useful for identifying the issue.
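As one possible protocol (not an official MMDetection tool; the hook name RSSLoggerHook is made up for this sketch and psutil is assumed to be installed), the sawtooth pattern can be tracked by logging the process RSS after every epoch with a small custom hook:

```python
import psutil
from mmcv.runner import HOOKS, Hook


@HOOKS.register_module()
class RSSLoggerHook(Hook):
    """Logs the resident set size (CPU memory) of the main process after each epoch."""

    def after_epoch(self, runner):
        rss_mb = psutil.Process().memory_info().rss / 1024 ** 2
        runner.logger.info(f'[RSSLoggerHook] epoch {runner.epoch}: RSS = {rss_mb:.1f} MB')


# In the config:
# custom_hooks = [dict(type='RSSLoggerHook')]
```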
Tested with several default config models with default and custom datasets.
No changes made.
Environment
sys.platform: linux
Python: 3.7.13 (default, Mar 29 2022, 02:18:16) [GCC 7.5.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: Quadro RTX 8000
CUDA_HOME: /usr
NVCC: Cuda compilation tools, release 10.0, V10.0.130
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
PyTorch: 1.11.0
PyTorch compiling details: PyTorch built with:
TorchVision: 0.12.0
OpenCV: 4.5.5
MMCV: 1.4.8
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.3
MMDetection: 2.23.0+e97e900
Bug fix
If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!