
Severe memory leaks when num_workers != 0 #7786

Open
pcicales opened this issue Apr 21, 2022 · 20 comments
pcicales commented Apr 21, 2022

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. I have read the FAQ documentation but cannot get the expected help.
  3. The bug has not been fixed in the latest version.

Describe the bug

Any model trained on one or multiple GPUs with train num_workers > 0 causes a severe CPU memory leak (no GPU memory leaks detected). Memory consumption follows a sawtooth pattern, gradually increasing until a memory error occurs. I have now tested this with several datasets and models, mmcv_full == 1.4.8, mmdet == 2.23.0. Based on my investigation of the bug, it is unclear whether this is due to PyTorch or mmdetection code. Does your team have a protocol for detecting memory leaks in your code? It may be useful for identifying the issue.

  1. What command or script did you run?

Tested with several default config models with default and custom datasets.

  2. Did you make any modifications on the code or config? Did you understand what you have modified?

No changes made.

  3. What dataset did you use?

Environment

sys.platform: linux
Python: 3.7.13 (default, Mar 29 2022, 02:18:16) [GCC 7.5.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: Quadro RTX 8000
CUDA_HOME: /usr
NVCC: Cuda compilation tools, release 10.0, V10.0.130
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
PyTorch: 1.11.0
PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 11.3
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  • CuDNN 8.2
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

TorchVision: 0.12.0
OpenCV: 4.5.5
MMCV: 1.4.8
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.3
MMDetection: 2.23.0+e97e900

Bug fix
If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

@RangiLyu
Member

I haven't run into this problem yet. You can try the memory profiler hook, which was added in #7560, to monitor your memory usage.
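For anyone who wants to try it, enabling the hook should be roughly a one-line config change. A minimal sketch, assuming an mmdet version that registers the MemoryProfilerHook from #7560 and that memory_profiler and psutil are installed:

```python
# Minimal sketch: add the memory profiler hook from #7560 to a training config.
# Assumes `pip install memory_profiler psutil` and an mmdet version that
# registers MemoryProfilerHook.
custom_hooks = [
    # Logs the process's memory usage every `interval` training iterations.
    dict(type='MemoryProfilerHook', interval=50),
]
```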

@pcicales
Author

Thank you @RangiLyu, I'll try to use the memory profiler.

If anyone else is experiencing this, please let me know. I'll post results here when I get around to testing for leaks.
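In the meantime, a quick way to watch CPU memory growth is to log the process RSS at epoch or iteration boundaries. A minimal sketch, assuming psutil is installed; the helper name is illustrative, not part of mmdetection:

```python
# Hypothetical helper for watching CPU (resident) memory growth during training.
# Assumes `pip install psutil`; not part of mmdetection.
import os

import psutil


def rss_mib() -> float:
    """Return the resident set size of the current process in MiB."""
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2


# Example: call it at epoch boundaries and compare the trend across epochs.
# print(f'epoch {epoch}: {rss_mib():.1f} MiB')
```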

@sirbastiano

I am experiencing the same problem with almost the same environment. Severe memory leak rising at each epoch...

@choasup

choasup commented Apr 24, 2022

[image: memory usage graph]

Same problem, memory leak...

@pcicales
Author

pcicales commented Apr 24, 2022

@RangiLyu @hhaAndroid @hellock perhaps this issue should be elevated? It seems many are experiencing the same problem. This is not due to single-epoch train/eval, as memory consumption continues to increase with each completed epoch (i.e. it is not related to #1956). There are big jumps in CPU memory consumption after each eval round, which makes me think the largest leaks occur when loading the val/test annotations (it seems the eval annotations are re-copied in memory after each eval round). This is strange because I believe the eval hook only uses num_workers == 0. Also, after further testing I found the same leak (less severe, which is why I didn't spot it earlier) with num_workers == 1 (I have corrected the issue title and post).

I have looked into this issue more with respect to my code, and it seems likely that this is due to PyTorch, depending on how workers are instantiated and data is loaded. Let me know if I can help in any way with locating the leak; I won't have time in the next few weeks but should be freer to look carefully in the latter half of May.

pytorch/pytorch#13246 describes the issue with respect to PyTorch and the solutions implemented there. One of the problems is lists copying data content (Python objects in forked workers trigger copy-on-write as soon as their refcounts are touched), which is a known cause of leak-like memory growth when num_workers > 0 in PyTorch. I believe some lists are instantiated in your dataloader scripts in the same way mentioned in that thread; let me know if I am wrong (I am referring to the data_infos list generated in load_annotations for COCO, among other lists in other data pipeline scripts). That PyTorch issue has been open for a long time, but you will see that it persists in recent releases of PyTorch. I suspect this may be the cause of at least one leak in mmdetection.

The solution they most often propose is here (replacing all instantiated lists with either np arrays or tensors): pytorch/pytorch#13246 (comment)
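To make that workaround concrete, here is a minimal sketch of the pattern proposed in that thread, applied to a generic list of per-sample dicts: the records are pickled once into a flat NumPy byte buffer plus an offset array, so forked workers never walk a large tree of Python objects. The class and variable names are illustrative, not mmdetection API:

```python
# Sketch of the workaround from pytorch/pytorch#13246: keep per-sample metadata
# in flat NumPy buffers instead of a Python list of dicts, so that fork()ed
# DataLoader workers don't trigger copy-on-write page growth via refcounts.
import pickle

import numpy as np
from torch.utils.data import Dataset


class PackedAnnotations(Dataset):
    def __init__(self, data_infos):
        # Serialize every record once and pack them into a single byte buffer.
        blobs = [pickle.dumps(info, protocol=pickle.HIGHEST_PROTOCOL)
                 for info in data_infos]
        self._addr = np.cumsum([len(b) for b in blobs])            # end offsets
        self._data = np.frombuffer(b''.join(blobs), dtype=np.uint8)

    def __len__(self):
        return len(self._addr)

    def __getitem__(self, idx):
        start = 0 if idx == 0 else int(self._addr[idx - 1])
        end = int(self._addr[idx])
        # Decode a single record on demand; no large Python object is shared.
        return pickle.loads(self._data[start:end].tobytes())
```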

Here is another solution if the leak is linked to multithreading with PyTorch and OpenCV (also a common issue). That fix may impact performance; I mention it because the leak seems to be more severe when I use pipelines that rely on OpenCV, which may also mean there are two separate leaks: pytorch/pytorch#1355 (comment)
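As an illustration of that second mitigation, a minimal sketch of a DataLoader worker_init_fn that turns off OpenCV's own threading inside each worker (the function name is just an example, and this may cost some augmentation throughput):

```python
# Sketch: disable OpenCV's internal thread pool (and OpenCL) inside each
# DataLoader worker, so threads and their buffers don't multiply per worker.
import cv2


def opencv_safe_worker_init(worker_id):
    cv2.setNumThreads(0)          # keep OpenCV single-threaded in this worker
    cv2.ocl.setUseOpenCL(False)   # avoid initializing the OpenCL runtime


# usage: DataLoader(dataset, num_workers=4, worker_init_fn=opencv_safe_worker_init)
```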

Again thank you for your help with this!

@pcicales changed the title from "Severe memory leaks when num_workers != 0/1" to "Severe memory leaks when num_workers != 0" on Apr 24, 2022
@pcicales
Author

pcicales commented Apr 24, 2022

> [image: memory usage graph]
>
> Same problem, memory leak...

@choasup Was this during a single epoch? It seems your graph shows a leak of ~100 MB? I may be reading it incorrectly.

@pcicales
Author

> I am experiencing the same problem with almost the same environment. Severe memory leak rising at each epoch...

@UninaLabs-EO are you running eval once after each training epoch?

@sirbastiano

sirbastiano commented Apr 24, 2022 via email

@pcicales
Author

pcicales commented Apr 24, 2022

> After each 3 training epochs

Just to make sure I understand: in your config files, did you set workflow = ('train', 3)? Do you see more of the memory leak after each eval round?

@sirbastiano

sirbastiano commented Apr 24, 2022 via email

@pcicales
Author

pcicales commented Apr 24, 2022

> (Train,1),(Val,1)

@UninaLabs-EO then it seems to be the same issue as mine, if you see a large memory jump after each epoch. It suggests there is a leak when loading the eval annotations, which is strange because, if I understand their scripts correctly, the eval annotations are kept in memory across successive eval rounds (they are only loaded once, when the eval hook is first called in the training pipeline).

@sirbastiano

sirbastiano commented Apr 24, 2022 via email

@RangiLyu
Member

RangiLyu commented Apr 25, 2022

@pcicales @UninaLabs-EO @choasup Thanks for your detailed reports! One possible reason is that we use "fork" as the default multiprocessing start method in PR #6974 in order to speed up data loader worker startup. You can set mp_start_method = 'spawn' to see if the memory leak problem persists.
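For anyone trying this, a minimal config sketch showing where those settings live (assuming an mmdet/mmcv version whose multi-processing setup reads these top-level keys):

```python
# Sketch: runtime settings placed at the top level of the training config.
# Assumes an mmdet version that reads these keys in its multi-processing setup.
mp_start_method = 'spawn'   # use 'spawn' instead of the default 'fork'
opencv_num_threads = 0      # optionally also cap OpenCV threads per worker
```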

@hhaAndroid
Collaborator

hhaAndroid commented Apr 25, 2022

@pcicales
Thank you very much for your detailed feedback. It is true that a large part of mmdet's memory growth comes from the dataloader, which is a known issue. If you want to address this, one way is to use spawn for the worker processes, and the other is to convert data_infos to an np.array object. We also look forward to your feedback, thank you very much!

@sirbastiano

sirbastiano commented Apr 25, 2022 via email

@liming-ai

> @pcicales @UninaLabs-EO @choasup Thanks for your detailed reports! One possible reason is that we use "fork" as the default multiprocessing start method in PR #6974 in order to speed up data loader worker startup. You can set mp_start_method = 'spawn' to see if the memory leak problem persists.

@RangiLyu, I tried to set

opencv_num_threads = 0
mp_start_method = 'spawn'

in the config to avoid the memory leak; however, an error occurred:

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 718782 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 718783 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 718785 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 718786 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 718787 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 718788 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 718789 closing signal SIGTERM
/usr/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 11 leaked semaphores to clean up at shutdown
  len(cache))
/usr/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 11 leaked semaphores to clean up at shutdown
  len(cache))
/usr/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 11 leaked semaphores to clean up at shutdown
  len(cache))
/usr/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 11 leaked semaphores to clean up at shutdown
  len(cache))
/usr/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 11 leaked semaphores to clean up at shutdown
  len(cache))
/usr/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 11 leaked semaphores to clean up at shutdown
  len(cache))
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 2 (pid: 718784) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError

@sirbastiano

sirbastiano commented Oct 11, 2022 via email

@sirbastiano

sirbastiano commented Oct 11, 2022 via email

@xrrain

xrrain commented May 8, 2023

Any update? Thanks.

@samernanoxx

Any updates?
