This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

RAM leak problem when training in command batch_queue.get(). Where do you release resources once a batch is trained. #26

Open
vivasvan1 opened this issue Sep 18, 2020 · 3 comments

Comments

@vivasvan1

vivasvan1 commented Sep 18, 2020

I have noticed that your training loop leaks small amounts of RAM on every step. Any idea what may be causing this?

time taken= 9.865329265594482 | steps= 1 | cpu= 51.8 | ram= 34.50078675328186 | gpu= [3101]
[5613]
time taken= 0.934636116027832 | steps= 2 | cpu= 27.0 | ram= 29.34866251942084 | gpu= [5613]
[3045]
time taken= 0.8695635795593262 | steps= 3 | cpu= 29.4 | ram= 29.217970957706278 | gpu= [3045]
[3021]
time taken= 0.8483304977416992 | steps= 4 | cpu= 29.8 | ram= 29.033316428574086 | gpu= [3021]
[2997]
time taken= 0.8630681037902832 | steps= 5 | cpu= 30.2 | ram= 28.87988403913803 | gpu= [2997]
[2997]
time taken= 0.8645083904266357 | steps= 6 | cpu= 29.4 | ram= 28.714746447210654 | gpu= [2997]
[2997]
time taken= 0.864253044128418 | steps= 7 | cpu= 29.3 | ram= 28.573093657739385 | gpu= [2997]
[2997]
time taken= 0.8693573474884033 | steps= 8 | cpu= 29.3 | ram= 28.389703885656044 | gpu= [2997]
[2997]
time taken= 0.8704898357391357 | steps= 9 | cpu= 29.4 | ram= 28.298690976454438 | gpu= [2997]
[2997]
time taken= 0.8670341968536377 | steps= 10 | cpu= 29.5 | ram= 28.13385097442091 | gpu= [2997]
[2997]
time taken= 0.8750414848327637 | steps= 11 | cpu= 29.5 | ram= 27.959884882309396 | gpu= [2997]
[2997]
time taken= 0.8624210357666016 | steps= 12 | cpu= 29.9 | ram= 27.784356443255188 | gpu= [2997]
[2997]
time taken= 0.8561670780181885 | steps= 13 | cpu= 29.8 | ram= 27.644241201568796 | gpu= [2997]
[2997]
time taken= 0.8609695434570312 | steps= 14 | cpu= 29.7 | ram= 27.51883186047002 | gpu= [2997]
[2997]
time taken= 0.8462607860565186 | steps= 15 | cpu= 29.7 | ram= 27.36641623650461 | gpu= [2997]
[2997]
time taken= 0.8624782562255859 | steps= 16 | cpu= 29.2 | ram= 27.23760941078441 | gpu= [2997]
[2997]
time taken= 0.8649694919586182 | steps= 17 | cpu= 29.4 | ram= 27.113514425050127 | gpu= [2997]
[2997]
time taken= 0.8661544322967529 | steps= 18 | cpu= 29.3 | ram= 27.004993310427178 | gpu= [2997]
[2997]
time taken= 0.8687705993652344 | steps= 19 | cpu= 29.8 | ram= 26.82090916192486 | gpu= [2997]
[2997]
time taken= 0.8823645114898682 | steps= 20 | cpu= 29.6 | ram= 26.688630454109777 | gpu= [2997]
[2997]
time taken= 0.8795809745788574 | steps= 21 | cpu= 29.4 | ram= 26.517987449146226 | gpu= [2997]
[2997]
time taken= 0.8857841491699219 | steps= 22 | cpu= 29.1 | ram= 26.40289455770082 | gpu= [2997]
[2997]
time taken= 0.8605339527130127 | steps= 23 | cpu= 29.5 | ram= 26.274509317663572 | gpu= [2997]
[2997]
time taken= 0.8524265289306641 | steps= 24 | cpu= 29.8 | ram= 26.16445065525575 | gpu= [2997]
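For reference, this is a minimal sketch of how per-step stats in this format can be collected with psutil and nvidia-smi. It is not the repository's actual logging code, and `train_one_step` is only a placeholder:

```python
import subprocess
import psutil
from timeit import default_timer

def gpu_memory_used_mib():
    # Used GPU memory in MiB for every visible GPU, queried via nvidia-smi.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]
    )
    return [int(x) for x in out.decode().split()]

def train_one_step():
    pass  # placeholder for the actual training step

t0 = default_timer()
for step in range(1, 25):
    train_one_step()
    print(
        "time taken=", default_timer() - t0,
        "| steps=", step,
        "| cpu=", psutil.cpu_percent(),
        "| ram=", psutil.virtual_memory().available * 100 / psutil.virtual_memory().total,
        "| gpu=", gpu_memory_used_mib(),
    )
    t0 = default_timer()
```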

@vivasvan1
Author

vivasvan1 commented Sep 18, 2020

Can you check whether this happens only on my PC, or whether it happens with your setup too?

Also, is there a way to train in MXNet without loading the full dataset into memory?
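One option, sketched here as an assumption on my part (it uses the Gluon data API; `LazyImagePairDataset` and the path list are hypothetical names, not code from this repository), is a Dataset that keeps only file paths in memory and decodes each sample on demand, so the DataLoader never holds the full dataset in RAM:

```python
import mxnet as mx
from mxnet.gluon.data import Dataset, DataLoader

class LazyImagePairDataset(Dataset):
    def __init__(self, file_pairs):
        # file_pairs: list of (img1_path, img2_path) tuples; only the paths are kept in memory.
        self.file_pairs = file_pairs

    def __len__(self):
        return len(self.file_pairs)

    def __getitem__(self, idx):
        p1, p2 = self.file_pairs[idx]
        # Images are decoded from disk only when this sample is requested.
        img1 = mx.image.imread(p1)
        img2 = mx.image.imread(p2)
        return img1, img2

# Usage sketch:
# loader = DataLoader(LazyImagePairDataset(pairs), batch_size=8,
#                     shuffle=True, num_workers=4)
```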

@vivasvan1
Author

vivasvan1 commented Sep 18, 2020

Using pdb, I have found that after every call to

batch = batch_queue.get()

an extra 0.10-0.15% of RAM is consumed and never seems to be released.

(Pdb) print("| ram=",psutil.virtual_memory().available * 100 / psutil.virtual_memory().total)
 **ram= 28.66921519345203** 
(Pdb) n
> /home/mask/maskflownet/MaskFlownet/main.py(572)<module>()
-> loading_time.update(default_timer() - t0)
(Pdb) print("| ram=",psutil.virtual_memory().available * 100 / psutil.virtual_memory().total)
**ram= 28.542640291935687**

I cannot figure out why this is happening, but I am sure it is. Can you help me fix this, please?
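One way to narrow this down (my suggestion, not code from this repository) is Python's tracemalloc, which reports which source lines accumulate allocations across repeated batch_queue.get() calls. Note that it only sees allocations made through Python's allocator, so memory held by C extensions such as MXNet NDArrays will not show up. The queue below is a stand-in for the one in main.py:

```python
import queue
import tracemalloc

# Stand-in for the batch_queue in main.py, pre-filled with dummy batches.
batch_queue = queue.Queue()
for i in range(50):
    batch_queue.put([i] * 10_000)

tracemalloc.start()
before = tracemalloc.take_snapshot()

for _ in range(50):
    batch = batch_queue.get()
    # ... run one training step on `batch` here ...

after = tracemalloc.take_snapshot()

# Source lines whose allocations grew the most between the two snapshots.
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)
```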

@vivasvan1 vivasvan1 changed the title RAM leak problem when training. RAM leak problem when training >> batch_queue.get(). Where do you release resources once a batch is trained. Sep 18, 2020
@vivasvan1 vivasvan1 changed the title RAM leak problem when training >> batch_queue.get(). Where do you release resources once a batch is trained. RAM leak problem when training in command batch_queue.get(). Where do you release resources once a batch is trained. Sep 18, 2020
@simon1727
Contributor

Hi vivasvan1, thanks for pointing out this problem.

We import this Queue directly from the Python queue package, without any modification. I searched on Google and found that other people have encountered the same problem, so this may not be an issue with our code but with the Python queue package itself.
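A workaround sometimes suggested for growth like this, sketched below purely as an assumption (I have not verified how main.py structures its loop), is to bound the queue and drop the reference to each batch as soon as the step finishes, so consumed batches can be reclaimed promptly:

```python
import gc
import queue
import threading

batch_queue = queue.Queue(maxsize=4)      # bound the queue so the producer cannot race ahead

def producer():
    for i in range(20):
        batch_queue.put([i] * 1000)       # stand-in for a real batch
    batch_queue.put(None)                 # sentinel: no more batches

threading.Thread(target=producer, daemon=True).start()

while True:
    batch = batch_queue.get()
    if batch is None:
        break
    # ... train on `batch` ...
    del batch                             # drop the last reference once the step is done
    gc.collect()                          # optional: reclaim reference cycles immediately
```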
