RAM is increasing over time #484

Open
maytusp opened this issue Apr 28, 2021 · 8 comments

maytusp commented Apr 28, 2021

I have implemented ViZDoom tasks following "learning_pytorch.py" in the examples folder. I found that memory usage kept increasing as training epochs progressed. In later epochs, the "health gathering" task sometimes used 100% of memory and then crashed. Sometimes the memory stopped increasing at around 80-90% and training could run until the last epoch. How can I fix this issue?

Miffyli (Collaborator) commented Apr 28, 2021

The learning examples use Q-learning with a replay buffer, which uses a lot of memory, especially with images. You can try setting the replay_memory_size parameter to something lower, but this can make training less stable.
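For a rough sense of scale, here is a back-of-the-envelope sketch of how buffer capacity and frame shape drive memory use (the sizes below are illustrative assumptions, not the example's exact defaults):

```python
# Back-of-the-envelope estimate of replay buffer memory. The numbers below
# are illustrative assumptions, not the exact defaults of learning_pytorch.py.
import numpy as np

replay_memory_size = 10_000             # buffer capacity in transitions
channels, height, width = 3, 120, 160   # e.g. RGB frames at a larger resolution

bytes_per_frame = channels * height * width * np.dtype(np.float32).itemsize
buffer_bytes = replay_memory_size * 2 * bytes_per_frame  # state + next state
print(f"~{buffer_bytes / 1024**3:.1f} GiB")              # ~4.3 GiB for these numbers
```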

maytusp (Author) commented Apr 28, 2021

> The learning examples use Q-learning with a replay buffer, which uses a lot of memory, especially with images. You can try setting the replay_memory_size parameter to something lower, but this can make training less stable.

In that case, shouldn't memory already be full from the early epochs?

What I found is that memory usage stays the same in the early epochs, but increases sharply in the later epochs.

As in this figure: https://drive.google.com/file/d/1LXvwdr6g_5kLLlOW6tDWS2NQuX2vhjxQ/view?usp=sharing

Miffyli (Collaborator) commented Apr 28, 2021

> In that case, shouldn't memory already be full from the early epochs?

Not necessarily. Depending on how the buffer is created, it might slowly fill up as training progresses. That is what the example does, simply appending items to a Python list.
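As a sketch of that pattern (not the example's exact code), a buffer that appends transitions grows steadily until it hits its capacity; a `collections.deque` with `maxlen` caps it automatically:

```python
# Sketch of the append-until-full pattern: RAM climbs with every stored
# transition and flattens out once the buffer reaches its capacity.
# This mirrors the behaviour described above, not the example's exact code.
from collections import deque

replay_memory_size = 10_000                # assumed capacity
memory = deque(maxlen=replay_memory_size)  # old transitions are evicted at the cap

def store_transition(state, action, reward, next_state, done):
    # Each append grows memory use until maxlen is reached, then it plateaus.
    memory.append((state, action, reward, next_state, done))
```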

> What I found is that memory usage stays the same in the early epochs, but increases sharply in the later epochs.

Ah, that is not normal, and I see the problem! Can you provide the exact code you ran and your system information (Python version, library versions, ViZDoom version, etc.)? Is it the Python script that eats the memory, or the ZDoom process?

maytusp (Author) commented Apr 28, 2021

> Ah, that is not normal, and I see the problem! Can you provide the exact code you ran and your system information (Python version, library versions, ViZDoom version, etc.)? Is it the Python script that eats the memory, or the ZDoom process?

It's mostly the same as "learning_pytorch.py". Additionally, I changed the network architecture, the hyperparameters, the screen resolution, and the display format to RGB (from the original grayscale).

Here is my code: https://colab.research.google.com/drive/1OpUPvs7h2vrHWNduS2FWenmN87b1CgoI?usp=sharing

with
python 3.6.12
torch 1.7.1
vizdoom 1.1.8
scikit-image 0.14.5
numpy 1.19.4

Miffyli (Collaborator) commented Apr 28, 2021

I ran the code for 20 minutes on an Ubuntu 20.04 machine with Python 3.7, ViZDoom 1.1.8, and PyTorch 1.8.1, and memory usage maxed out at 9 GB for me (for the whole system), with no noticeable sudden spikes. I would still try reducing the image size (RGB plus a bigger resolution takes a lot of space; normally a gray image of size (64, 64) or so is enough even for complex ViZDoom scenarios).
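A minimal sketch of that suggestion, assuming the GRAY8 screen format and a (64, 64) target resolution (the config path and sizes below are placeholder choices):

```python
# Minimal sketch of the "small grayscale frames" suggestion. The config path,
# screen resolution, and (64, 64) target are placeholder choices.
import numpy as np
import skimage.transform
import vizdoom as vzd

game = vzd.DoomGame()
game.load_config("scenarios/health_gathering.cfg")  # path assumed
game.set_screen_format(vzd.ScreenFormat.GRAY8)      # single-channel frames
game.set_screen_resolution(vzd.ScreenResolution.RES_160X120)
game.init()

def preprocess(frame, resolution=(64, 64)):
    # Downscale before storing in the replay buffer; float32 keeps the frame
    # small while staying compatible with the PyTorch network.
    return skimage.transform.resize(frame, resolution).astype(np.float32)
```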

maytusp (Author) commented Apr 28, 2021

> I ran the code for 20 minutes on an Ubuntu 20.04 machine with Python 3.7, ViZDoom 1.1.8, and PyTorch 1.8.1, and memory usage maxed out at 9 GB for me (for the whole system), with no noticeable sudden spikes. I would still try reducing the image size (RGB plus a bigger resolution takes a lot of space; normally a gray image of size (64, 64) or so is enough even for complex ViZDoom scenarios).

I ran it on a machine with 64 GB of RAM. So there may be something wrong with the system or software, not my code?

Miffyli (Collaborator) commented Apr 28, 2021

> I ran it on a machine with 64 GB of RAM. So there may be something wrong with the system or software, not my code?

The sudden spike seems to indicate so. However, as I continue running it, memory use does increase, but this is to be expected with such large observations. Again, I recommend trying a reduced resolution and going back to gray images. I also recommend using an established RL library to run experiments (e.g. stable-baselines3), where the implementations are known to work.
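A sketch of the stable-baselines3 route, assuming a Gym-compatible wrapper around the scenario (the `VizdoomEnv` wrapper below is hypothetical, not something shipped with ViZDoom 1.1.8); DQN's `buffer_size` argument bounds replay memory explicitly:

```python
# Sketch of running DQN via stable-baselines3. `VizdoomEnv` is a hypothetical
# Gym-compatible wrapper around the scenario; it is not provided here.
from stable_baselines3 import DQN

env = VizdoomEnv("scenarios/health_gathering.cfg")  # hypothetical wrapper
model = DQN(
    "CnnPolicy",
    env,
    buffer_size=50_000,       # replay buffer capacity, bounds RAM use
    learning_starts=1_000,
    verbose=1,
)
model.learn(total_timesteps=100_000)
```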

Edit: Just after I wrote this, memory use spiked on my machine as well. It seems to come from the Python code (probably some tensors duplicated recursively). Sadly, I do not currently have time to fix this, but if you debug it we would be happy to accept a fix. I still recommend trying out established libraries if you wish to experiment with RL algorithms :)
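For reference, one common way tensors get "duplicated" like this in PyTorch training loops is storing or accumulating tensors that are still attached to the autograd graph; this is only a guess at the cause here, not a confirmed diagnosis:

```python
# A guess at the kind of leak suspected above (not a confirmed diagnosis):
# tensors kept while still attached to the autograd graph retain the whole
# graph, so RAM grows a little on every training step.
import torch

w = torch.randn(3, requires_grad=True)
loss = (w ** 2).sum()              # stand-in for a real training loss

running_loss = 0.0
# Leaky: `running_loss = running_loss + loss` would keep every step's graph alive.
running_loss += loss.item()        # safe: a plain Python float, no graph

q_values = w * 2.0                       # stand-in for a network output
stored = q_values.detach().cpu().numpy()  # safe to put in a replay buffer
```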

maytusp (Author) commented Apr 28, 2021

> I ran it on a machine with 64 GB of RAM. So there may be something wrong with the system or software, not my code?

> The sudden spike seems to indicate so. However, as I continue running it, memory use does increase, but this is to be expected with such large observations. Again, I recommend trying a reduced resolution and going back to gray images. I also recommend using an established RL library to run experiments (e.g. stable-baselines3), where the implementations are known to work.

Thank you very much; I will follow your suggestion.
I really appreciate your quick response.
