Free some resources after each step to avoid OOM #75

Open
wants to merge 1 commit into main

Conversation

andrewmoise

Fixes #74

@lucidrains
Owner

hmm, i think the GC should handle this?

@andrewmoise
Author

For some reason it seems like it doesn't until the variable goes out of scope - see https://pytorch.org/docs/stable/notes/faq.html ("If you assign a Tensor or Variable to a local, Python will not deallocate until the local goes out of scope.")

IDK whether casting and deleting the loss is necessary, as it made no difference in my run (I was just obeying the pytorch docs above). But, deleting the sample data after saving definitely caused my test runs to succeed where previously they were running out of GPU memory.

@andrewmoise
Author

(I should clarify - my understanding is that the resources will be freed by the GC when the variable is reassigned on the next step. We're not leaking significant memory on an ongoing basis. Just, we're consuming a constant amount of memory we don't need to be by keeping old local variables around with large storage on the GPU, after they're done being used but before they're reassigned on the next loop. For me that was enough to push the consumption over the edge and make the program quit because it had no more GPU memory.)
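
(A minimal sketch of the pattern being proposed, using a toy model and placeholder file names rather than this repo's actual training, sampling, or saving code:)

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(256, 256).to(device)              # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for step in range(10_000):
    batch = torch.randn(32, 256, device=device)           # stand-in for real training data
    loss = model(batch).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Keep only a plain Python float for logging, then drop the tensor
    # (the PyTorch FAQ note about locals staying alive until they go out of scope).
    loss_value = loss.item()
    del loss

    if step % 1000 == 0:
        with torch.no_grad():
            samples = model(torch.randn(16, 256, device=device))  # stand-in for sampling
        torch.save(samples.cpu(), f"sample-{step}.pt")
        # Free the sample tensors right after saving instead of keeping them
        # on the GPU until the local is rebound on a later iteration.
        del samples
        print(f"step {step}: loss {loss_value:.4f}")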

@pengzhangzhi
Contributor

> (I should clarify - my understanding is that the resources will be freed by the GC when the variable is reassigned on the next step. We're not leaking significant memory on an ongoing basis. Just, we're consuming a constant amount of memory we don't need to be by keeping old local variables around with large storage on the GPU, after they're done being used but before they're reassigned on the next loop. For me that was enough to push the consumption over the edge and make the program quit because it had no more GPU memory.)

May I ask what GC stands for?

@mgrachten

mgrachten commented Dec 14, 2022

I've seen this issue in the past, and I think what @andrewmoise proposes makes sense. The garbage collector (GC) doesn't handle this situation by itself. For example, in the following for loop, the computational graph is constructed by compute_loss(batch) and assigned to loss. In the next iteration, loss is still bound to the previous iteration's graph while compute_loss(batch) builds a new one; only once the new graph is fully constructed is loss rebound to it and the ref count of the old graph decreased. That means this code needs enough memory to hold two graphs simultaneously, unless you drop the reference to the old graph before computing the new one, e.g. by del loss at the end of each iteration. Only then can the GC get rid of the old graph.

for batch in loader:
    # While compute_loss builds the new graph, `loss` still references the
    # graph from the previous iteration.
    loss = compute_loss(batch)
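
(For illustration, the same loop with the fix described above, using the same placeholder names; the elided lines stand for the backward pass, optimizer step, logging, etc.:)

for batch in loader:
    loss = compute_loss(batch)
    ...
    # Drop the reference before compute_loss builds the next graph, so peak
    # memory only ever has to hold one graph at a time.
    del loss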

@VimukthiRandika1997

@lucidrains Is this resolved now?

Successfully merging this pull request may close these issues: Running out of memory