QuatE: GPU memory is not released per epoch #1351
Comments
I think this buffer should be unrelated (it is created only once, and it is pretty small, too). The default setting of QuatE (the full model configuration, not the interaction) uses a regularizer, and it seems as if you use a custom training loop, so my best guess would be that the regularization term keeps accumulating without being back-propagated; in this case, torch would not be able to release tensors from previous batches.
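Schematically, the suspected leak looks like this (the toy class below is hypothetical and only mirrors the accumulation pattern, not pykeen's actual regularizer code):

```python
class ToyRegularizer:
    """Hypothetical stand-in for a regularizer that stores one penalty term
    per forward pass; in torch, each stored term would also pin the autograd
    graph (and thus the GPU tensors) of its batch."""

    def __init__(self):
        self.terms = []

    def update(self, penalty):
        self.terms.append(penalty)

    def pop_regularization_term(self):
        # Summing and clearing the buffer is what lets the memory be freed.
        total = sum(self.terms)
        self.terms.clear()
        return total


reg = ToyRegularizer()
for _ in range(100):           # 100 batches, term never collected
    reg.update(0.01)
assert len(reg.terms) == 100   # all 100 terms (and their graphs) still held

reg.pop_regularization_term()  # draining once releases everything
assert len(reg.terms) == 0
```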
Thanks for your answer, and sorry for my late reply! I use the default QuatE model from the pykeen library. It looks like this:
Are you talking about the
I was talking about the two regularizers. You can either disable them explicitly:
```python
model = QuatE(
    ...,
    entity_regularizer=None,
    relation_regularizer=None,
)
```
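Alternatively, the accumulated term can be drained into the loss once per step. Here is a runnable schematic with a stub model; the method name `collect_regularization_term` mirrors pykeen's model API, but the stub itself is hypothetical and only illustrates the pattern:

```python
class StubModel:
    """Hypothetical stand-in for a pykeen model; only the pieces relevant
    to the leak are modelled here."""

    def __init__(self):
        self._reg_terms = []

    def score(self, batch):
        self._reg_terms.append(0.5)  # regularizer updates on every forward
        return -float(len(batch))    # dummy loss contribution

    def collect_regularization_term(self):
        total = sum(self._reg_terms)
        self._reg_terms.clear()      # <- this reset is what frees the memory
        return total


model = StubModel()
for epoch in range(3):
    for batch in ([1, 2], [3, 4]):
        loss = model.score(batch)
        # Without this line, the term buffer grows forever across batches:
        loss = loss + model.collect_regularization_term()
        # loss.backward(); optimizer.step() would follow in a real torch loop
    assert not model._reg_terms      # buffer stays bounded across epochs
```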
As background info:
Describe the bug
Hi,
I am training the KGE model QuatE on a CUDA device and I am running into a CUDA out-of-memory error after a few epochs. I have looked at the allocated memory at various points of the training loop. The allocated CUDA memory increases with each training batch and also with each epoch, so that a CUDA OOM occurs after a certain number of epochs.
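This kind of per-batch growth is easy to confirm with a memory probe. On GPU one would log `torch.cuda.memory_allocated()` after every batch; the sketch below uses the standard-library `tracemalloc` as a CPU stand-in so it runs anywhere, with a plain list simulating the un-released per-batch tensors:

```python
import tracemalloc


def run_epoch(retained, n_batches=50):
    """Log allocated memory after each simulated batch."""
    sizes = []
    tracemalloc.start()
    for _ in range(n_batches):
        retained.append([0.0] * 10_000)  # a per-batch buffer that is never freed
        current, _peak = tracemalloc.get_traced_memory()
        sizes.append(current)
    tracemalloc.stop()
    return sizes


sizes = run_epoch([])
assert sizes[-1] > sizes[0]  # memory grows batch over batch, as in the graphic
```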
Here is a graphic that visualises the problem.
I have also tested other KGE methods (BoxE, TransE, CrossE, ConvKB, RGCN, NTN) with the same code and have not found such problems with any of them. With them, the allocated memory remains constant per batch and epoch.
Do you have any hints where the problem comes from and how to fix it? I took a closer look at the QuatEInteraction and realised that a buffer called table is created. Could the problem perhaps lie here?

How to reproduce
The code to instantiate the model looks like this:
The code that I use for one epoch looks like this:
The batch_size is 256.

Environment
GPU Quadro P5000
Python 3.11.0
torch 2.1.1
pykeen 1.10.1
Additional information
No response
Issue Template Checks