Problems with CUDA out of memory #23

Open · YIKMAT opened this issue Feb 18, 2022 · 7 comments

@YIKMAT commented Feb 18, 2022

I attempted to train the model using bash run/run_experiment.sh configs/amr2.0-structured-bart-large-sep-voc.sh, and it looks like my 12GB 2080Ti GPU doesn't have enough memory.
In fact, I have two 12GB 2080Ti GPUs on the server, but only one of them is used during training.
Does the code support multi-GPU training? Is there anything else I need to modify?

@YIKMAT (Author) commented Feb 18, 2022

Hi, here are the details:
[screenshot: gpu0]
[screenshot: gpu1]

@ramon-astudillo (Member) commented

The code is single-GPU; you can configure gradient accumulation to compensate (see $update_freq in the configs).
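For reference, a minimal PyTorch sketch of what gradient accumulation of this kind does (the behavior the $update_freq variable controls); the model, optimizer, and data below are toy stand-ins for illustration, not the parser's actual training code:

```python
import torch
from torch import nn

# Toy model/optimizer/data, purely to illustrate the accumulation loop.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(16)]

update_freq = 4  # accumulate 4 mini-batches -> 4x effective batch size
optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
    loss = nn.functional.mse_loss(model(x), y)
    (loss / update_freq).backward()  # scale so the accumulated grad is an average
    if (i + 1) % update_freq == 0:
        optimizer.step()             # one optimizer step per update_freq batches
        optimizer.zero_grad()
```

This lets a small GPU simulate a larger batch size at the cost of more wall-clock time per update, which is why it is the usual workaround for out-of-memory errors on a single card.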

@AngledLuffa commented

I had tried reimplementing a general Stack Transformer, and I found that on long sequences the memory cost of the stack got quite expensive. The softmax used to compute attention in particular leads to quadratic growth, as each successive softmax is over a longer prefix and needs to be kept until the optimizer step. Did you find a way to solve that, or is it related to the memory problems in this issue?

@ramon-astudillo (Member) commented

Sorry for the delay. I do not understand the question: the stack-Transformer masks the attention of a normal transformer, and as such it has no additional cost beyond computing the mask.
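A minimal sketch of attention masking of the kind described, where the structural information only changes which positions a head may attend to; the mask contents here are arbitrary, made up for illustration, not the stack-Transformer's actual masking rules:

```python
import torch
import torch.nn.functional as F

# Toy single-head attention over 5 positions. The only structural change
# is the mask; the softmax and matmuls are those of a vanilla transformer.
T, d = 5, 8
q, k, v = (torch.randn(T, d) for _ in range(3))

scores = q @ k.t() / d ** 0.5               # (T, T) attention logits

# Hypothetical mask: True = position may not be attended to. In the real
# model this would encode stack/buffer membership; here it is random.
mask = torch.rand(T, T) > 0.5
mask.fill_diagonal_(False)                  # never mask a row out entirely
scores = scores.masked_fill(mask, float("-inf"))

attn = F.softmax(scores, dim=-1)            # same softmax as an unmasked model
out = attn @ v                              # masking creates no extra activations
```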

@AngledLuffa commented

I meant that backprop through a long sequence can get prohibitively expensive. When the entire sequence is kept, the softmax terms get longer and longer, and the early ones are retained until the end of the sequence unless you backprop at each time step, so the total memory cost winds up being quadratic.
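A small sketch of the effect being described, assuming a decoder that attends over the full prefix at every step and defers backward() to the end of the sequence; the sizes are illustrative:

```python
import torch
import torch.nn.functional as F

# At step t the softmax is over a prefix of length t; its activation is
# retained for backprop, so retained entries total 1 + 2 + ... + T = O(T^2).
T, d = 512, 64
k = torch.randn(T, d, requires_grad=True)
q = torch.randn(T, d)

losses, retained = [], 0
for t in range(1, T + 1):
    scores = q[t - 1] @ k[:t].t() / d ** 0.5  # attend over the length-t prefix
    attn = F.softmax(scores, dim=-1)          # this activation stays alive...
    losses.append(attn.sum())
    retained += t                             # ...until backward() below

torch.stack(losses).sum().backward()  # only now are all T*(T+1)/2 entries freed
print(f"softmax entries retained before backward: {retained}")  # 512*513/2 = 131328
```

Backpropagating per time step (or truncating the history) trades that quadratic retention for repeated, shorter backward passes.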

@ramon-astudillo (Member) commented

But that is a property of the Transformer, not of the stack-Transformer; that was my point. In that regard they are equal.

@AngledLuffa commented

It is true that in the use case I tried it for (a transition-based constituency parser), the transition sequences wound up being substantially longer than the sentences themselves, and therefore the memory usage might be much higher than for a transformer operating at the word-input level.
