Problems with CUDA out of memory #23

Open · YIKMAT opened this issue Feb 18, 2022 · 7 comments

@YIKMAT commented Feb 18, 2022

I attempted to train the model using bash run/run_experiment.sh configs/amr2.0-structured-bart-large-sep-voc.sh, and it looks like my 12GB 2080Ti GPU doesn't have enough memory.
In fact, I have two 12GB 2080Ti GPUs on the server, but only one of them is used during training.
Does the code support multi-GPU training? Is there anything else I need to modify?

@YIKMAT (Author) commented Feb 18, 2022

Hi, here are the details:
[screenshot: gpu0]
[screenshot: gpu1]

@ramon-astudillo (Member) commented

The code is single-GPU; you can configure gradient accumulation to compensate (see $update_freq in the configs).
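For reference, a minimal PyTorch sketch of what gradient accumulation of this kind does (the behavior the $update_freq variable controls); the model, optimizer, and data below are toy stand-ins for illustration, not the parser's actual training code:

```python
import torch
from torch import nn

# Toy model/optimizer/data, purely to illustrate the accumulation loop.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(16)]

update_freq = 4  # accumulate 4 mini-batches -> 4x effective batch size
optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
    loss = nn.functional.mse_loss(model(x), y)
    (loss / update_freq).backward()  # scale so the accumulated grad is an average
    if (i + 1) % update_freq == 0:
        optimizer.step()             # one optimizer step per update_freq batches
        optimizer.zero_grad()
```

This lets a small GPU simulate a larger batch size at the cost of more wall-clock time per update, which is why it is the usual workaround for out-of-memory errors on a single card.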

@AngledLuffa commented

I had tried reimplementing a general Stack Transformer, and I found that on long sequences the memory cost of the stack got quite expensive. The softmax used to compute attention in particular leads to quadratic growth, as each successive softmax is over a longer prefix and needs to be kept until the optimizer step. Did you find a way to solve that, or is it related to the memory problems in this issue?

@ramon-astudillo (Member) commented

Sorry for the delay. I do not understand the question: the stack-Transformer masks the attention of a normal transformer, and as such it has no additional cost beyond computing the mask.
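A minimal sketch of attention masking of the kind described, where the structural information only changes which positions a head may attend to; the mask contents here are arbitrary, made up for illustration, not the stack-Transformer's actual masking rules:

```python
import torch
import torch.nn.functional as F

# Toy single-head attention over 5 positions. The only structural change
# is the mask; the softmax and matmuls are those of a vanilla transformer.
T, d = 5, 8
q, k, v = (torch.randn(T, d) for _ in range(3))

scores = q @ k.t() / d ** 0.5               # (T, T) attention logits

# Hypothetical mask: True = position may not be attended to. In the real
# model this would encode stack/buffer membership; here it is random.
mask = torch.rand(T, T) > 0.5
mask.fill_diagonal_(False)                  # never mask a row out entirely
scores = scores.masked_fill(mask, float("-inf"))

attn = F.softmax(scores, dim=-1)            # same softmax as an unmasked model
out = attn @ v                              # masking creates no extra activations
```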

@AngledLuffa commented

I meant that backprop through a long sequence can get prohibitively expensive. When the entire sequence is kept, the softmax terms get longer and longer, and the early ones are retained until the end of the sequence unless you backprop at each time step, so the total memory cost winds up being quadratic.
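A small sketch of the effect being described, assuming a decoder that attends over the full prefix at every step and defers backward() to the end of the sequence; the sizes are illustrative:

```python
import torch
import torch.nn.functional as F

# At step t the softmax is over a prefix of length t; its activation is
# retained for backprop, so retained entries total 1 + 2 + ... + T = O(T^2).
T, d = 512, 64
k = torch.randn(T, d, requires_grad=True)
q = torch.randn(T, d)

losses, retained = [], 0
for t in range(1, T + 1):
    scores = q[t - 1] @ k[:t].t() / d ** 0.5  # attend over the length-t prefix
    attn = F.softmax(scores, dim=-1)          # this activation stays alive...
    losses.append(attn.sum())
    retained += t                             # ...until backward() below

torch.stack(losses).sum().backward()  # only now are all T*(T+1)/2 entries freed
print(f"softmax entries retained before backward: {retained}")  # 512*513/2 = 131328
```

Backpropagating per time step (or truncating the history) trades that quadratic retention for repeated, shorter backward passes.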

@ramon-astudillo (Member) commented

But that is a property of the Transformer, not of the stack-Transformer; that was my point. In that regard they are equal.

@AngledLuffa commented

It is true that in the use case I tried it for (a transition-based constituency parser), the transition sequences wound up being substantially longer than the sentences themselves, and therefore the memory usage might be much higher than for a transformer operating at the word-input level.
