RuntimeError: CUDA out of memory. #101

Deerzh · 2022-08-02T11:21:08Z

python train.py --pretrained --model_checkpoint thu-coai/CDial-GPT_LCCC-large --data_path data/STC.json --scheduler linear
Hello, my memory is clearly sufficient, so why does it still report this error? I even changed batch_size to 1.
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.73 GiB total capacity; 904.23 MiB already allocated; 26.38 MiB free; 1020.00 MiB reserved in total by PyTorch)
Epoch: [63/4391266] 0%| , loss=0.0535, lr=5e-5 [00:09<174:20:29
It stops at 63 every time. What does 4391266 mean? Can this number be reduced?
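
The figures in the error message can be checked with a little arithmetic. The sketch below is plain Python with a hypothetical helper, `parse_oom_message`, that extracts the sizes and computes how much of the card is not accounted for by this process's PyTorch allocator. It assumes the message format shown above, and it says nothing about what is holding the remaining memory (other processes on the same GPU are one possibility).

```python
import re

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def parse_oom_message(message):
    """Return the sizes found in a CUDA OOM message, in MiB (hypothetical helper)."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            sizes[name] = float(match.group(1)) * UNIT_TO_MIB[match.group(2)]
    return sizes


message = (
    "RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB "
    "(GPU 0; 10.73 GiB total capacity; 904.23 MiB already allocated; "
    "26.38 MiB free; 1020.00 MiB reserved in total by PyTorch)"
)
sizes = parse_oom_message(message)

# Memory that is neither free nor reserved by this process's PyTorch allocator.
outside_pytorch = sizes["total"] - sizes["reserved"] - sizes["free"]
print(f"Requested: {sizes['requested']:.2f} MiB, free: {sizes['free']:.2f} MiB")
print(f"Not held by this process's PyTorch allocator: {outside_pytorch:.0f} MiB")
```

With these numbers the request (20.00 MiB) does fit in the free memory (26.38 MiB), so the allocation presumably failed because that free memory was not available as one contiguous block; roughly 9.7 GiB of the card is in use outside this process's PyTorch allocator.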

ai408 · 2022-09-02T11:28:19Z

I ran into this problem too. Has it been solved?

Deerzh · 2022-09-02T11:34:28Z

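> I ran into this problem too. Has it been solved?

Not yet, friend, I haven't solved it either 😂. My guess is that the dataset is too large. Shrinking it would probably help, but I don't know how to do that.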

ai408 · 2022-09-02T11:36:49Z

I'm trying this one instead: https://github.com/thu-coai/EVA

ai408 · 2022-09-02T11:38:07Z

The dataset I'm using isn't very large.

Deerzh · 2022-09-02T11:40:57Z

Then I'm not sure. The author may need to look into it.

ai408 · 2022-09-02T11:43:45Z

It looks like EVA needs even more GPU memory.

ai408 · 2022-09-02T11:58:40Z

Setting num_workers to 1 fixed it.

silverriver · 2022-09-03T08:51:53Z

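Hello, your GPU's memory may be on the small side. With longer sequences, an OOM can occur because too many activations have to be stored. You could cap the maximum sequence length used during training, or switch to a GPU with more memory.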
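
Capping the sequence length could look roughly like the following. This is a minimal sketch, not the repository's code: `truncate_dialogue` and `max_length` are hypothetical names, and it assumes each training example is a flat list of token ids in which the most recent turns come last.

```python
def truncate_dialogue(token_ids, max_length):
    """Keep at most max_length token ids, dropping the oldest ones first."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    # Slicing from the end keeps the latest context and the response.
    return token_ids[-max_length:]


example = list(range(10))  # stand-in for a tokenized dialogue
print(truncate_dialogue(example, 4))
```

A real pipeline would also need to preserve any special tokens the model expects at the start of the input, which this sketch does not handle.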

silverriver · 2022-09-03T08:53:19Z

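> Setting num_workers to 1 fixed it.

num_workers is a parameter of PyTorch's DataLoader that controls how many CPU processes are used to load data. Its value does not affect the model's GPU memory usage.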

Deerzh · 2022-09-06T13:33:42Z

Hello, how can I reduce the number of epochs? I changed --n_epochs to 1 in train.py, so why is the number still this large when I run it?
Epoch: [1709/2195633] 0%| , loss=0.0528, lr=5e-5 [01:48<38:44:15

ai408 · 2022-09-07T04:57:13Z

I think that number is changed through the batch size.
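
The two progress bars in this thread are consistent with the total being the number of batches per epoch rather than the number of epochs: 4391266 at batch size 1 is exactly twice 2195633. A small sketch of that arithmetic follows, assuming the total is the dataset size divided by the batch size, rounded up. `steps_per_epoch` is a hypothetical helper, and the batch size of 2 for the second run is inferred, not stated in the thread.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every example once."""
    return math.ceil(num_examples / batch_size)


num_examples = 4391266  # progress-bar total reported with batch size 1
for batch_size in (1, 2, 8):
    print(batch_size, steps_per_epoch(num_examples, batch_size))
```

Under this reading, --n_epochs would only control how many times that total is repeated, which would explain why lowering it does not shrink the number shown.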

chenjh880730 · 2022-12-22T07:44:51Z

I get the same out-of-memory error on a Tesla V100. If you're on a tight budget, better not to use this.
