RuntimeError: CUDA out of memory. #101

Open
Deerzh opened this issue Aug 2, 2022 · 12 comments

Comments

@Deerzh

Deerzh commented Aug 2, 2022

python train.py --pretrained --model_checkpoint thu-coai/CDial-GPT_LCCC-large --data_path data/STC.json --scheduler linear
Hi, my GPU memory is clearly sufficient, so why am I still getting this error? I have also set batch_size to 1.
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.73 GiB total capacity; 904.23 MiB already allocated; 26.38 MiB free; 1020.00 MiB reserved in total by PyTorch)
Epoch: [63/4391266] 0%| , loss=0.0535, lr=5e-5 [00:09<174:20:29
It stops at 63 every time. What does 4391266 mean? Can this number be reduced?
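
One possible reading of the numbers in the error message, as a rough sketch: assuming "free" is the device-wide free memory and "reserved in total by PyTorch" covers only this training process, the remainder is memory held by something other than this process's PyTorch allocator (for example another job on the same GPU). The variable names below are illustrative only.

```python
# Figures copied from the error message above.
MIB_PER_GIB = 1024

total_mib = 10.73 * MIB_PER_GIB  # total capacity of GPU 0
reserved_mib = 1020.00           # reserved in total by PyTorch (this process)
free_mib = 26.38                 # free memory reported for the device

# Assumption: whatever is neither free nor reserved by this process
# is held elsewhere (other processes, CUDA context overhead).
other_mib = total_mib - reserved_mib - free_mib
other_gib = other_mib / MIB_PER_GIB

print(f"{other_mib:.2f} MiB (about {other_gib:.2f} GiB) held outside this process")
```

If that reading is right, most of the card was already occupied before training started, and running `nvidia-smi` would show which processes hold it.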

@ai408

ai408 commented Sep 2, 2022

I ran into this problem too. Has it been solved?

@Deerzh
Author

Deerzh commented Sep 2, 2022

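> I ran into this problem too. Has it been solved?
Bro, I haven't solved it yet either 😂. My guess is that the dataset is too large. Shrinking the dataset would probably fix it, but I don't know how to shrink it.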
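
A minimal sketch of one way to shrink a dataset, assuming the JSON file loads as a dict that maps split names to lists of dialogues. That layout is an assumption and was not checked against data/STC.json. The function name `subsample_splits` and the toy data are hypothetical.

```python
import random

def subsample_splits(data, max_items, seed=0):
    """Return a copy of `data` with each split cut to at most `max_items` entries."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    small = {}
    for split, items in data.items():
        if len(items) > max_items:
            small[split] = rng.sample(items, max_items)
        else:
            small[split] = list(items)
    return small

# Toy stand-in for the real file contents (assumed layout).
toy = {"train": [["hi", "hello"]] * 10, "valid": [["a", "b"]] * 3}
small = subsample_splits(toy, max_items=5)
print({name: len(items) for name, items in small.items()})
```

With a real file, the dict would come from `json.load`, and the result could be written to a new file with `json.dump(..., ensure_ascii=False)` and passed via `--data_path`. Note that a smaller dataset shortens each epoch but does not by itself lower per-batch GPU memory.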

@ai408

ai408 commented Sep 2, 2022

I'm trying this: https://github.com/thu-coai/EVA

@ai408

ai408 commented Sep 2, 2022

The amount of data I'm using isn't very large.

@Deerzh
Author

Deerzh commented Sep 2, 2022

Then I'm not sure. The author may need to look into it.

@ai408

ai408 commented Sep 2, 2022

EVA seems to require even more GPU memory.

@ai408

ai408 commented Sep 2, 2022

Setting num_workers to 1 fixed it.

@silverriver
Collaborator

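Hello, the memory on your GPU may be a bit small. With longer sequences, OOM can occur because too many activations have to be stored. You could consider capping the maximum sequence length during training, or switching to a GPU with more memory.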
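
A minimal sketch of capping the sequence length, assuming each training example is a flat list of token ids. The helper `truncate_ids` and the limit of 512 are hypothetical and are not taken from this repository's train.py. Keeping the last tokens is one choice among several, made here so the most recent dialogue turns survive.

```python
def truncate_ids(token_ids, max_len):
    """Keep at most the last `max_len` token ids of a sequence."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return token_ids[-max_len:]

# Toy batch: one over-long sequence and one short sequence.
batch = [list(range(600)), list(range(40))]
capped = [truncate_ids(seq, max_len=512) for seq in batch]
print([len(seq) for seq in capped])
```

Sequences already shorter than the limit pass through unchanged, so only the long outliers that drive peak activation memory are affected.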

@silverriver
Collaborator

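> Setting num_workers to 1 fixed it.

num_workers is a parameter of PyTorch's DataLoader that controls how many CPU processes are used to load data. Its value does not affect the model's GPU memory usage.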

@Deerzh
Author

Deerzh commented Sep 6, 2022

Hi, how do I reduce the epoch count? I changed --n_epochs to 1 in train.py, so why is the number still so large when I run it?
Epoch: [1709/2195633] 0%| , loss=0.0528, lr=5e-5 [01:48<38:44:15

@ai408

ai408 commented Sep 7, 2022

For that, you should probably change the batch size.
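
The two progress bars in this thread are consistent with the total being the number of batches per epoch rather than the number of epochs. A rough sketch under that assumption: 4391266 is taken as the sample count because the first run used batch_size 1, and the second total, 2195633, would then correspond to a batch size of 2. The helper `steps_per_epoch` is hypothetical.

```python
import math

def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to cover the dataset once."""
    return math.ceil(num_samples / batch_size)

num_samples = 4391266  # total in the first progress bar (batch_size was 1)

for batch_size in (1, 2, 8, 32):
    print(batch_size, steps_per_epoch(num_samples, batch_size))
```

Under this reading, --n_epochs sets how many passes are made over the data, while the number inside the brackets shrinks only with a larger batch size or a smaller dataset.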

@chenjh880730

I get the same out of memory error on a Tesla V100. Those of us on a tight budget are better off not using this.
