CKBert continued pretraining runs out of memory #349

Open
rainfallLLF opened this issue Feb 17, 2024 · 0 comments

Comments

rainfallLLF commented Feb 17, 2024

I am continuing pretraining of CKBert on my own domain corpus. With a large corpus (12 GB), the machine automatically reboots once training has been running for a while; with a small corpus (2 GB) the problem does not occur.

So I watched memory usage during training and found that memory consumption grows steadily as training progresses, until all host memory is eventually exhausted.
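For reference, this is a minimal sketch of how I watch resident memory during training (a psutil-based helper with hypothetical names, not part of EasyNLP):

```python
# Hypothetical monitoring helper (not part of EasyNLP): log the process's
# resident set size (RSS) every `every` steps to see whether host memory grows.
import os
import psutil

def log_rss(step, every=100):
    if step % every == 0:
        rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
        print(f"step {step}: RSS = {rss_gb:.2f} GB")
```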

Has anyone run into the same problem? Any reply would be greatly appreciated!

Here are my training parameters:

```bash
export CUDA_VISIBLE_DEVICES=0,1

gpu_number=2
negative_e_number=4
negative_e_length=16

python -m torch.distributed.launch --nproc_per_node=$gpu_number \
  --master_port=52349 \
  $base_dir/main.py \
  --mode=train \
  --worker_gpu=$gpu_number \
  --tables=$local_train_file, \
  --learning_rate=1e-3 \
  --epoch_num=1 \
  --logging_steps=100 \
  --save_checkpoint_steps=1000 \
  --sequence_length=512 \
  --train_batch_size=4 \
  --checkpoint_dir=$checkpoint_dir \
  --app_name=language_modeling \
  --use_amp \
  --save_all_checkpoints \
  --user_defined_parameters="pretrain_model_name_or_path=alibaba-pai/pai-ck_bert-base-zh external_mask_flag=True contrast_learning_flag=True negative_e_number=${negative_e_number} negative_e_length=${negative_e_length} kg_path=${local_kg}"
```
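For comparison, here is a generic PyTorch sketch (an illustration of how a lazily-read dataset behaves, not EasyNLP's actual loader): streaming the corpus line by line keeps resident memory roughly flat regardless of file size, which is the behavior I would expect with a 12 GB corpus.

```python
# Generic illustration (not EasyNLP's loader): an IterableDataset that streams
# lines from disk, so the full corpus never has to be held in memory at once.
from torch.utils.data import DataLoader, IterableDataset

class StreamingTextDataset(IterableDataset):
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, "r", encoding="utf-8") as f:
            for line in f:
                yield line.rstrip("\n")

# Hypothetical file name, used only for illustration.
loader = DataLoader(StreamingTextDataset("train.txt"), batch_size=4)
```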
