How to determine the --batch_size when training from scratch? #848
Answered by rom1504
lishuai-97 asked this question in Q&A
Replies: 1 comment 2 replies
-
Usually a total batch size of at least 32k is needed to get good results
with CLIP. Increasing it up to 64k sometimes helps.
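The 32k figure refers to the global batch size: per-GPU batch × number of GPUs × gradient-accumulation steps. A minimal sketch of that arithmetic using the numbers from the question below (the helper function names are mine, for illustration only, not part of open_clip):

```python
# Sketch: how close does a given setup get to the ~32k global batch size
# suggested for CLIP, and how much gradient accumulation would close the gap?

def global_batch_size(per_gpu_batch: int, num_gpus: int, accum_steps: int = 1) -> int:
    """Effective batch size seen by the contrastive loss per optimizer step."""
    return per_gpu_batch * num_gpus * accum_steps

def accum_steps_for_target(per_gpu_batch: int, num_gpus: int, target: int = 32_768) -> int:
    """Smallest accumulation factor that reaches the target global batch size."""
    per_step = per_gpu_batch * num_gpus
    return -(-target // per_step)  # ceiling division

# The setup from the question: 8 GPUs at --batch-size 352.
print(global_batch_size(352, 8))        # 2816 -- far below 32k
# Raising the per-GPU batch to 2048 (reported to fit in ~20 GB on a 4090):
print(global_batch_size(2048, 8))       # 16384 -- still short
print(accum_steps_for_target(2048, 8))  # 2 accumulation steps reach 32768
```

This is why a per-GPU batch that comfortably fits in VRAM can still be far too small in total: the contrastive loss benefits from many in-batch negatives, which only the global batch provides.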
On Thu, Mar 28, 2024, 9:03 AM, ShuaiLi wrote:
How do you determine the --batch_size of the ViT-B-32 model for LAION-2B? For
example, this run
<https://wandb.ai/rom1504/open-clip/runs/2xpw65p8/overview?workspace=>
uses --batch-size 352 when training on 8 NVIDIA A100-SXM4-40GB GPUs.
However, when I was training a ViT-B-32 model from scratch on 8 x
4090 GPUs using the CC3M dataset (downloaded from
<https://huggingface.co/datasets/pixparse/cc3m-wds> with the datasets
package) with the script below, I found that a batch size of 352 only
occupied 7-8 GB of VRAM on each GPU. Furthermore, the batch size could be
set as high as 2048, which consumed roughly 20 GB of VRAM per GPU. However,
in the provided wandb settings, you used A100 GPUs with 40 GB of VRAM, yet your
batch size was only 352, which seems odd. Apart from the difference in
datasets, all my other parameters match your wandb
hyperparameters. So I would like to know: is there any unmentioned
requirement for the per-GPU batch size when training from scratch? Or did
I miss some other parameter settings?
# Single-Node
torchrun --nproc_per_node 8 -m training.main \
--save-frequency 10 \
--train-data 'data/cc3m/cc3m-train-{0000..0575}.tar::data/cc3m/cc3m-validation-{0000..0015}.tar' \
--train-num-samples 135646078 \
--dataset-type webdataset \
--precision amp_bf16 \
--warmup 5000 \
--batch-size 352 \
--epochs 150 \
--dataset-resampled \
--lr 2e-3 \
--beta1 0.9 \
--beta2 0.99 \
--lr-scheduler cosine \
--wd 0.2 \
--force-patch-dropout 0.5 \
--report-to tensorboard \
--workers 4 \
--model ViT-B-32 \
--name "ViT-B-32-Vanilla-1" \
--log-every-n-steps 1 \
--seed 0 \
--ddp-static-graph \
--local-loss \
--gather-with-grad \
--grad-checkpointing
@mitchellnw @rom1504 @rwightman
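If the goal is to reach the ~32k global batch size mentioned in the answer on 8 GPUs, gradient accumulation is one route; open_clip exposes an `--accum-freq` flag for this in recent versions (worth verifying against your checkout). A hedged sketch of the adjusted flags, not a verified command:

```shell
# Hypothetical adjustment, not from the original thread: with 8 GPUs,
# --batch-size 2048 and --accum-freq 2 give 2048 * 8 * 2 = 32768 samples
# per optimizer step, matching the ~32k recommendation.
torchrun --nproc_per_node 8 -m training.main \
  --batch-size 2048 \
  --accum-freq 2 \
  --grad-checkpointing \
  --local-loss \
  --gather-with-grad
# ...remaining flags as in the original script
```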
Answer selected by lishuai-97