Train a custom CLIP with DeepSpeed CPU offload, 16 bit precision #388

Draft
wants to merge 1 commit into main

Conversation

@afiaka87 (Contributor) commented Nov 20, 2021

(Disclaimer) This is code for training a custom CLIP from this repository, not the one in the OpenAI repo. For something like that, I recommend open_clip. There are valid concerns about the effectiveness of a CLIP trained with a low batch size, since the contrastive retrieval task has far less context to work with. Food for thought.

There's plenty left to do to make this as robust as the other training scripts, but if you have DeepSpeed working, this should work now with far fewer caveats than DALL-E. I trained a small CLIP last night on COCO using 16-bit precision, DeepSpeed stage 3, and CPU offload for both the parameters and the optimizer. I haven't done many rigorous comparisons, but I was able to actually use my computer while training thanks to CPU offload, which was refreshing.
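For reference, the combination being exercised here corresponds roughly to a DeepSpeed config like the sketch below. This is not the exact config train_clip.py builds (that is assembled inside the script); the keys are standard DeepSpeed config fields, and the batch size mirrors the launch script further down.

# Rough sketch only - ZeRO stage 3 with CPU offload for both parameters and
# optimizer state, plus fp16. Handed to deepspeed.initialize(..., config_params=...).
deepspeed_config = {
    "train_batch_size": 128,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu"},
        "offload_optimizer": {"device": "cpu"},
    },
}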

Weights & Biases workspace:
https://wandb.ai/dalle-pytorch-replicate/dalle_train_clip_report

I'll most likely be busy over the holidays, so I won't have time to implement everything else, but it's mostly just copying the work done in previous contributions to train_dalle.py/train_vae.py. I suspect @janEbert was responsible for ensuring external parameters were flagged for DeepSpeed in @lucidrains' CLIP implementation?

There are likely to be errors as well, and there are probably a few things missing relative to the CLIP paper. I think they clamped their learnable logit scale - not sure if we're doing that.
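If I recall correctly, the paper clips the learnable temperature so that the logits are never scaled by more than 100. A minimal sketch of that clamp, with illustrative names (this is not code from this PR, and it assumes the embeddings are already L2-normalized):

import math
import torch
from torch import nn

# Learnable log-temperature as described in the CLIP paper, initialized to
# ln(1/0.07) and clamped so the multiplier never exceeds 100.
logit_scale = nn.Parameter(torch.ones([]) * math.log(1 / 0.07))

def scaled_logits(image_emb, text_emb):
    # clamp in log space: exp(ln(100)) == 100 is the maximum scale
    scale = logit_scale.clamp(max=math.log(100.0)).exp()
    return scale * image_emb @ text_emb.t()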

To run with DeepSpeed, bite the bullet and set up a Docker container targeting pytorch=1.7.1, cuda=10.2. Conda works too - just make sure you set python=3.7, as there are issues with >3.7. There's no guarantee that the fused operations will run on any particular GPU, even with a Docker container; the only officially supported GPUs are the V100 and A100. If you see an error about failed JIT compilation, that may be the reason.
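A quick sanity check before launching, assuming the version pins above (just a sketch; adjust to your setup):

import sys
import torch

# Expect Python 3.7 and torch 1.7.1 built against CUDA 10.2, per the note above.
print("python:", sys.version.split()[0])
print("torch:", torch.__version__, "| cuda:", torch.version.cuda)
if torch.cuda.is_available():
    # Fused ops are only officially supported on V100/A100-class GPUs;
    # other cards may hit JIT compilation errors for the fused kernels.
    print("gpu:", torch.cuda.get_device_name(0),
          "| capability:", torch.cuda.get_device_capability(0))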

run_train_clip.sh

#!/bin/bash

deepspeed train_clip.py --dataset_dir=/mnt/evo_internal_1TB/DATASETS/COCO \
    --epochs=200 \
    --batch_size=128 \
    --learning_rate=0.004 \
    --clip_grad_norm=1.0 \
    --resize_ratio=0.8 \
    --truncate_captions=True \
    --save_every_n_steps=1000 \
    --log_frequency=10 \
    --clip_output_file_name=clip_latest.pt \
    --dim_text=128 \
    --dim_image=128 \
    --dim_latent=256 \
    --text_enc_depth=6 \
    --text_seq_len=128 \
    --text_heads=8 \
    --num_visual_tokens=256 \
    --visual_enc_depth=6 \
    --visual_heads=8 \
    --visual_image_size=128 \
    --visual_patch_size=16 \
    --channels=3 \
    --num_workers=24 \
    --fp16=True \
    --distributed_backend=deepspeed

After training has finished, you can create a 32-bit PyTorch checkpoint from the checkpoint directory:

cd checkpoints
cp globalstep_99999/convert_to_fp32.py .  # pick the desired global step, usually the largest
python convert_to_fp32.py globalstep_99999 my_normal_pytorch_ckpt.bin
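For completeness, loading the converted checkpoint back into a CLIP built with the same hyperparameters as the launch script should look roughly like this (a sketch; depending on how the state dict was written you may need to unwrap a key or strip a module. prefix first):

import torch
from dalle_pytorch import CLIP

# Rebuild the model with the hyperparameters used in run_train_clip.sh above.
# num_text_tokens is left at its default here; it should match the tokenizer
# vocabulary actually used during training.
clip = CLIP(
    dim_text = 128, dim_image = 128, dim_latent = 256,
    text_enc_depth = 6, text_seq_len = 128, text_heads = 8,
    num_visual_tokens = 256, visual_enc_depth = 6, visual_heads = 8,
    visual_image_size = 128, visual_patch_size = 16, channels = 3,
)

state_dict = torch.load("my_normal_pytorch_ckpt.bin", map_location = "cpu")
clip.load_state_dict(state_dict)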

(custom_clip) create train_clip.py - image text folder loader, deepspeed stage 3 round robin/gradient accumulate/cpu offload, 16-bit precision, WarmupLRDecay init, wandb logging, argparsing
@afiaka87 changed the title from "(custom_clip) create train_clip.py - image text folder loader, deepsp…" to "Train a DALLE-pytorch CLIP with CPU offload, 16 bit precision" on Nov 20, 2021
@afiaka87 changed the title from "Train a DALLE-pytorch CLIP with CPU offload, 16 bit precision" to "Train a custom CLIP with DeepSpeed CPU offload, 16 bit precision" on Nov 20, 2021
@janEbert (Contributor)

Hi! About the external parameters, I looked through the model and don't think anything needs to be registered, so that should be all good. :)
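For anyone who stumbles on this later: the mechanism in question is deepspeed.zero.register_external_parameter, which is only needed when one module's forward reads a parameter owned by a different module under ZeRO stage 3. A sketch of the pattern (not something this model needs, per the above):

import torch
from torch import nn
import deepspeed

class TiedHead(nn.Module):
    # Sketch: the parent's forward reads the embedding's weight directly
    # (outside the embedding's own forward), so under ZeRO stage 3 that
    # parameter must be registered as an external parameter.
    def __init__(self, dim, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.norm = nn.LayerNorm(dim)
        deepspeed.zero.register_external_parameter(self, self.embed.weight)

    def forward(self, x):
        return self.norm(x) @ self.embed.weight.t()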
