# Transfer Learning Code

Using the excellent timm package, the results from the article can be reproduced almost completely. Specifically, timm makes it possible to compare the official pretraining and the miil pretraining of ViT and Mixer models, and to validate the improvement in transfer-learning results. The comparison also shows how miil pretraining stabilizes transfer learning and makes the results far less susceptible to hyper-parameter selection.
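
As a minimal sketch (not taken from the repository's training script), both pretrained backbones can be instantiated directly through timm, using the model names that appear in the results below:

```python
import torch
import timm

# MLP-Mixer backbone with the miil ImageNet-21K pretrained weights,
# with the classifier head replaced for CIFAR-100 (100 classes).
model = timm.create_model('mixer_b16_224_miil_in21k', pretrained=True, num_classes=100)

# The official 21K pretraining can be loaded the same way for comparison:
# model = timm.create_model('mixer_b16_224_in21k', pretrained=True, num_classes=100)

# Sanity check: a forward pass on a dummy batch at the 224x224 training resolution.
x = torch.randn(2, 3, 224, 224)
with torch.no_grad():
    logits = model(x)
print(logits.shape)  # torch.Size([2, 100])
```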

An example training code on cifar100:

```bash
python train.py \
/Cifar100Folder/ \
-b=128 \
--img-size=224 \
--epochs=50 \
--color-jitter=0 \
--squish \
--amp \
--sched='cosine' \
--model-ema --model-ema-decay=0.995 --reprob=0.5 --smoothing=0.1 \
--nonstrict_checkpoint --min-lr=1e-8 --warmup-epochs=3 --train-interpolation=bilinear --aa=v0 \
--pretrained \
--lr=2e-4 \
--model=mixer_b16_224_in21k \
--opt=adam --weight-decay=1e-4
```

These are the results we obtained for the official 21K pretraining (`--model=mixer_b16_224_in21k`) and the miil 21K pretraining (`--model=mixer_b16_224_miil_in21k`), for different hyper-parameter selections:

| Optimizer | Weight decay | Learning rate | Official pretrain Mixer-B-16 score | miil pretrain Mixer-B-16 score |
| --- | --- | --- | --- | --- |
| adam | 1e-4 | 4e-4 | 82.6 | 90.5 (+7.9) |
| adam | 1e-4 | 2e-4 | 84.0 | 91.1 (+7.1) |
| adamw | 1e-4 | 2e-4 | 84.4 | 90.9 (+6.5) |
| adamw | 1e-2 | 2e-4 | 84.7 | 90.9 (+6.2) |
| sgd | 1e-4 | 2e-4 | 91.7 | 92.4 (+0.7) |

We can see that miil pretraining reaches almost the same accuracy across all hyper-parameter choices, while the official pretraining suffers a major drop in accuracy with adam and adamw. For every hyper-parameter configuration tested, miil pretraining achieves better accuracy.
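
As a quick illustration (not part of the repository), the spread of the scores in the table quantifies this stability claim:

```python
# CIFAR-100 top-1 accuracies from the table above, one entry per hyper-parameter setting.
official = [82.6, 84.0, 84.4, 84.7, 91.7]
miil = [90.5, 91.1, 90.9, 90.9, 92.4]

# Spread (max - min) across hyper-parameter settings: smaller means more stable.
spread = lambda xs: max(xs) - min(xs)
print(f"official pretrain spread: {spread(official):.1f}")  # 9.1
print(f"miil pretrain spread:     {spread(miil):.1f}")       # 1.9
```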