Training Vision Transformers for Image Retrieval

  • (Unofficial) PyTorch implementation of "Training Vision Transformers for Image Retrieval" (El-Nouby et al., 2021).
  • The results reported in the paper have not been fully reproduced yet; in particular, differential entropy regularization has little effect on the In-shop and SOP datasets in these experiments.

Requirements

# Python 3.7
pip install -r requirements.txt

Training

  • See scripts/train.*.sh
# CUB-200-2011
python main.py \
  --model deit_small_distilled_patch16_224 \
  --max-iter 2000 \
  --dataset cub200 \
  --data-path /data/CUB_200_2011 \
  --rank 1 2 4 8 \
  --lambda-reg 0.7
# Stanford Online Products
python main.py \
  --model deit_small_distilled_patch16_224 \
  --max-iter 35000 \
  --dataset sop \
  --m 2 \
  --data-path /data/Stanford_Online_Products \
  --rank 1 10 100 1000 \
  --lambda-reg 0.7
# In-shop
python main.py \
  --model deit_small_distilled_patch16_224 \
  --max-iter 35000 \
  --dataset inshop \
  --data-path /data/In-shop \
  --m 2 \
  --rank 1 10 20 30 \
  --memory-ratio 0.2 \
  --device cuda:2 \
  --encoder-momentum 0.999 \
  --lambda-reg 0.7
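
The In-shop command above passes --encoder-momentum and --memory-ratio, which suggest a momentum (EMA) key encoder feeding a feature memory. Below is a minimal sketch of the usual MoCo-style EMA update, assuming that is what the flag controls; the function and variable names are illustrative and may not match this repository's code.

```python
import torch

@torch.no_grad()
def update_key_encoder(encoder, key_encoder, momentum=0.999):
    # EMA update of the key (momentum) encoder parameters:
    # theta_k <- m * theta_k + (1 - m) * theta_q
    for p_q, p_k in zip(encoder.parameters(), key_encoder.parameters()):
        p_k.data.mul_(momentum).add_(p_q.data, alpha=1.0 - momentum)

# Hypothetical usage: the key encoder starts as a deep copy of the online
# encoder and is updated once per training step, e.g.
#   key_encoder = copy.deepcopy(encoder)
#   update_key_encoder(encoder, key_encoder, momentum=0.999)
```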

Experiments

  • IRT_O – off-the-shelf extraction of features from a ViT backbone pre-trained on ImageNet;
  • IRT_L – fine-tuning the transformer with metric learning, in particular a contrastive loss;
  • IRT_R – additionally regularizing the output feature space to encourage uniformity (a minimal sketch of this objective appears after the results table below);
  • †: model pre-trained with distillation from a convnet teacher trained on ImageNet-1k.
Recall@k (%) on each dataset:

| Method | Backbone | SOP R@1 | R@10  | R@100 | R@1000 | CUB-200 R@1 | R@2   | R@4   | R@8   | In-Shop R@1 | R@10  | R@20  | R@30  |
|--------|----------|---------|-------|-------|--------|-------------|-------|-------|-------|-------------|-------|-------|-------|
| IRT_O  | DeiT-S   | 53.12   | 68.96 | 81.60 | 94.09  | 58.68       | 71.30 | 80.96 | 88.18 | 31.28       | 57.03 | 64.20 | 68.28 |
| IRT_L  | DeiT-S   | 83.56   | 93.29 | 97.23 | 99.03  | 73.68       | 82.58 | 88.77 | 92.71 | 93.09       | 98.28 | 98.74 | 99.02 |
| IRT_R  | DeiT-S   | 82.67   | 92.73 | 96.69 | 98.80  | 73.73       | 82.91 | 89.30 | 93.35 | 90.47       | 97.97 | 98.61 | 98.92 |
| IRT_R  | DeiT-S†  | 82.70   | 92.85 | 96.92 | 98.86  | 76.55       | 85.26 | 90.92 | 94.65 | 90.66       | 98.16 | 98.68 | 98.99 |
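
As a rough illustration of the IRT_L / IRT_R objectives, the sketch below pairs a standard margin-based contrastive loss with a Kozachenko-Leonenko style differential entropy regularizer, weighted by lambda_reg as in the --lambda-reg flag above. It is a simplified sketch under those assumptions (no cross-batch memory, illustrative margin value), not this repository's exact implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb, labels, margin=0.5):
    # Pairwise margin-based contrastive loss on L2-normalized embeddings:
    # same-class pairs are pulled together, other pairs pushed past the margin.
    emb = F.normalize(emb, dim=1)
    dist = torch.cdist(emb, emb)                      # (n, n) Euclidean distances
    pos = labels.unsqueeze(0) == labels.unsqueeze(1)  # same-class mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=emb.device)
    pos_loss = dist[pos & ~eye].pow(2).mean()
    neg_loss = F.relu(margin - dist[~pos]).pow(2).mean()
    return pos_loss + neg_loss

def koleo_regularizer(emb, eps=1e-8):
    # Differential entropy (Kozachenko-Leonenko) regularizer: maximize the
    # log distance to each embedding's nearest neighbor so that features
    # spread out more uniformly on the hypersphere.
    emb = F.normalize(emb, dim=1)
    dist = torch.cdist(emb, emb)
    dist.fill_diagonal_(float("inf"))                 # ignore self-distance
    nn_dist = dist.min(dim=1).values
    return -torch.log(nn_dist + eps).mean()

# Total objective, with lambda_reg corresponding to --lambda-reg:
#   loss = contrastive_loss(emb, labels) + lambda_reg * koleo_regularizer(emb)
```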

References

  • El-Nouby, Alaaeldin, et al. "Training vision transformers for image retrieval." arXiv preprint arXiv:2102.05644 (2021).
