Vision Transformer TF-Hub Application


Description

This repository shows how to fine-tune a Vision Transformer model from TensorFlow Hub on an image scene detection dataset.

Dataset Used

A newly collected Camera Scene Classification dataset consisting of images belonging to 30 different classes. The dataset is part of the Mobile AI Workshop @ CVPR 2021 competition. You can find the dataset details here.
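For context, a directory-structured image dataset like this can be loaded with `tf.keras.utils.image_dataset_from_directory`. This is only a minimal sketch: the path, split, image size, and batch size below are illustrative assumptions, not the repository's exact settings.

```python
import tensorflow as tf

IMAGE_SIZE = (224, 224)  # assumed input resolution; the notebooks have the exact values
BATCH_SIZE = 32          # assumed batch size

# Assumes the images are arranged one sub-directory per class,
# e.g. camera_scenes/train/<class_name>/*.jpg (hypothetical path).
train_ds = tf.keras.utils.image_dataset_from_directory(
    "camera_scenes/train",
    validation_split=0.1,
    subset="training",
    seed=42,
    image_size=IMAGE_SIZE,
    batch_size=BATCH_SIZE,
)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "camera_scenes/train",
    validation_split=0.1,
    subset="validation",
    seed=42,
    image_size=IMAGE_SIZE,
    batch_size=BATCH_SIZE,
)

# Prefetch so the input pipeline does not become the bottleneck.
train_ds = train_ds.prefetch(tf.data.AUTOTUNE)
val_ds = val_ds.prefetch(tf.data.AUTOTUNE)
```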

Models

The following Vision Transformer models are available on TensorFlow Hub:

Image Classifiers

Feature Extractors

Note: Since we want to fine-tune the model, we use the feature-extractor variants and build the image classifier on top of them, as sketched below.
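A minimal sketch of that approach is shown here. The hub handle, input resolution, preprocessing, and classifier head are illustrative assumptions; the notebooks contain the exact settings used.

```python
import tensorflow as tf
import tensorflow_hub as hub

NUM_CLASSES = 30          # camera scene classes
IMAGE_SIZE = (224, 224)   # assumed ViT input resolution

# Illustrative feature-extractor handle from the TF-Hub ViT collection;
# swap in the handle of the variant you want to fine-tune.
FE_HANDLE = "https://tfhub.dev/sayakpaul/vit_s16_fe/1"

model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=IMAGE_SIZE + (3,)),
    # The converted ViT models typically expect pixel values in [-1, 1].
    tf.keras.layers.Rescaling(scale=1.0 / 127.5, offset=-1.0),
    # trainable=True so the ViT backbone is fine-tuned, not just the head.
    hub.KerasLayer(FE_HANDLE, trainable=True),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.summary()
```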

Benchmark Results

| Sl. No | Model | No. of Parameters | Accuracy | Validation Accuracy |
| ------ | ----- | ----------------- | -------- | ------------------- |
| 1 | ViT-S/16 | 21,677,214 | 99.73% | 96.87% |
| 2 | ViT R26-S/32 (light aug) | 36,058,462 | 99.70% | 96.67% |
| 3 | ViT R26-S/32 (medium aug) | 36,058,462 | 99.80% | 97.17% |
| 4 | ViT B/32 | 87,478,302 | 99.43% | 96.87% |
| 5 | MobileNetV3Small | 2,070,158 | 95.20% | 92.73% |
| 6 | MobileNetV2 | 2,929,246 | 95.06% | 88.89% |
| 7 | BigTransfer (BiT) | | 99.53% | 96.97% |

Note: The last three results were benchmarked during the CVPR competition. You can find the repository here.
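The Accuracy and Validation Accuracy columns come from standard Keras training runs. A minimal sketch of such a run, assuming the `model`, `train_ds`, and `val_ds` from the snippets above and illustrative hyperparameters:

```python
# Assumes `model`, `train_ds`, and `val_ds` from the sketches above.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # assumed learning rate
    loss="sparse_categorical_crossentropy",  # integer labels from image_dataset_from_directory
    metrics=["accuracy"],
)

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=10,  # assumed number of epochs
)

# history.history["accuracy"] and history.history["val_accuracy"] correspond to
# the Accuracy and Validation Accuracy columns above.
```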

Notebooks

ViT S/16
ViT R26-S/32 (Light Augmentation)
ViT R26-S/32 (Medium Augmentation)
ViT B/32
ViT R50-L/32
ViT B/16
ViT L/16
ViT B/8

Links

| Sl. No | Model | Colab Notebook | TensorBoard |
| ------ | ----- | -------------- | ----------- |
| 1 | ViT-S/16 | Link | Link |
| 2 | ViT R26-S/32 (light aug) | Link | Link |
| 3 | ViT R26-S/32 (medium aug) | Link | Link |
| 4 | ViT B/32 | Link | Link |

Each model directory contains the corresponding notebook, Python script, metric graphs, training logs (in .csv), and TensorBoard callbacks.
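For reference, the TensorBoard data and .csv train-logs can be produced with standard Keras callbacks; the log paths below are illustrative, not the repository's exact layout.

```python
import tensorflow as tf

# Hypothetical log locations; in this repository each model keeps its own directory.
callbacks = [
    tf.keras.callbacks.TensorBoard(log_dir="logs/vit_s16"),       # TensorBoard event files
    tf.keras.callbacks.CSVLogger("logs/vit_s16_train_log.csv"),   # per-epoch metrics as .csv
]

# Pass `callbacks=callbacks` to `model.fit(...)` to produce the train-logs
# and TensorBoard data stored alongside each model.
```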

References

[1] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al.

[2] How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers by Steiner et al.

[3] Vision Transformer GitHub

[4] jax2tf tool

[5] Image Classification with Vision Transformer in Keras

[6] ViT-jax2tf

[7] Vision Transformers are Robust Learners, Repository

[8] Vision Transformer TF-Hub Model Collection

Acknowledgements

  • Thanks to Sayak Paul for building the TF-Hub ViT models so that Vision Transformers can be used in a straightforward way.
  • Thanks to the authors of Vision Transformers for their efforts in open-sourcing the models.

Contributors