
Vision Transformer (ViT) based Food Classification

After the remarkable performance of the Transformer architecture on natural language processing tasks (Attention Is All You Need), it has recently been applied to image classification as well. A team at Google consisting of Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby published the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale in 2020. Just like a sequence of words, a sequence of image patches is fed through the encoder blocks to obtain features, and finally a Multi-Layer Perceptron (MLP) head classifies the image. In this work, I used the Food-101 dataset, which contains 101 classes, and implemented the Vision Transformer (ViT) model for classification. Starting from the pretrained model provided by Google and after proper fine-tuning, the model reaches 90.07% test accuracy.

Project Overview

Food classification plays a vital role in various domains, such as nutrition analysis, dietary monitoring, and meal planning. Traditional approaches to food classification rely on handcrafted features or convolutional neural networks (CNNs). However, recent advancements in computer vision have introduced a new paradigm called Vision Transformers (ViTs), which have shown remarkable performance in image recognition tasks. This project aims to leverage the power of Vision Transformers for food classification tasks.
For a sequence of words, the Transformer has an encoder-decoder structure, but for an image only the encoder block is needed to extract features. Just like a sequence of words, an image can be divided into a number of patches and fed through the encoder, which computes self-attention among the patches. Finally, the MLP head classifies the image. Vision_Transformer
The Vision Transformer uses a standard Transformer encoder and fixed-size patches to achieve state-of-the-art (SOTA) results on image recognition. To perform classification, the authors follow the conventional strategy of prepending an extra learnable "classification token" to the sequence of patch embeddings. In this project I follow several steps to obtain a good classification result:

  • Data Collection and Observation
  • Data Preprocessing
  • Create the Model and Modify
  • Observe the Test result
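
Before walking through these steps, the computation described above can be summarized in a short, purely illustrative PyTorch sketch (this is not the implementation used in this repository, and the class name MiniViT is hypothetical): patch embedding, a learnable [class] token plus position embeddings, a stack of Transformer encoder blocks, and a classification head applied to the [class] token.

     import torch
     import torch.nn as nn

     class MiniViT(nn.Module):
         """Minimal ViT-style classifier for illustration only (this is not the
         implementation used in this repository)."""

         def __init__(self, image_size=224, patch_size=16, dim=768, depth=12,
                      heads=12, mlp_dim=3072, channels=3, num_classes=101):
             super().__init__()
             num_patches = (image_size // patch_size) ** 2
             # Patch embedding: a strided convolution is equivalent to flattening
             # each P x P x C patch and applying a shared linear projection to dim.
             self.patch_embed = nn.Conv2d(channels, dim,
                                          kernel_size=patch_size, stride=patch_size)
             self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
             self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
             block = nn.TransformerEncoderLayer(
                 d_model=dim, nhead=heads, dim_feedforward=mlp_dim, dropout=0.0,
                 activation="gelu", batch_first=True, norm_first=True)  # pre-LN, as in ViT
             self.encoder = nn.TransformerEncoder(block, num_layers=depth)
             self.norm = nn.LayerNorm(dim)
             self.head = nn.Linear(dim, num_classes)    # classification head

         def forward(self, x):                          # x: (B, C, H, W)
             x = self.patch_embed(x)                    # (B, dim, H/P, W/P)
             x = x.flatten(2).transpose(1, 2)           # (B, N, dim)
             cls = self.cls_token.expand(x.size(0), -1, -1)
             x = torch.cat([cls, x], dim=1) + self.pos_embed
             x = self.encoder(x)                        # self-attention over patches
             return self.head(self.norm(x[:, 0]))       # classify from the [class] token

     logits = MiniViT()(torch.randn(2, 3, 224, 224))    # shape (2, 101)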

Data Collection and Observation

The dataset I used in this project is Food-101. It contains 101 classes with a total of 101,000 images (1,000 per class). Some sample images: sample_food_image

Data Preprocessing

The first task is to process and split the dataset for the experiments. The downloaded dataset has the following layout:

     food-101
     .  images
     .      apple_pie
     .          all_images
     .      baby_back_ribs
     .          all_images
     .      .
     .      .
     .      waffles
     .          all_images 


Run the Python script to rearrange the dataset into the following layout: python rearrange.py

     food-101
     .  train
     .      apple_pie
     .          all_images
     .      baby_back_ribs
     .          all_images
     .      .
     .      .
     .      waffles
     .          all_images
     .  test
     .      apple_pie
     .          all_images
     .      baby_back_ribs
     .          all_images
     .      .
     .      .
     .      waffles
     .          all_images
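
The repository's rearrange.py produces this split. As a rough guide, a minimal sketch of such a script could look like the following; it assumes the standard Food-101 meta/train.txt and meta/test.txt split files (750 training and 250 test images per class), and its names and paths may differ from the actual script:

     import shutil
     from pathlib import Path

     # Hypothetical sketch of a rearrange script -- the actual rearrange.py in
     # this repository may differ.  It assumes the standard Food-101 layout,
     # whose meta/train.txt and meta/test.txt list entries such as
     # "apple_pie/1005649" (class_name/image_id, no extension).
     ROOT = Path("food-101")

     def copy_split(list_file: str, target: str) -> None:
         for line in (ROOT / "meta" / list_file).read_text().splitlines():
             cls, img_id = line.strip().split("/")
             src = ROOT / "images" / cls / f"{img_id}.jpg"
             dst_dir = ROOT / target / cls
             dst_dir.mkdir(parents=True, exist_ok=True)
             shutil.copy(src, dst_dir / src.name)

     copy_split("train.txt", "train")   # 750 images per class
     copy_split("test.txt", "test")     # 250 images per class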
          

Create the Model and Modify

The Vision Transformer follows the four equations below to extract the image features. four-equations-vit-paper
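
For reference, the four equations from the ViT paper are: a linear patch embedding plus position embeddings with a prepended [class] token (Eq. 1), L alternating residual blocks of multi-head self-attention (MSA) and MLP, each preceded by LayerNorm (Eq. 2-3), and the final image representation taken from the [class] token (Eq. 4):

     \begin{aligned}
     \mathbf{z}_0 &= [\mathbf{x}_{\text{class}};\ \mathbf{x}_p^1\mathbf{E};\ \mathbf{x}_p^2\mathbf{E};\ \cdots;\ \mathbf{x}_p^N\mathbf{E}] + \mathbf{E}_{pos},
         \qquad \mathbf{E}\in\mathbb{R}^{(P^2\cdot C)\times D},\ \mathbf{E}_{pos}\in\mathbb{R}^{(N+1)\times D} && (1)\\
     \mathbf{z}'_\ell &= \mathrm{MSA}(\mathrm{LN}(\mathbf{z}_{\ell-1})) + \mathbf{z}_{\ell-1}, \qquad \ell = 1,\dots,L && (2)\\
     \mathbf{z}_\ell &= \mathrm{MLP}(\mathrm{LN}(\mathbf{z}'_\ell)) + \mathbf{z}'_\ell, \qquad \ell = 1,\dots,L && (3)\\
     \mathbf{y} &= \mathrm{LN}(\mathbf{z}_L^0) && (4)
     \end{aligned}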
The initial image has shape H×W×C; after it is converted into patches, the shape becomes N×(P²·C), where (P, P) is the resolution of each image patch and N = HW/P² is the resulting number of patches.

Before

pizza_global

After converting into patches

pizza_patchified
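
To make the patch arithmetic concrete, here is a small illustrative PyTorch snippet (the helper patchify is hypothetical and not part of this repository) that splits a 224×224 RGB image into 14×14 = 196 flattened patches of dimension 16·16·3 = 768:

     import torch

     def patchify(img: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
         """Split a (C, H, W) image into N flattened patches of shape (N, P*P*C).

         Purely illustrative -- ViT implementations usually fuse this step with
         the linear projection by using a strided Conv2d.
         """
         c, h, w = img.shape
         p = patch_size
         assert h % p == 0 and w % p == 0, "image size must be divisible by patch size"
         patches = img.unfold(1, p, p).unfold(2, p, p)   # (C, H/P, W/P, P, P)
         return patches.permute(1, 2, 3, 4, 0).reshape(-1, p * p * c)

     x = torch.randn(3, 224, 224)       # e.g. the pizza image as a tensor
     print(patchify(x).shape)           # torch.Size([196, 768])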

Configs

According to the original Vision Transformer paper, the ViT-B/16 configuration used here is (collected into a single Python snippet after this list):

  • image_size: 224.
    Image size. If you have rectangular images, make sure your image size is the maximum of the width and height
  • patch_size: 16
    Size of each square patch (in pixels). image_size must be divisible by patch_size.
  • dim: 768.
    Last dimension of output tensor after linear transformation nn.Linear(..., dim).
  • depth: 12.
    Number of Transformer blocks.
  • heads: 12.
    Number of heads in Multi-head Attention layer.
  • mlp_dim: 3072.
    Dimension of the MLP (FeedForward) layer.
  • channels: int, default 3.
    Number of image channels.
  • dropout: float between [0, 1], default 0.
    Dropout rate.
  • emb_dropout: float between [0, 1], default 0.
    Embedding dropout rate.
  • learning_rate: 3e-2.
    Learning rate used to update the parameters during fine-tuning.
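
Collected in one place, these hyperparameters correspond to the ViT-B/16 configuration. The names below are illustrative; the actual config object in train.py may be organized differently.

     # ViT-B/16 hyperparameters used in this project (illustrative names; the
     # actual config object in train.py may differ).
     vit_b16_config = dict(
         image_size=224,     # input resolution
         patch_size=16,      # 224 / 16 = 14 patches per side -> 196 patches
         dim=768,            # hidden size of the Transformer
         depth=12,           # number of encoder blocks
         heads=12,           # attention heads per block
         mlp_dim=3072,       # hidden size of the feed-forward sub-layer
         channels=3,         # RGB input
         dropout=0.0,
         emb_dropout=0.0,
         num_classes=101,    # Food-101
     )
     learning_rate = 3e-2    # learning rate used for fine-tuning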

Pre-trained model download (Google's Official Checkpoint)

There are several pretrained models available, and in this work I use ViT-B_16 (85.8M parameters): Available models. The path of the pretrained checkpoint must be set in train.py.

Run the Python script to train the model: python train.py

Observe the Test result

The test accuracy is 90.07%.

Citations

@article{vaswani2017attention,
  title={Attention is all you need},
  author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, {\L}ukasz and Polosukhin, Illia},
  journal={Advances in neural information processing systems},
  volume={30},
  year={2017}
}
@article{dosovitskiy2020,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and  Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  journal={arXiv preprint arXiv:2010.11929},
  year={2020}
}
@inproceedings{bossard14,
  title = {Food-101 -- Mining Discriminative Components with Random Forests},
  author = {Bossard, Lukas and Guillaumin, Matthieu and Van Gool, Luc},
  booktitle = {European Conference on Computer Vision},
  year = {2014}
}