ViT - An Image is Worth 16x16 Words

Part 1 (Click here): Implements the model from this blog and trains ViT on Cats vs Dogs classification. Transfer learning can be used if needed.

Part 2 (Click here): Explains each of the Vision Transformer classes.
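As a quick illustration of the "16x16 words" idea in the title, here is a minimal sketch of how an image can be cut into 16x16 patches that ViT treats as tokens. It is written in plain Python to stay dependency-free and is not taken from either part of the series; the function name `image_to_patches` and the 224x224 RGB input size are assumptions made for the example.

```python
def image_to_patches(image, patch_size=16):
    """Split an image into flattened, non-overlapping square patches.

    `image` is a nested list indexed as image[row][col][channel].
    Returns a list of patches in row-major order; each patch is a flat
    list of patch_size * patch_size * channels values.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch: rows, then columns, then channels.
            patch = [
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ]
            patches.append(patch)
    return patches


# Hypothetical input: a blank 224x224 RGB image.
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
tokens = image_to_patches(image)
print(len(tokens), len(tokens[0]))  # 196 768
```

A 224x224 image gives 14 x 14 = 196 patches, and each patch flattens to 16 x 16 x 3 = 768 values. In a real model each flattened patch would then pass through a learned linear projection before entering the Transformer; a tensor library would normally do the splitting with a reshape rather than Python loops.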