This repository has been archived by the owner on Mar 15, 2024. It is now read-only.

The ablation experiment of DeiT #215

Berry-Wu opened this issue Apr 21, 2023 · 2 comments


Berry-Wu commented Apr 21, 2023

Hi, thanks for your great work!
I'm confused about the settings of the distillation ablation experiment of DeiT below:
[screenshot of the distillation ablation table from the paper]
As you can see, the "DeiT– usual distillation" and "DeiT– hard distillation" rows seem not to use the GT labels for training?
But in an earlier version of the paper, the setting is the opposite, which indicates that the GT labels are used for training. Like this:
[screenshot of the corresponding table from the earlier version of the paper]
In this experiment, the result indicates that a model supervised by the teacher's output is better than one supervised by the GT labels. Is that right?
Could you explain the reason for this phenomenon? Looking forward to your reply! :)

TouvronHugo (Contributor) commented

Hi @Berry-Wu ,
Thanks for your message.
Sorry, the table is maybe not very clear: we do use the GT labels with the different distillation approaches.
The advantage of distillation is that it can adapt to the data augmentation, which can make the GT label noisy (see the example below).
[screenshot: example of data augmentation making the ground-truth label noisy]
Best,
Hugo
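
For reference, a minimal sketch of the hard-distillation objective described in the paper: the class-token head is supervised by the GT labels, the distillation-token head by the teacher's hard prediction, and the two cross-entropy terms are averaged. The function and argument names below are illustrative, not the repository's actual API.

```python
import torch.nn.functional as F

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, targets):
    # Teacher's hard label, computed on the same augmented image the student sees,
    # so it follows the augmentation even when the GT label has become noisy.
    teacher_labels = teacher_logits.argmax(dim=-1)

    loss_gt = F.cross_entropy(cls_logits, targets)                # GT supervision of the class head
    loss_teacher = F.cross_entropy(dist_logits, teacher_labels)   # teacher supervision of the distillation head
    return 0.5 * loss_gt + 0.5 * loss_teacher
```

At test time the paper combines the two heads by late fusion of their softmax outputs.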

Berry-Wu (Author) commented

@TouvronHugo
Thanks for your reply! :)
