This repository has been archived by the owner on Mar 15, 2024. It is now read-only.

The ablation experiment of DeiT #215

Berry-Wu opened this issue Apr 21, 2023 · 2 comments


Berry-Wu commented Apr 21, 2023

Hi, thanks for your great work!
I'm confused about the settings of the distillation ablation experiment of DeiT below:
[screenshot of the distillation ablation table from the paper]
As you can see, the "DeiT– usual distillation" and "DeiT– hard distillation" rows seem not to use the GT labels for training?
But in an earlier version of the paper, the setting is the opposite, which indicates that the GT labels are used for training. Like this:
[screenshot of the corresponding table from the earlier version of the paper]
In this experiment, the result indicates that a model supervised by the teacher's output is better than one supervised by the GT labels. Is that right?
Could you explain the reason for this phenomenon? Looking forward to your reply! :)

TouvronHugo (Contributor) commented

Hi @Berry-Wu ,
Thanks for your message.
Sorry, the table is maybe not very clear: we do use the GT labels with the different distillation approaches.
The advantage of distillation is that it can adapt to the data augmentation, which can make the GT label noisy (see the example below).
[screenshot: example of data augmentation making the ground-truth label noisy]
Best,
Hugo
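
For reference, a minimal sketch of the hard-distillation objective described in the paper: the class-token head is supervised by the GT labels, the distillation-token head by the teacher's hard prediction, and the two cross-entropy terms are averaged. The function and argument names below are illustrative, not the repository's actual API.

```python
import torch.nn.functional as F

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, targets):
    # Teacher's hard label, computed on the same augmented image the student sees,
    # so it follows the augmentation even when the GT label has become noisy.
    teacher_labels = teacher_logits.argmax(dim=-1)

    loss_gt = F.cross_entropy(cls_logits, targets)                # GT supervision of the class head
    loss_teacher = F.cross_entropy(dist_logits, teacher_labels)   # teacher supervision of the distillation head
    return 0.5 * loss_gt + 0.5 * loss_teacher
```

At test time the paper combines the two heads by late fusion of their softmax outputs.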

Berry-Wu (Author) commented

@TouvronHugo
Thanks for your reply! :)
