
Librispeech960 Pretrained Model #27

Open
JDRanpariya opened this issue Dec 15, 2023 · 3 comments
Labels
question Further information is requested

Comments

@JDRanpariya

Hey, I'm curious why you don't have a frame-based model pretrained on Librispeech960. I saw you were recommending frame-based models. Do you have a frame-based Librispeech pretrained model?

@JDRanpariya
Author

How much data should I use for fine-tuning to get decent results and avoid initial overfitting? Would 50 files of 5 seconds each work for training? What's the general rule for dealing with overfitting in transformers? Do we really need more data to fine-tune with, or is it a matter of hyperparameters?

@YuanGongND
Owner

hi there,

Hey, I'm curious why you don't have a frame-based model pretrained on Librispeech960. I saw you were recommending frame-based models. Do you have a frame-based Librispeech pretrained model?

We do have an AudioSet+Librispeech pretrained checkpoint for the frame-based AST; see https://github.com/YuanGongND/ssast#pretrained-models. One conclusion of our ablation study is that this checkpoint is better than a model trained solely on Librispeech, even on speech tasks.

Note that for speech tasks, we do not mean ASR, but speech classification, e.g., command recognition, emotion recognition, etc.

How much data should I use for fine-tuning to get decent results and avoid initial overfitting? Would 50 files of 5 seconds each work for training? What's the general rule for dealing with overfitting in transformers? Do we really need more data to fine-tune with, or is it a matter of hyperparameters?

It is hard to estimate, as there are many factors (e.g., how many classes there are, how easy the sounds are to separate). You would need to try, but 50 files is a very small number. The smallest dataset we tested is ESC-50 (50 classes, 40 samples each, 2,000 samples in total).
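For scale, the ESC-50 figures above can be put next to the 50-file set proposed in the question with a quick back-of-the-envelope comparison (the 5-second clip length for the proposed set comes from the question; ESC-50 clips are also 5 seconds each):

```python
# ESC-50, the smallest dataset tested in the SSAST experiments
esc50_classes = 50
esc50_samples_per_class = 40
esc50_total = esc50_classes * esc50_samples_per_class  # 2000 clips

# The proposed fine-tuning set from the question above
proposed_files = 50
clip_seconds = 5  # both datasets use 5-second clips

# Total audio, in minutes
esc50_minutes = esc50_total * clip_seconds / 60
proposed_minutes = proposed_files * clip_seconds / 60

print(f"ESC-50: {esc50_total} clips, {esc50_minutes:.1f} min of audio")
print(f"Proposed set: {proposed_files} clips, {proposed_minutes:.1f} min of audio")
print(f"ESC-50 is {esc50_total // proposed_files}x larger")
```

So the proposed set would be roughly 40x smaller than the smallest configuration that was actually tested, which is why heavy overfitting is to be expected without more data or aggressive regularization.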

-Yuan

@YuanGongND YuanGongND added the question Further information is requested label Dec 16, 2023
@JDRanpariya
Author

Hey, thanks Yuan,

Nice answer. I think I get the idea of what factors to look at when deciding on the smallest viable dataset. Kudos!

I appreciate your answer about the Librispeech model; I guess I should have framed the question a little differently. Anyway, from what I understand, Frame-400 trained on both AudioSet and Librispeech should perform better than the others for speech classification.

Looking at the ablation study in the paper and Table 2, I can't find whether Librispeech(only) was trained with patch or frame masking. From Table 5, I can see that Librispeech-only was paired with patch.

It would be nice to see benchmarks for Librispeech-only with frames on speech tasks. It's just that I'm unable to find them either in the paper or in the GitHub README. Apologies for the inconvenience.

Best Regards,
Jaydeep
