
Librispeech960 Pretrained Model #27

Open
JDRanpariya opened this issue Dec 15, 2023 · 3 comments
Labels
question Further information is requested

Comments

@JDRanpariya

Hey, I'm curious why you don't have a frame-based model pretrained on Librispeech960. I saw you were recommending frame-based models. Do you have a frame-based Librispeech pretrained model?

@JDRanpariya
Author

How much data should I use for fine-tuning to get decent results and avoid initial overfitting? Would 50 files of 5 seconds each work for training? What's the general rule for dealing with overfitting in transformers? Do we really need more data to fine-tune with, or is it a matter of hyperparameters?

@YuanGongND
Owner

hi there,

Hey, I'm curious why you don't have a frame-based model pretrained on Librispeech960. I saw you were recommending frame-based models. Do you have a frame-based Librispeech pretrained model?

We do have an AudioSet+Librispeech pretrained checkpoint for the frame-based AST; see https://github.com/YuanGongND/ssast#pretrained-models. One conclusion of our ablation study is that this checkpoint is better than a model trained solely on Librispeech, even on speech tasks.

Note that for speech tasks, we do not mean ASR, but speech classification, e.g., command recognition, emotion recognition, etc.

How much data should I use for fine-tuning to get decent results and avoid initial overfitting? Would 50 files of 5 seconds each work for training? What's the general rule for dealing with overfitting in transformers? Do we really need more data to fine-tune with, or is it a matter of hyperparameters?

It is hard to estimate, as there are many factors (e.g., how many classes there are, how easy the sounds are to separate). You would need to try, but 50 files is a very small number. The smallest dataset we tested is ESC-50 (50 classes, 40 samples each, 2,000 samples in total).
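For scale, the ESC-50 figures above can be put next to the 50-file set proposed in the question with a quick back-of-the-envelope comparison (the 5-second clip length for the proposed set comes from the question; ESC-50 clips are also 5 seconds each):

```python
# ESC-50, the smallest dataset tested in the SSAST experiments
esc50_classes = 50
esc50_samples_per_class = 40
esc50_total = esc50_classes * esc50_samples_per_class  # 2000 clips

# The proposed fine-tuning set from the question above
proposed_files = 50
clip_seconds = 5  # both datasets use 5-second clips

# Total audio, in minutes
esc50_minutes = esc50_total * clip_seconds / 60
proposed_minutes = proposed_files * clip_seconds / 60

print(f"ESC-50: {esc50_total} clips, {esc50_minutes:.1f} min of audio")
print(f"Proposed set: {proposed_files} clips, {proposed_minutes:.1f} min of audio")
print(f"ESC-50 is {esc50_total // proposed_files}x larger")
```

So the proposed set would be roughly 40x smaller than the smallest configuration that was actually tested, which is why heavy overfitting is to be expected without more data or aggressive regularization.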

-Yuan

@YuanGongND YuanGongND added the question Further information is requested label Dec 16, 2023
@JDRanpariya
Author

Hey, thanks Yuan,

Nice answer. I think I get the idea of what factors to look at when deciding on the smallest viable dataset. Kudos!

I appreciate your answer about the Librispeech model; I guess I should have framed the question a little differently. Anyway, from what I understand, Frame-400 trained on both AudioSet and Librispeech should perform better than the others for speech classification.

Looking at the ablation study in the paper and Table 2, I can't find whether Librispeech(only) was trained with patch or frame masking. From Table 5, I can see that Librispeech-only was paired with patch.

It would be nice to see benchmarks for Librispeech-only with frames on speech tasks. It's just that I'm unable to find them either in the paper or in the GitHub README. Apologies for the inconvenience.

Best Regards,
Jaydeep
