
How was the ground truth in the article set? How to get it? #16

Open

Ko-vey opened this issue May 17, 2022 · 4 comments

Comments

Ko-vey commented May 17, 2022:

How was the ground truth in the article set? How can we get it?

RicherMans (Owner) commented:

Sorry, can you explain exactly what you refer to as the ground truth?

RicherMans (Owner) commented:

It's from the DCASE 2018 and 2019 datasets; they strongly labeled their evaluation sets.

Ko-vey (Author) commented May 17, 2022:

> Sorry, can you explain exactly what you refer to as the ground truth?

[Two screenshots (2022-05-17 205050 and 2022-05-17 205126) showing plots of speech activation with ground-truth labels.]

Thanks for your brilliant work and patience! There are a few questions I would like to ask:

  1. As in the pictures shown above, how was the ground-truth label for the speech activation period set for evaluation and comparison?
  2. How do we know the performance of a student model trained with the help of a teacher model (t1 or t2) on a new dataset without exact frame-level labels?
  3. I am currently working on a birdcall activation detection task based on your model, but after replacing the speech label with a birdcall label in the teacher-student approach on the Audioset balanced subset, the new student model seems to learn nothing from t1 and performs poorly on bird audio files. Could you give some advice on how to train a proper model?

RicherMans (Owner) commented:

Oh hey, yeah no problem with these questions:

> As in the pictures shown above, how was the ground-truth label for the speech activation period set for evaluation and comparison?

It's manually labeled by the DCASE authors, nothing special here; it's not predicted by any of my models. All these datasets are publicly available here.

> How do we know the performance of a student model trained with the help of a teacher model (t1 or t2) on a new dataset without exact frame-level labels?

I mean, you can use some external dataset for cross-validation during training. I forgot what I used for validation, or whether I did any at all. Usually this approach should work.
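For a concrete picture, here is a minimal sketch of such a validation step against an external strongly labeled set (e.g. the DCASE evaluation data). `student_model` and the `(features, frame_labels)` pairs are hypothetical placeholders, not code from this repository:

```python
# Frame-level F1 of a student against external strong labels.
# `student_model` is a hypothetical model exposing a .predict() that
# returns per-frame speech probabilities; adapt to your own interface.
import numpy as np
from sklearn.metrics import f1_score

def validate_student(student_model, eval_items, threshold=0.5):
    """eval_items yields (features, frame_labels): (T, D) features, (T,) 0/1 labels."""
    preds, targets = [], []
    for features, frame_labels in eval_items:
        probs = student_model.predict(features)   # assumed: (T,) speech probabilities
        preds.append(probs >= threshold)
        targets.append(frame_labels.astype(bool))
    return f1_score(np.concatenate(targets), np.concatenate(preds))
```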

> I am currently working on a birdcall activation detection task based on your model, but after replacing the speech label with a birdcall label in the teacher-student approach on the Audioset balanced subset, the new student model seems to learn nothing from t1 and performs poorly on bird audio files. Could you give some advice on how to train a proper model?

Oh yeah, that's an interesting task! So my current teacher model is pretty bad in comparison to some other models on Audioset.
But you need to recall that roughly 40% of all labels in Audioset are speech, and that the labeling procedure for this "Speech" label is rather precise (because, I mean, it's speech, nothing complicated).
Just recall that my model has seen at least ~2 million samples containing speech.
On the other hand, your task using birds is much more complicated. Further, the labels in Audioset might not be "optimal", to say the least, since many different labels describe birds, and "bird"-related classes are pretty rare compared to "Speech". Even though a model might achieve a high mAP on the dataset, that does not mean it can effectively predict these classes.

I recommend you first fine-tune your teacher model on a bird-specific dataset, then re-estimate frame labels on the balanced dataset, and then train a student. It might be worth a try!
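For illustration, a rough sketch of that three-step recipe in PyTorch. `teacher`, `student`, and both loaders are hypothetical stand-ins, and the assumed `(clip_probs, frame_probs)` output signature mirrors typical SED teacher models rather than this repository's exact API:

```python
import torch

def finetune_teacher(teacher, bird_loader, epochs=1, lr=1e-4):
    """Step 1: adapt the Audioset teacher to clip-level bird labels."""
    opt = torch.optim.Adam(teacher.parameters(), lr=lr)
    bce = torch.nn.BCELoss()
    teacher.train()
    for _ in range(epochs):
        for feats, clip_labels in bird_loader:
            clip_probs, _ = teacher(feats)       # assumed (clip, frame) outputs
            loss = bce(clip_probs, clip_labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return teacher

def estimate_frame_labels(teacher, audioset_loader, threshold=0.5):
    """Step 2: re-estimate frame-level pseudo labels on the balanced subset."""
    teacher.eval()
    pseudo = []
    with torch.no_grad():
        for feats, _ in audioset_loader:
            _, frame_probs = teacher(feats)      # (B, T) birdcall probabilities
            pseudo.append((feats, (frame_probs > threshold).float()))
    return pseudo

def train_student(student, pseudo, epochs=1, lr=1e-4):
    """Step 3: train the student directly on the pseudo frame labels."""
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    bce = torch.nn.BCELoss()
    student.train()
    for _ in range(epochs):
        for feats, frame_targets in pseudo:
            loss = bce(student(feats), frame_targets)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```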
