
How can I get the pretrained_teacher model myself? #2

Open
SteveTanggithub opened this issue Jul 19, 2023 · 1 comment

Comments

@SteveTanggithub

How can I get the pretrained_teacher model myself instead of using the ones you provided?

@RicherMans
Owner

Hey there,
just train a classifier on the standard 10 s scale on AudioSet.
You might have noticed that I generally avoid publishing the code for 10 s training on AudioSet, mainly because downloading the dataset is problematic and would lead to a lot of questions.
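As a rough illustration of that recipe, here is a minimal multi-label (BCE) training step at the 10 s clip level. The real teacher would be a CNN/CRNN or transformer trained on the full AudioSet; this sketch substitutes a linear probe on random stand-in features, and all dimensions and names are assumptions, not the repository's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, feat_dim = 527, 128  # AudioSet tag set; feature dim is assumed

# Linear probe as a stand-in for the real CNN/transformer backbone.
W = np.zeros((feat_dim, n_classes))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(feats, labels, lr=0.1):
    """One binary-cross-entropy gradient step on clip-level features."""
    global W
    probs = feats @ W
    probs = sigmoid(probs)                      # (batch, classes)
    grad = feats.T @ (probs - labels) / len(feats)
    W -= lr * grad
    # Mean BCE over all clips and tags (multi-label objective).
    return -np.mean(labels * np.log(probs + 1e-9)
                    + (1 - labels) * np.log(1 - probs + 1e-9))

feats = rng.standard_normal((8, feat_dim))          # pooled 10 s clip features
labels = (rng.random((8, n_classes)) < 0.02).astype(float)  # sparse tags
losses = [train_step(feats, labels) for _ in range(50)]
```

The multi-label BCE objective is the standard choice for AudioSet tagging; everything else here (linear probe, random features) is only a placeholder for the real pipeline.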

There are just some caveats that are not really discussed in the paper, but are vital to the success of a good teacher:

  1. For CNNs/CRNNs, do not use global average (or any other global) pooling methods. These generally do not generalize to the smaller-sized inputs that PSL requires. Instead, always use decision-mean or decision-max pooling, which can accurately predict audio tags at the single-frame level. Here is an issue where we discussed the drawbacks of global pooling methods.
  2. For Transformers, the positional embedding is crucial. Many implementations that I have seen use a wrong positional embedding. For example, AST uses a joint time-frequency positional embedding, which cannot adapt when the input size changes; that repo therefore resorts to the extreme solution of always padding to 10 s, which is completely unbearable in my opinion. The easiest way to avoid any problems is to use independent time and frequency positional embeddings, as proposed in PaSST, which also scale to smaller input lengths. These embeddings actually make more sense than a traditional transformer positional embedding: in a spectrogram, the position of each frequency bin is always fixed, so you only need to encode that position once across all time frames rather than giving every (time, frequency) position a unique representation.
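To make caveat 1 concrete, here is a minimal NumPy sketch (the frame logits, class count, and function names are illustrative) of decision-mean and decision-max pooling: per-frame *predictions* are pooled rather than features, so the same model produces valid tags for 10 s clips and for the short crops PSL needs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decision_mean_pool(frame_logits):
    """Clip-level tag probabilities from per-frame logits.

    frame_logits: (time, classes) array. Each frame is first turned
    into a tag prediction, then the predictions are averaged, so every
    frame-level output is a valid audio tag on its own.
    """
    return sigmoid(frame_logits).mean(axis=0)

def decision_max_pool(frame_logits):
    """Alternative: take the most confident frame per class."""
    return sigmoid(frame_logits).max(axis=0)

# Works identically for a 10 s clip and a 1 s crop, because pooling
# happens over predictions, not over an input-size-dependent feature map.
rng = np.random.default_rng(0)
long_clip = rng.standard_normal((1000, 527))   # AudioSet has 527 classes
short_clip = long_clip[:100]
assert decision_mean_pool(long_clip).shape == (527,)
assert decision_mean_pool(short_clip).shape == (527,)
```

The design point is that nothing in the pooling depends on the number of frames, which is exactly what breaks when a global-pooled feature map is fed to a classifier trained only on 10 s inputs.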
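And a small sketch of caveat 2, assuming PaSST-style disentangled embeddings: separate time and frequency tables are added to the patch embeddings, so a shorter input only slices the time table while the frequency table is reused unchanged (all shapes and names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, max_time, n_freq = 192, 1000, 64

# Two independent tables: one entry per frequency bin and one per time
# step, instead of a single joint (time x freq) positional grid.
freq_embed = rng.standard_normal((n_freq, d_model))
time_embed = rng.standard_normal((max_time, d_model))

def add_pos_embed(patches):
    """patches: (time, freq, d_model) patch embeddings.

    For a shorter input we simply slice the time table; the frequency
    table is reused as-is, since frequency positions in a spectrogram
    never move.
    """
    t, f, _ = patches.shape
    return patches + time_embed[:t, None, :] + freq_embed[None, :f, :]

ten_sec = rng.standard_normal((1000, n_freq, d_model))
one_sec = rng.standard_normal((100, n_freq, d_model))
assert add_pos_embed(ten_sec).shape == (1000, 64, 192)
assert add_pos_embed(one_sec).shape == (100, 64, 192)
```

A joint (time x freq) embedding table would have to be interpolated or the input padded to 10 s when the length changes; slicing the time table sidesteps both.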
