
Target length #11

Open
kremHabashy opened this issue Jul 15, 2022 · 2 comments
Labels
question Further information is requested

Comments

@kremHabashy

Hi Yuan,

Thanks again for this great work; I have been using both this and the original AST model for some downstream tasks. I am currently looking into some other time-series data and was wondering whether there was a particular reason you chose 10 seconds for the audio length during AudioSet pretraining. Why not 5 or 15 seconds? Did you consult any specific resources to reach this choice, or was it more arbitrary?

Thanks,
Karim

@YuanGongND YuanGongND added the question Further information is requested label Jul 15, 2022
@YuanGongND (Owner) commented Jul 15, 2022

Hi Karim,

The main reason is that AudioSet, the primary dataset we used to pretrain the SSAST model, mostly consists of 10-second audio clips. Using longer or shorter audio lengths is perfectly fine. In my opinion, when the downstream task is unknown, the longer the pretraining audio length the better, because we cut or interpolate the positional embedding to adjust the audio length between the pretraining and fine-tuning stages, and cutting should be better than interpolating. However, Transformer self-attention is O(n^2) in the number of patches, so longer input is more computationally expensive.

This is the code that adapts the positional embedding to a different input length:

```python
# reshape the pretrained patch positional embedding into a (freq, time) grid
new_pos_embed = self.v.pos_embed[:, self.cls_token_num:, :].detach().reshape(1, p_num_patches, self.original_embedding_dim).transpose(1, 2).reshape(1, self.original_embedding_dim, p_f_dim, p_t_dim)
# cut or interpolate the positional embedding along the time axis
if t_dim < p_t_dim:
    # shorter target: center-crop the time dimension
    new_pos_embed = new_pos_embed[:, :, :, int(p_t_dim / 2) - int(t_dim / 2): int(p_t_dim / 2) - int(t_dim / 2) + t_dim]
else:
    # longer target: stretch time, keeping the pretraining frequency grid (8)
    new_pos_embed = torch.nn.functional.interpolate(new_pos_embed, size=(8, t_dim), mode='bilinear')
# then do the same along the frequency axis
if f_dim < p_f_dim:
    # note: the slice end must use f_dim here, not t_dim
    new_pos_embed = new_pos_embed[:, :, int(p_f_dim / 2) - int(f_dim / 2): int(p_f_dim / 2) - int(f_dim / 2) + f_dim, :]
else:
    new_pos_embed = torch.nn.functional.interpolate(new_pos_embed, size=(f_dim, t_dim), mode='bilinear')
```
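To make the cut/interpolate idea concrete, here is a minimal self-contained sketch of the same logic on a dummy embedding tensor. The function name `adapt_pos_embed`, the 8 x 64 patch grid (roughly a 10 s input with 16 x 16 patches), and the embedding size 768 are illustrative assumptions, not values taken from the repo:

```python
import torch

def adapt_pos_embed(pos_embed, p_f_dim, p_t_dim, f_dim, t_dim):
    # pos_embed: (1, p_f_dim * p_t_dim, embed_dim), patch tokens only
    embed_dim = pos_embed.shape[-1]
    # reshape the flat token sequence into a (freq, time) grid
    grid = pos_embed.transpose(1, 2).reshape(1, embed_dim, p_f_dim, p_t_dim)
    # time axis: center-crop when the target is shorter, interpolate when longer
    if t_dim < p_t_dim:
        start = p_t_dim // 2 - t_dim // 2
        grid = grid[:, :, :, start:start + t_dim]
    else:
        grid = torch.nn.functional.interpolate(grid, size=(p_f_dim, t_dim), mode='bilinear')
    # frequency axis: same rule
    if f_dim < p_f_dim:
        start = p_f_dim // 2 - f_dim // 2
        grid = grid[:, :, start:start + f_dim, :]
    else:
        grid = torch.nn.functional.interpolate(grid, size=(f_dim, t_dim), mode='bilinear')
    # flatten back to a token sequence: (1, f_dim * t_dim, embed_dim)
    return grid.reshape(1, embed_dim, f_dim * t_dim).transpose(1, 2)

# example: pretrained on an 8 x 64 grid (~10 s), fine-tune on 8 x 32 (~5 s)
pe = torch.randn(1, 8 * 64, 768)
out = adapt_pos_embed(pe, 8, 64, 8, 32)
print(out.shape)  # torch.Size([1, 256, 768])
```

Since 32 < 64, the time axis is center-cropped (columns 16 to 48 of the pretrained grid survive), which preserves the original learned embeddings rather than resampling them.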

-Yuan

@kremHabashy
Author

Thank you!!
