
Target length #11

Open
kremHabashy opened this issue Jul 15, 2022 · 2 comments
Labels
question Further information is requested

Comments

@kremHabashy

Hi Yuan,

Thanks again for this great work; I have been using both this and the original AST model for some downstream tasks. I am currently looking into some other time-series data and was wondering whether there was a particular reason you chose 10 seconds for the audio length during AudioSet pretraining. Why not 5 or 15 seconds? Did you consult any specific resources to reach this choice, or was it more arbitrary?

Thanks,
Karim

@YuanGongND YuanGongND added the question Further information is requested label Jul 15, 2022
@YuanGongND (Owner) commented Jul 15, 2022

Hi Karim,

The main reason is that AudioSet, the primary dataset we used to pretrain the SSAST model, mostly consists of 10-second audio clips. Using longer or shorter audio lengths is perfectly fine. In my opinion, when the downstream task is unknown, the longer the pretraining audio length the better, because we cut or interpolate the positional embedding to adjust the audio length between the pretraining and fine-tuning stages, and cutting should be better than interpolating. However, Transformer self-attention is O(n^2) in the number of patches, so longer input is more computationally expensive.

This is the code that adapts the positional embedding to a different input length:

```python
# reshape the pretrained patch positional embedding into a (freq, time) grid
new_pos_embed = self.v.pos_embed[:, self.cls_token_num:, :].detach().reshape(1, p_num_patches, self.original_embedding_dim).transpose(1, 2).reshape(1, self.original_embedding_dim, p_f_dim, p_t_dim)
# cut or interpolate the positional embedding along the time axis
if t_dim < p_t_dim:
    # shorter target: center-crop the time dimension
    new_pos_embed = new_pos_embed[:, :, :, int(p_t_dim / 2) - int(t_dim / 2): int(p_t_dim / 2) - int(t_dim / 2) + t_dim]
else:
    # longer target: stretch time, keeping the pretraining frequency grid (8)
    new_pos_embed = torch.nn.functional.interpolate(new_pos_embed, size=(8, t_dim), mode='bilinear')
# then do the same along the frequency axis
if f_dim < p_f_dim:
    # note: the slice end must use f_dim here, not t_dim
    new_pos_embed = new_pos_embed[:, :, int(p_f_dim / 2) - int(f_dim / 2): int(p_f_dim / 2) - int(f_dim / 2) + f_dim, :]
else:
    new_pos_embed = torch.nn.functional.interpolate(new_pos_embed, size=(f_dim, t_dim), mode='bilinear')
```
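To make the cut/interpolate idea concrete, here is a minimal self-contained sketch of the same logic on a dummy embedding tensor. The function name `adapt_pos_embed`, the 8 x 64 patch grid (roughly a 10 s input with 16 x 16 patches), and the embedding size 768 are illustrative assumptions, not values taken from the repo:

```python
import torch

def adapt_pos_embed(pos_embed, p_f_dim, p_t_dim, f_dim, t_dim):
    # pos_embed: (1, p_f_dim * p_t_dim, embed_dim), patch tokens only
    embed_dim = pos_embed.shape[-1]
    # reshape the flat token sequence into a (freq, time) grid
    grid = pos_embed.transpose(1, 2).reshape(1, embed_dim, p_f_dim, p_t_dim)
    # time axis: center-crop when the target is shorter, interpolate when longer
    if t_dim < p_t_dim:
        start = p_t_dim // 2 - t_dim // 2
        grid = grid[:, :, :, start:start + t_dim]
    else:
        grid = torch.nn.functional.interpolate(grid, size=(p_f_dim, t_dim), mode='bilinear')
    # frequency axis: same rule
    if f_dim < p_f_dim:
        start = p_f_dim // 2 - f_dim // 2
        grid = grid[:, :, start:start + f_dim, :]
    else:
        grid = torch.nn.functional.interpolate(grid, size=(f_dim, t_dim), mode='bilinear')
    # flatten back to a token sequence: (1, f_dim * t_dim, embed_dim)
    return grid.reshape(1, embed_dim, f_dim * t_dim).transpose(1, 2)

# example: pretrained on an 8 x 64 grid (~10 s), fine-tune on 8 x 32 (~5 s)
pe = torch.randn(1, 8 * 64, 768)
out = adapt_pos_embed(pe, 8, 64, 8, 32)
print(out.shape)  # torch.Size([1, 256, 768])
```

Since 32 < 64, the time axis is center-cropped (columns 16 to 48 of the pretrained grid survive), which preserves the original learned embeddings rather than resampling them.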

-Yuan

@kremHabashy
Author

Thank you!!
