
I want to use frame-level ssast just for frame-level audio token extraction #16

Open
9B8DY6 opened this issue Dec 20, 2022 · 4 comments
Labels
question Further information is requested

Comments

@9B8DY6

9B8DY6 commented Dec 20, 2022


In your ast_models.py, you set cluster=True as the default.

But to use frame-level SSAST, cluster should be False. Do I have to turn it off?

If I want to use your pretrained frame-level SSAST for audio token extraction, is the output of self.v.norm(x), excluding the first token, what I should use in the finetuningcls function? Because the first one is the cls token...^^

One more thing I wonder: could I get the part of the fbank that corresponds to the video frames? A mel spectrogram can do this, but I don't know whether an fbank can.

@9B8DY6 9B8DY6 changed the title I want to use frame-level ssast I want to use frame-level ssast just for frame-level audio token extraction Dec 20, 2022
@YuanGongND
Owner

YuanGongND commented Dec 21, 2022

Hi there,

Thanks for reaching out.

1/

In your ast_models.py, you set cluster=True as the default ... But to use frame-level SSAST, cluster should be False. Do I have to turn it off?

You are correct that cluster=True is the default in the model script, but we do pass cluster=False for frame-level SSAST when we instantiate the model; please see here:

ssast/src/run.py

Lines 125 to 130 in a1a3eec

if 'pretrain' in args.task:
    cluster = (args.num_mel_bins != args.fshape)
    if cluster == True:
        print('The num_mel_bins {:d} and fshape {:d} are different, not masking a typical time frame, using cluster masking.'.format(args.num_mel_bins, args.fshape))
    else:
        print('The num_mel_bins {:d} and fshape {:d} are same, masking a typical time frame, not using cluster masking.'.format(args.num_mel_bins, args.fshape))

FYI, you can use cluster=True for frame-level SSAST, but in my experience it leads to a performance drop.
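For reference, the decision in the run.py excerpt above can be sketched as a standalone check (the function name is mine, not from the repo):

```python
def uses_cluster_masking(num_mel_bins: int, fshape: int) -> bool:
    # Cluster masking is used only when a patch does not span the full
    # frequency axis, i.e. when num_mel_bins != fshape.
    return num_mel_bins != fshape

# patch-level SSAST (e.g. 16x16 patches on 128 mel bins) -> cluster masking
print(uses_cluster_masking(128, 16))   # True
# frame-level SSAST (fshape == num_mel_bins == 128) -> no cluster masking
print(uses_cluster_masking(128, 128))  # False
```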

2/

If I want to use your pretrained frame-level SSAST for audio token extraction, is the output of self.v.norm(x), excluding the first token, what I should use in the finetuningcls function? Because the first one is the cls token

You are correct, but please be aware that cls_token_num might not always be 1 for all models.
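A minimal sketch of dropping the leading cls token(s) from the transformer output (shapes are illustrative; read cls_token_num from the model you actually load rather than hard-coding it):

```python
import torch

cls_token_num = 1          # may be 2 for checkpoints that carry a dist_token
batch, n_frames, dim = 4, 512, 768

# stand-in for the output of self.v.norm(x): cls token(s) first, then frame tokens
x = torch.randn(batch, cls_token_num + n_frames, dim)

frame_tokens = x[:, cls_token_num:, :]   # drop the leading cls token(s)
print(frame_tokens.shape)                # torch.Size([4, 512, 768])
```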

3/

One more thing I wonder: could I get the part of the fbank that corresponds to the video frames? A mel spectrogram can do this, but I don't know whether an fbank can.

This involves audio-visual learning, while this paper is about pure audio research. But we do use fbank features as input in this paper; see:

ssast/src/dataloader.py

Lines 126 to 127 in a1a3eec

fbank = torchaudio.compliance.kaldi.fbank(waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
                                          window_type='hanning', num_mel_bins=self.melbins, dither=0.0, frame_shift=10)
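On the alignment part of the question: with frame_shift=10 (ms), fbank row i starts at roughly i * 10 ms, so a video-frame time span can be mapped to a row range. A small sketch (my own helper, not from the repo), assuming the 10 ms shift used in the dataloader above:

```python
def fbank_rows_for_span(start_sec: float, end_sec: float, frame_shift_ms: float = 10.0):
    """Return the [start, end) fbank row range covering a time span.

    Assumes frame_shift_ms matches the kaldi.fbank call above (10 ms).
    """
    start = int(start_sec * 1000 / frame_shift_ms)
    end = int(round(end_sec * 1000 / frame_shift_ms))
    return start, end

# one video frame at 25 fps spans 40 ms -> 4 fbank rows
print(fbank_rows_for_span(0.5, 0.54))  # (50, 54)
```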

Hope these help.

-Yuan

@YuanGongND YuanGongND added the question Further information is requested label Dec 21, 2022
@9B8DY6
Author

9B8DY6 commented Dec 21, 2022

Have you tried other input types like mel spectrograms or MFCCs? @YuanGongND I'm going to try feeding mel spectrograms to SSAST to extract audio features. Is that okay?
Could I also ask why the cls token number is not always 1? Can it be 2 because of dist_token? And what is dist_token?

@YuanGongND
Owner

YuanGongND commented Dec 21, 2022

Have you tried other input types like mel spectrograms or MFCCs? @YuanGongND I'm going to try feeding mel spectrograms to SSAST to extract audio features. Is that okay?

I have never tried other input features. You can pretrain your own model with another input feature, but if you plan to use our pretrained model to extract features/embeddings/tokens, you have to use the same dataloader as us (fully released in this repo); any input distribution shift could cause a dramatic performance difference.

Could I also ask why the cls token number is not always 1? Can it be 2 because of dist_token? And what is dist_token?

dist_token stands for distillation token; please read our AST paper for details. SSAST does not need this token, but our code keeps it for compatibility with older AST models.

-Yuan

@mwaseemrandhawa

Have you tried other input types like mel spectrograms or MFCCs? @YuanGongND I'm going to try feeding mel spectrograms to SSAST to extract audio features. Is that okay? Could I also ask why the cls token number is not always 1? Can it be 2 because of dist_token? And what is dist_token?

Have you trained the model with mel spectrograms, and what were the results? I also want to train the model with CQT.
