
I want to use frame-level ssast just for frame-level audio token extraction #16

Open
9B8DY6 opened this issue Dec 20, 2022 · 4 comments
Labels
question Further information is requested

Comments

@9B8DY6

9B8DY6 commented Dec 20, 2022


In your ast_models.py, you set cluster=True as the default.

But to use frame-level SSAST, cluster should be False. Do I have to turn it off?

If I want to use your pretrained frame-level SSAST for audio token extraction, is the output of self.v.norm(x), excluding the first token, what I should use in the finetuningcls function? Because the first one is the cls token...^^

One more thing I wonder: could I get the part of the fbank that corresponds to the video frames? A mel spectrogram can do this, but I don't know whether an fbank can.

@9B8DY6 9B8DY6 changed the title I want to use frame-level ssast I want to use frame-level ssast just for frame-level audio token extraction Dec 20, 2022
@YuanGongND
Owner

YuanGongND commented Dec 21, 2022

Hi there,

Thanks for reaching out.

1/

In your ast_models.py, you set cluster=True as the default ... But to use frame-level SSAST, cluster should be False. Do I have to turn it off?

You are correct that cluster=True is the default in the model script, but we do pass cluster=False for frame-level SSAST when we instantiate the model; please see here:

ssast/src/run.py

Lines 125 to 130 in a1a3eec

if 'pretrain' in args.task:
    cluster = (args.num_mel_bins != args.fshape)
    if cluster == True:
        print('The num_mel_bins {:d} and fshape {:d} are different, not masking a typical time frame, using cluster masking.'.format(args.num_mel_bins, args.fshape))
    else:
        print('The num_mel_bins {:d} and fshape {:d} are same, masking a typical time frame, not using cluster masking.'.format(args.num_mel_bins, args.fshape))

FYI, you can use cluster=True for frame-level SSAST, but in my experience it leads to a performance drop.
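For reference, the decision in the run.py excerpt above can be sketched as a standalone check (the function name is mine, not from the repo):

```python
def uses_cluster_masking(num_mel_bins: int, fshape: int) -> bool:
    # Cluster masking is used only when a patch does not span the full
    # frequency axis, i.e. when num_mel_bins != fshape.
    return num_mel_bins != fshape

# patch-level SSAST (e.g. 16x16 patches on 128 mel bins) -> cluster masking
print(uses_cluster_masking(128, 16))   # True
# frame-level SSAST (fshape == num_mel_bins == 128) -> no cluster masking
print(uses_cluster_masking(128, 128))  # False
```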

2/

If I want to use your pretrained frame-level SSAST for audio token extraction, is the output of self.v.norm(x), excluding the first token, what I should use in the finetuningcls function? Because the first one is the cls token

You are correct, but please be aware that cls_token_num might not always be 1 for all models.
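A minimal sketch of dropping the leading cls token(s) from the transformer output (shapes are illustrative; read cls_token_num from the model you actually load rather than hard-coding it):

```python
import torch

cls_token_num = 1          # may be 2 for checkpoints that carry a dist_token
batch, n_frames, dim = 4, 512, 768

# stand-in for the output of self.v.norm(x): cls token(s) first, then frame tokens
x = torch.randn(batch, cls_token_num + n_frames, dim)

frame_tokens = x[:, cls_token_num:, :]   # drop the leading cls token(s)
print(frame_tokens.shape)                # torch.Size([4, 512, 768])
```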

3/

One more thing I wonder: could I get the part of the fbank that corresponds to the video frames? A mel spectrogram can do this, but I don't know whether an fbank can.

This involves audio-visual learning, while this paper is about pure audio research. But we do use fbank features as input in this paper; see:

ssast/src/dataloader.py

Lines 126 to 127 in a1a3eec

fbank = torchaudio.compliance.kaldi.fbank(waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
                                          window_type='hanning', num_mel_bins=self.melbins, dither=0.0, frame_shift=10)
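On the alignment part of the question: with frame_shift=10 (ms), fbank row i starts at roughly i * 10 ms, so a video-frame time span can be mapped to a row range. A small sketch (my own helper, not from the repo), assuming the 10 ms shift used in the dataloader above:

```python
def fbank_rows_for_span(start_sec: float, end_sec: float, frame_shift_ms: float = 10.0):
    """Return the [start, end) fbank row range covering a time span.

    Assumes frame_shift_ms matches the kaldi.fbank call above (10 ms).
    """
    start = int(start_sec * 1000 / frame_shift_ms)
    end = int(round(end_sec * 1000 / frame_shift_ms))
    return start, end

# one video frame at 25 fps spans 40 ms -> 4 fbank rows
print(fbank_rows_for_span(0.5, 0.54))  # (50, 54)
```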

Hope these help.

-Yuan

@YuanGongND YuanGongND added the question Further information is requested label Dec 21, 2022
@9B8DY6
Author

9B8DY6 commented Dec 21, 2022

Have you tried other input types like mel spectrograms or MFCCs? @YuanGongND I'm going to try feeding mel spectrograms to SSAST to extract audio features. Is that okay?
Could I also ask why the cls token number is not always 1? Can it be 2 because of dist_token? And what is dist_token?

@YuanGongND
Owner

YuanGongND commented Dec 21, 2022

Have you tried other input types like mel spectrograms or MFCCs? @YuanGongND I'm going to try feeding mel spectrograms to SSAST to extract audio features. Is that okay?

I have never tried other input features. You can pretrain your own model with another input feature, but if you plan to use our pretrained model to extract features/embeddings/tokens, you have to use the same dataloader as us (fully released in this repo); any input distribution shift could cause a dramatic performance difference.

Could I also ask why the cls token number is not always 1? Can it be 2 because of dist_token? And what is dist_token?

dist_token stands for distillation token; please read our AST paper for details. SSAST does not need this token, but our code keeps it for compatibility with older AST models.

-Yuan

@mwaseemrandhawa

Have you tried other input types like mel spectrograms or MFCCs? @YuanGongND I'm going to try feeding mel spectrograms to SSAST to extract audio features. Is that okay? Could I also ask why the cls token number is not always 1? Can it be 2 because of dist_token? And what is dist_token?

Have you trained the model with mel spectrograms, and what were the results? I also want to train the model with CQT.
