Model fails to converge on transfer to audio backtesting problem #19

Open
yangma12 opened this issue May 26, 2023 · 5 comments
Labels
question Further information is requested

Comments

@yangma12

Dear Yuan and authors,
First of all, thank you for your paper. I recently applied your pre-trained model to a regression task in personality computing. After appending several fully connected layers to your original model, the predicted values stay at a very low level throughout training, confined to a small interval, and do not change in any meaningful way. Have you run any regression experiments? What could be causing this problem?
Sorry to bother you, and thank you very much for reading my question.

yang

@YuanGongND
Owner

hi there,

Do you mean you finetune our pretrained model for a regression task?

What do you mean by this?

"After splicing several fully connected layers after your original model"

-Yuan

@YuanGongND added the question label on May 26, 2023
@yangma12
Author

Thank you for your reply! I mainly use this dataset for fine-tuning (https://chalearnlap.cvc.uab.cat/dataset/24/description/), and I extract the audio track from each sample. Each clip is 15 seconds of speech. An MLP is appended to the model to map its output to shape (batch_size, 5), where the 5 values are the regression targets for the five personality traits of each clip.

@yangma12
Author

In the experiment, I tried adjusting the learning rate and other parameters, and tried removing the masking and mixing steps from the data preprocessing. I set input_tdim to 1530 to suit my audio length and label_dim to 512, and finally performed the regression prediction with the following code:

    nn.Sequential(
        nn.Linear(in_features=512, out_features=256),
        nn.ReLU(inplace=True),
        nn.Linear(in_features=256, out_features=128),
        nn.ReLU(inplace=True),
        nn.Linear(in_features=128, out_features=6),
        nn.Sigmoid()
    )

Forgive me for not being very experienced in deep learning yet; I'm not sure where the problem might be.
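For readers choosing a value for input_tdim, the sketch below estimates how many filterbank frames a clip yields. It assumes 16 kHz audio and a Kaldi-style 25 ms window with a 10 ms shift and no edge padding; none of these settings are stated in the thread, and `num_fbank_frames` is a hypothetical helper name.

```python
def num_fbank_frames(duration_s: float,
                     sample_rate: int = 16000,       # assumed sample rate
                     frame_length_ms: float = 25.0,  # assumed window length
                     frame_shift_ms: float = 10.0) -> int:  # assumed hop
    """Estimate the frame count for Kaldi-style framing without edge padding."""
    num_samples = int(duration_s * sample_rate)
    window = int(sample_rate * frame_length_ms / 1000)
    shift = int(sample_rate * frame_shift_ms / 1000)
    if num_samples < window:
        return 0
    # One frame for the first full window, plus one per additional full shift.
    return 1 + (num_samples - window) // shift


frames_15s = num_fbank_frames(15.0)
print(frames_15s)
```

Under these assumptions a 15-second clip gives slightly fewer than 1500 frames, so a target length of 1530 would require some padding at the end of each clip.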

@YuanGongND
Owner

There are a few things:

  1. First, this appears to be a multi-modal, speech-dominated dataset, so you might want to try an audio-visual model or a speech-based model (e.g., HuBERT). In my experience, pure speech models work better for pure speech tasks; see Table 5 of the SSAST paper. For audio-visual models, we have CAV-MAE as a general audio-visual model, but again, you might need a model that focuses on faces.

  2. For this:

    nn.Sequential(
        nn.Linear(in_features=512, out_features=256),
        nn.ReLU(inplace=True),
        nn.Linear(in_features=256, out_features=128),
        nn.ReLU(inplace=True),
        nn.Linear(in_features=128, out_features=6),
        nn.Sigmoid()
    )

Is Sigmoid common for regression? Setting label_dim to 512 (which is meant for classification) and then adding a few dense layers seems redundant. You can just change the last MLP layer to a regression head:

    self.mlp_head = nn.Sequential(
        nn.LayerNorm(self.original_embedding_dim),
        nn.Linear(self.original_embedding_dim, label_dim),
    )
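On the Sigmoid question raised above, a small numeric sketch may help. A final Sigmoid confines every prediction to (0, 1), and its gradient shrinks toward zero as the pre-activation grows in magnitude. That is one possible, unconfirmed reason for predictions that stay in a narrow low band; the thread does not establish the actual cause. The function names below are illustrative only.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z: float) -> float:
    """Derivative of the logistic function with respect to z."""
    s = sigmoid(z)
    return s * (1.0 - s)


# The gradient peaks at z = 0 and decays quickly as z moves away from it.
for z in (0.0, -2.0, -6.0, -10.0):
    print(f"z={z:6.1f}  sigmoid={sigmoid(z):.5f}  grad={sigmoid_grad(z):.5f}")
```

If the targets are not known to lie in (0, 1), a plain linear output trained with a regression loss avoids both the range limit and the saturation.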

But I know very little about your task, so you will need to tune the hyperparameters yourself. For some networks, we use a larger learning rate for the MLP layer because it is randomly initialized while the other parameters are pretrained.
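A minimal sketch of the larger-learning-rate idea, using PyTorch optimizer parameter groups. `TinyBackbone` is a hypothetical stand-in so the example runs without pretrained weights; it is not the real model, and the two learning rates are illustrative values, not recommendations from the authors.

```python
import torch
import torch.nn as nn


class TinyBackbone(nn.Module):
    """Hypothetical stand-in for a pretrained model with an mlp_head attribute."""

    def __init__(self, embedding_dim: int = 768, label_dim: int = 5):
        super().__init__()
        self.original_embedding_dim = embedding_dim
        # Placeholder for the pretrained layers.
        self.encoder = nn.Linear(128, embedding_dim)
        # Regression head: linear output, no Sigmoid.
        self.mlp_head = nn.Sequential(
            nn.LayerNorm(embedding_dim),
            nn.Linear(embedding_dim, label_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, 128); mean-pool over time before the head.
        return self.mlp_head(self.encoder(x).mean(dim=1))


model = TinyBackbone(label_dim=5)

# Split parameters into the freshly initialized head and everything else.
head_params = list(model.mlp_head.parameters())
head_ids = {id(p) for p in head_params}
base_params = [p for p in model.parameters() if id(p) not in head_ids]

optimizer = torch.optim.Adam([
    {"params": base_params, "lr": 1e-5},  # illustrative: small LR for pretrained weights
    {"params": head_params, "lr": 1e-3},  # illustrative: larger LR for the new head
])

# One training step on random data, only to show that the pieces fit together.
criterion = nn.MSELoss()
x = torch.randn(4, 100, 128)
y = torch.rand(4, 5)
pred = model(x)
loss = criterion(pred, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The ratio between the two rates is itself a hyperparameter to tune for the task.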

I mainly answer questions related to what we presented in the paper; it is hard for me to answer questions about new tasks or new uses of the model.

-Yuan

@YuanGongND
Owner

Another minor point is that you said there are 5 regression values, but nn.Linear(in_features=128, out_features=6) shows 6.
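A quick shape check can catch this kind of mismatch before training starts. The sketch below assumes five regression targets per clip, as stated earlier in the thread; `NUM_TRAITS` and the random tensors are illustrative placeholders.

```python
import torch
import torch.nn as nn

NUM_TRAITS = 5  # number of regression targets per clip, per the thread

# Final layer sized to the number of targets (5, not 6).
head = nn.Linear(in_features=128, out_features=NUM_TRAITS)

features = torch.randn(4, 128)        # placeholder batch of features
targets = torch.rand(4, NUM_TRAITS)   # placeholder labels

preds = head(features)
# Fail early if the head and the labels disagree on shape.
assert preds.shape == targets.shape, (preds.shape, targets.shape)

loss = nn.functional.mse_loss(preds, targets)
```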
