Improve alignment accuracy by normalizing audio features #625

Open
wants to merge 2 commits into
base: main

IbrahimAmin1

Audio data should be pre-processed with the Wav2Vec2Processor (Wav2Vec2FeatureExtractor). I have noticed a considerable improvement in alignment accuracy (measured as mean absolute error) when the audio is normalized to zero mean and unit variance by the processor before the forward pass.
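For reference, here is a minimal NumPy sketch of the zero-mean, unit-variance normalization described above. It is not the code in this PR: the function name `zero_mean_unit_var`, the `eps` value, and the synthetic clip are assumptions made for illustration, and it only approximates what the feature extractor does when `do_normalize` is enabled.

```python
import numpy as np


def zero_mean_unit_var(waveform, eps=1e-7):
    """Scale a 1-D waveform to zero mean and (approximately) unit variance.

    `eps` is an assumed small constant that guards against division by
    zero on silent clips.
    """
    waveform = np.asarray(waveform, dtype=np.float32)
    return (waveform - waveform.mean()) / np.sqrt(waveform.var() + eps)


# Synthetic one-second clip at 16 kHz with a DC offset and low gain,
# standing in for a raw recording.
rng = np.random.default_rng(0)
raw = (0.05 * rng.standard_normal(16000) + 0.3).astype(np.float32)

normalized = zero_mean_unit_var(raw)
```

After this step the clip no longer carries its original offset or gain, which is the condition the models presumably saw during fine-tuning.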

In addition, each Hugging Face Wav2Vec2 feature extractor configuration should match the settings used when the model was fine-tuned (e.g. normalization, attention_mask usage, etc.).

A typical Hugging Face Wav2Vec2 feature extractor config file looks like this:

{
  "do_normalize": true,
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": true,
  "sampling_rate": 16000
}
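To make the role of each field concrete, the following is a rough NumPy emulation of how such a config could drive batching. It is a sketch under stated assumptions, not the Hugging Face implementation: `extract_features` is a hypothetical helper, only right padding is handled, and each clip is normalized over its real (unpadded) samples.

```python
import json

import numpy as np

# The config shown above, parsed from JSON.
CONFIG = json.loads("""
{
  "do_normalize": true,
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": true,
  "sampling_rate": 16000
}
""")


def extract_features(waveforms, config, eps=1e-7):
    """Hypothetical stand-in for a feature extractor driven by `config`."""
    if config["padding_side"] != "right":
        raise NotImplementedError("this sketch only handles right padding")

    max_len = max(len(w) for w in waveforms)
    # Start from a batch filled with the padding value, then copy clips in.
    values = np.full((len(waveforms), max_len), config["padding_value"],
                     dtype=np.float32)
    mask = np.zeros((len(waveforms), max_len), dtype=np.int32)

    for i, w in enumerate(waveforms):
        w = np.asarray(w, dtype=np.float32)
        if config["do_normalize"]:
            # Normalize over the real samples only, never the padding.
            w = (w - w.mean()) / np.sqrt(w.var() + eps)
        values[i, :len(w)] = w
        mask[i, :len(w)] = 1  # 1 marks real samples, 0 marks padding

    out = {"input_values": values}
    if config["return_attention_mask"]:
        out["attention_mask"] = mask
    return out


# Two toy clips of different lengths, padded to a common length of 8.
batch = extract_features([np.linspace(-1, 1, 8), np.linspace(0, 4, 5)], CONFIG)
```

The attention mask lets the model ignore padded positions, which is why `return_attention_mask` should agree with how the checkpoint was fine-tuned.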

To maintain backwards compatibility, I have let the user decide whether pre-processing is applied, but made it the default option because it improves alignment considerably.

IbrahimAmin1 changed the title from "Improve alignment accuracy by normalizing audio features using Wav2Ve…" to "Improve alignment accuracy by normalizing audio features" on Dec 13, 2023
Fix a typo in the preprocess argument