How can I get WER (train/valid) for the audio_finetuning task with CTC? #5489

Addison-Weatherhead commented Apr 22, 2024

I'm attempting to reproduce some results from the Data2Vec 2.0 paper, specifically the audio task results, using the recommended commands from the Data2Vec 2.0 readme. I've downloaded the data2vec Base model (no fine-tuning) and the Libri-Light 10h data, and run libri_labels.py to obtain labels. The config I'm using for fine-tuning is largely based on the vox_10h.yaml recommended in the readme, with a couple of changes; my full config is below:

```yaml
# @package _group_

common:
  fp16: true
  log_format: json
  log_interval: 50
  log_file: /h/myusername/fairseq/logs/log.json

checkpoint:
  save_interval: 10
  save_interval_updates: 10000
  keep_interval_updates: 1
  no_epoch_checkpoints: true
  best_checkpoint_metric: wer

task:
  _name: audio_finetuning
  data: ???
  normalize: true
  labels: ltr

dataset:
  num_workers: 2
  max_tokens: 1280000
  skip_invalid_size_inputs_valid_test: true
  validate_after_updates: 0
  validate_interval: 1
  valid_subset: valid

distributed_training:
  ddp_backend: legacy_ddp
  distributed_world_size: 4

criterion:
  _name: ctc
  zero_infinity: true

optimization:
  max_update: 20000
  lr: [0.0001]
  sentence_avg: true
  update_freq: [5]

optimizer:
  _name: adam
  adam_betas: (0.9,0.98)
  adam_eps: 1e-08

lr_scheduler:
  _name: tri_stage
  phase_ratio: [0.1, 0.4, 0.5]
  final_lr_scale: 0.05

model:
  _name: wav2vec_ctc
  w2v_path: ???
  apply_mask: true
  mask_prob: 0.75
  mask_channel_prob: 0.25
  mask_channel_length: 64
  layerdrop: 0.1
  activation_dropout: 0.1
  feature_grad_mult: 0.0
  freeze_finetune_updates: 10000
```
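For context, the `audio_finetuning` task with `labels: ltr` expects a wav2vec-style data directory: `{split}.tsv` manifests, matching `{split}.ltr` label files (as produced by libri_labels.py), and a `dict.ltr.txt` target dictionary. Below is a minimal sketch for sanity-checking that layout; the helper function and the path are placeholders, not part of fairseq:

```python
import os

# Hypothetical helper: check that task.data contains the files the
# audio_finetuning task loads when labels=ltr.
def check_finetuning_dir(data_dir, splits=("train", "valid")):
    expected = ["dict.ltr.txt"]  # target dictionary used for the letter targets
    for split in splits:
        expected += [f"{split}.tsv", f"{split}.ltr"]  # audio manifest + letter labels
    for name in expected:
        path = os.path.join(data_dir, name)
        print(("OK      " if os.path.exists(path) else "MISSING ") + path)

check_finetuning_dir("/path/to/finetuning_data10h")  # placeholder for task.data
```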

And for reference, here is the command I run to fine-tune:

```sh
python fairseq_cli/hydra_train.py -m \
    --config-dir examples/wav2vec/config/finetuning \
    --config-name vox_10h_noisyD2Vaudio \
    +trainer.tensorboard_logdir=/h/myusername/fairseq/logs/tb/ \
    task.data=/h/addisonw/fairseq/manifests/finetuning_data10h \
    model.w2v_path=/h/myusername/fairseq/pretrained_models/base_libri.pt \
    common.user_dir=examples/data2vec
```

(`model.w2v_path` points to the pre-trained base model I downloaded.)

When I run this, fine-tuning works and I see train loss and various other metrics logged. My main question is about getting WER metrics. Looking into audio_finetuning.py and the AudioFinetuningConfig, I see that eval_wer is described as being only for Seq2Seq models, and I believe CTC fine-tuning with data2vec would not qualify as that. How did the authors obtain the WER values for their audio experiments?
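For what it's worth, WER for CTC models is typically computed by greedy (argmax) decoding: collapse repeated symbols, drop blanks, join letters into words at the `|` boundary, then take the word-level edit distance against the reference. A minimal sketch of that computation (not the fairseq implementation; assumes `ltr`-style targets and the `editdistance` package):

```python
import editdistance  # pip install editdistance

BLANK = "<blank>"

def ctc_greedy_wer(argmax_tokens, ref_ltr):
    """WER (%) for one utterance from frame-level argmax CTC symbols vs. an .ltr reference.

    argmax_tokens: list of symbols (letters, "|" word boundary, BLANK), one per frame.
    ref_ltr: reference in .ltr form, e.g. "H E L L O | W O R L D |".
    """
    # 1) collapse repeated symbols, 2) drop blanks
    collapsed, prev = [], None
    for tok in argmax_tokens:
        if tok != prev and tok != BLANK:
            collapsed.append(tok)
        prev = tok
    # 3) letters -> words: "|" marks word boundaries in the ltr labels
    hyp_words = "".join(collapsed).replace("|", " ").split()
    ref_words = "".join(ref_ltr.split()).replace("|", " ").split()
    errs = editdistance.eval(hyp_words, ref_words)
    return 100.0 * errs / max(1, len(ref_words))

# Example: a hypothesis that decodes to the reference gives 0.0 WER
print(ctc_greedy_wer(["H", "H", BLANK, "I", "|", BLANK, "|"], "H I |"))
```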

EDIT: I decided to just try adding eval_wer, and it actually works. However, I'm now getting a validation WER of 100 constantly, which suggests a mismatch between the labels and the predictions, i.e. they represent different units. Can @alexeib or another contributor to Data2Vec 2.0 confirm whether the numbers reported in the paper came from fine-tuning with CTC to predict phones or characters?
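One quick way to check whether the targets are characters or phones is to look at the label file and the target dictionary that `task.data` actually points to; a rough sketch (path is a placeholder):

```python
import os

data_dir = "/path/to/finetuning_data10h"  # placeholder for task.data

# With labels=ltr the target units should be single characters plus "|" as the
# word boundary; anything else (e.g. phone symbols) would point to a mismatch.
with open(os.path.join(data_dir, "dict.ltr.txt")) as f:
    print("dictionary symbols:", [line.split()[0] for line in f][:10])

with open(os.path.join(data_dir, "valid.ltr")) as f:
    print("first valid label:", f.readline().strip())
```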
