How can I get WER (train/valid) for the audio_finetuning task with CTC? #5489

Addison-Weatherhead commented Apr 22, 2024

I'm attempting to reproduce some results from the Data2Vec 2.0 paper, specifically the audio task results, using the recommended commands from the Data2Vec 2.0 readme. I've downloaded the data2vec Base model (no fine-tuning) and the Libri-Light 10h data, and run libri_labels.py to obtain labels. The config I'm using for fine-tuning is largely based on the vox_10h.yaml recommended in the readme, with a couple of changes; my full config is below:

```yaml
# @package _group_

common:
  fp16: true
  log_format: json
  log_interval: 50
  log_file: /h/myusername/fairseq/logs/log.json

checkpoint:
  save_interval: 10
  save_interval_updates: 10000
  keep_interval_updates: 1
  no_epoch_checkpoints: true
  best_checkpoint_metric: wer

task:
  _name: audio_finetuning
  data: ???
  normalize: true
  labels: ltr

dataset:
  num_workers: 2
  max_tokens: 1280000
  skip_invalid_size_inputs_valid_test: true
  validate_after_updates: 0
  validate_interval: 1
  valid_subset: valid

distributed_training:
  ddp_backend: legacy_ddp
  distributed_world_size: 4

criterion:
  _name: ctc
  zero_infinity: true

optimization:
  max_update: 20000
  lr: [0.0001]
  sentence_avg: true
  update_freq: [5]

optimizer:
  _name: adam
  adam_betas: (0.9,0.98)
  adam_eps: 1e-08

lr_scheduler:
  _name: tri_stage
  phase_ratio: [0.1, 0.4, 0.5]
  final_lr_scale: 0.05

model:
  _name: wav2vec_ctc
  w2v_path: ???
  apply_mask: true
  mask_prob: 0.75
  mask_channel_prob: 0.25
  mask_channel_length: 64
  layerdrop: 0.1
  activation_dropout: 0.1
  feature_grad_mult: 0.0
  freeze_finetune_updates: 10000
```
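For context, the `audio_finetuning` task with `labels: ltr` expects a wav2vec-style data directory: `{split}.tsv` manifests, matching `{split}.ltr` label files (as produced by libri_labels.py), and a `dict.ltr.txt` target dictionary. Below is a minimal sketch for sanity-checking that layout; the helper function and the path are placeholders, not part of fairseq:

```python
import os

# Hypothetical helper: check that task.data contains the files the
# audio_finetuning task loads when labels=ltr.
def check_finetuning_dir(data_dir, splits=("train", "valid")):
    expected = ["dict.ltr.txt"]  # target dictionary used for the letter targets
    for split in splits:
        expected += [f"{split}.tsv", f"{split}.ltr"]  # audio manifest + letter labels
    for name in expected:
        path = os.path.join(data_dir, name)
        print(("OK      " if os.path.exists(path) else "MISSING ") + path)

check_finetuning_dir("/path/to/finetuning_data10h")  # placeholder for task.data
```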

And for reference, here is the command I run to fine-tune:

```sh
python fairseq_cli/hydra_train.py -m \
    --config-dir examples/wav2vec/config/finetuning \
    --config-name vox_10h_noisyD2Vaudio \
    +trainer.tensorboard_logdir=/h/myusername/fairseq/logs/tb/ \
    task.data=/h/addisonw/fairseq/manifests/finetuning_data10h \
    model.w2v_path=/h/myusername/fairseq/pretrained_models/base_libri.pt \
    common.user_dir=examples/data2vec
```

(`model.w2v_path` points to the pre-trained base model I downloaded.)

When I run this, fine-tuning works and I see train loss and various other metrics logged. My main question is about getting WER metrics. Looking into audio_finetuning.py and the AudioFinetuningConfig, I see that eval_wer is described as being only for Seq2Seq models, and I believe CTC fine-tuning with data2vec would not qualify as that. How did the authors obtain the WER values for their audio experiments?
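For what it's worth, WER for CTC models is typically computed by greedy (argmax) decoding: collapse repeated symbols, drop blanks, join letters into words at the `|` boundary, then take the word-level edit distance against the reference. A minimal sketch of that computation (not the fairseq implementation; assumes `ltr`-style targets and the `editdistance` package):

```python
import editdistance  # pip install editdistance

BLANK = "<blank>"

def ctc_greedy_wer(argmax_tokens, ref_ltr):
    """WER (%) for one utterance from frame-level argmax CTC symbols vs. an .ltr reference.

    argmax_tokens: list of symbols (letters, "|" word boundary, BLANK), one per frame.
    ref_ltr: reference in .ltr form, e.g. "H E L L O | W O R L D |".
    """
    # 1) collapse repeated symbols, 2) drop blanks
    collapsed, prev = [], None
    for tok in argmax_tokens:
        if tok != prev and tok != BLANK:
            collapsed.append(tok)
        prev = tok
    # 3) letters -> words: "|" marks word boundaries in the ltr labels
    hyp_words = "".join(collapsed).replace("|", " ").split()
    ref_words = "".join(ref_ltr.split()).replace("|", " ").split()
    errs = editdistance.eval(hyp_words, ref_words)
    return 100.0 * errs / max(1, len(ref_words))

# Example: a hypothesis that decodes to the reference gives 0.0 WER
print(ctc_greedy_wer(["H", "H", BLANK, "I", "|", BLANK, "|"], "H I |"))
```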

EDIT: I decided to just try adding eval_wer, and it actually works. However, I'm now getting a validation WER of 100 constantly, which suggests a mismatch between the labels and the predictions, i.e. they represent different units. Can @alexeib or another contributor to Data2Vec 2.0 confirm whether the numbers reported in the paper came from fine-tuning with CTC to predict phones or characters?
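One quick way to check whether the targets are characters or phones is to look at the label file and the target dictionary that `task.data` actually points to; a rough sketch (path is a placeholder):

```python
import os

data_dir = "/path/to/finetuning_data10h"  # placeholder for task.data

# With labels=ltr the target units should be single characters plus "|" as the
# word boundary; anything else (e.g. phone symbols) would point to a mismatch.
with open(os.path.join(data_dir, "dict.ltr.txt")) as f:
    print("dictionary symbols:", [line.split()[0] for line in f][:10])

with open(os.path.join(data_dir, "valid.ltr")) as f:
    print("first valid label:", f.readline().strip())
```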
