
# Pitch Manipulation to mitigate Gender Bias (ASRU 2023)

Code and models for the paper: "No Pitch Left Behind: Addressing Gender Unbalance in Automatic Speech Recognition Through Pitch Manipulation" accepted at ASRU 2023.

## Models and Outputs

To ensure complete reproducibility, we release the ASR model checkpoints used in our experiments, together with the SentencePiece model, the vocabulary files, the YAML configuration files, and the outputs obtained by each model:

## Data Preprocessing

The data (MuST-C v1, en-es direction) has to be preprocessed with:

```bash
python /path/to/fbk-fairseq/examples/speech_to_text/preprocess_generic.py --data-root /data/to/mustc \
        --save-dir /data/to/mustc/save_folder --wav-dir /data/to/mustc/wav_folder \
        --split train, dev, tst-HE, tst-COMMON --vocab-type bpe --src-lang en --tgt-lang en \
        --task asr --n-mel-bins 80 --store-waveform
```
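
The preprocessing step writes one TSV manifest per split into the `--save-dir`. As a quick sanity check, the manifests can be inspected with plain Python. The column names mentioned in the comment below (`id`, `audio`, `n_frames`, `tgt_text`, `speaker`) reflect the usual fairseq speech-to-text manifest layout and are an assumption, not something this repository guarantees, so treat this as a minimal sketch:

```python
import csv

# Hypothetical path: adjust to the --save-dir used during preprocessing.
manifest = "/data/to/mustc/save_folder/train.tsv"

with open(manifest, newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f, delimiter="\t")
    rows = list(reader)

# Typical fairseq S2T manifests expose columns such as:
# id, audio, n_frames, tgt_text, speaker (names may vary by version).
print("columns:", reader.fieldnames)
print("utterances:", len(rows))
print("example row:", rows[0])
```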

## Training

The following parameters are intended for training on 4 GPUs with 16 GB of VRAM each. The `training_data` and `dev_data` files are TSVs obtained from the preprocessing step, and the `config_file` is one of the YAML files that can be downloaded above.

```bash
python train.py /path/to/data_folder \
        --train-subset training_data --valid-subset dev_data \
        --save-dir /path/to/save_folder \
        --num-workers 5 --max-update 50000 --patience 10 --keep-last-epochs 13 \
        --max-tokens 10000 --adam-betas '(0.9, 0.98)' \
        --user-dir examples/speech_to_text \
        --task speech_to_text_ctc --config-yaml config_file \
        --criterion ctc_multi_loss --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
        --arch conformer \
        --ctc-encoder-layer 8 --ctc-weight 0.5 \
        --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \
        --warmup-updates 25000 \
        --clip-norm 10.0 \
        --seed 1 --update-freq 8 \
        --skip-invalid-size-inputs-valid-test \
        --log-format simple >> /path/to/save_folder/train.log 2> /path/to/save_folder/train.err
```
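
To adapt the recipe to different hardware, note that in fairseq `--max-tokens` is a per-GPU budget per forward pass and `--update-freq` accumulates gradients over that many batches, so the effective batch size also scales with the number of GPUs. A quick back-of-the-envelope check, assuming the 4-GPU setup stated above:

```python
# Effective batch size (tokens/frames per optimizer update) for the recipe
# above; rescale --update-freq accordingly if you train on fewer GPUs.
max_tokens = 10_000   # --max-tokens (per GPU, per forward pass)
update_freq = 8       # --update-freq (gradient accumulation steps)
num_gpus = 4          # assumed hardware setup

effective_tokens_per_update = max_tokens * update_freq * num_gpus
print(effective_tokens_per_update)  # 320000
```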


After training, the checkpoints can be averaged with:

```bash
python /path/to/fbk-fairseq/scripts/average_checkpoints.py \
        --input /path/to/save_folder \
        --num-epoch-checkpoints 5 \
        --checkpoint-upper-bound $(ls /path/to/save_folder | head -n 5 | tail -n 1 | grep -o "[0-9]*") \
        --output /path/to/save_folder/avg5.pt
```

## Inference

Inference can be executed with the following command (setting TEST_DATA to a TSV obtained from the preprocessing and CONFIG_FILE to one of the YAML files provided above):

```bash
python /path/to/fbk-fairseq/fairseq_cli/generate.py /path/to/data_folder \
        --gen-subset $TEST_DATA \
        --user-dir examples/speech_to_text \
        --max-tokens 40000 \
        --config-yaml $CONFIG_FILE \
        --beam 5 \
        --max-source-positions 10000 \
        --max-target-positions 1000 \
        --task speech_to_text_ctc \
        --criterion ctc_multi_loss \
        --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
        --no-repeat-ngram-size 5 \
        --path /path/to/checkpoint > /path/to/output_file
```
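
fairseq's `generate.py` interleaves hypotheses with other logging in the output file: tokenized hypotheses appear on `H-` lines and detokenized ones on `D-` lines, each keyed by the sample index. A minimal sketch for collecting the detokenized hypotheses in corpus order, assuming the standard `D-{index}\t{score}\t{text}` line format (which may differ across fairseq versions):

```python
# Collect detokenized hypotheses ("D-" lines) from a fairseq generate log.
hypotheses = {}
with open("/path/to/output_file", encoding="utf-8") as f:
    for line in f:
        if line.startswith("D-"):
            key, _score, text = line.rstrip("\n").split("\t", 2)
            idx = int(key[2:])  # sample index after the "D-" prefix
            hypotheses[idx] = text

# Restore corpus order before scoring.
ordered = [hypotheses[i] for i in sorted(hypotheses)]
```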

## Evaluation

We use the Python package JiWER to compute the word error rate. Gender-specific evaluations are performed by partitioning the test sets based on the MuST-Speaker resource.
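
As a reference point, a minimal WER computation with JiWER could look as follows. The files `references.txt` and `hypotheses.txt` are hypothetical line-aligned files (one utterance per line); the gender-specific scores are obtained by filtering both lists with the MuST-Speaker annotations before scoring.

```python
import jiwer

# Hypothetical line-aligned files: one reference/hypothesis per utterance.
with open("references.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]
with open("hypotheses.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]

# Corpus-level WER over the whole test set; for gender-specific WER,
# restrict both lists to utterances of one speaker group first.
print("WER:", jiwer.wer(references, hypotheses))
```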

## Citation

```bibtex
@inproceedings{fucci2023pitch,
  title     = {{No Pitch Left Behind: Addressing Gender Unbalance in Automatic Speech Recognition through Pitch Manipulation}},
  author    = {Dennis Fucci and Marco Gaido and Matteo Negri and Mauro Cettolo and Luisa Bentivogli},
  booktitle = {IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  address   = {Taipei, Taiwan},
  month     = dec,
  year      = {2023}
}
```