WAWEnets Python Inference and Training Code

Implements Wideband Audio Waveform Evaluation networks or WAWEnets.

This WAWEnets implementation produces one or more speech quality or intelligibility values for each input speech signal without using reference speech signals. WAWEnets are convolutional networks and they have been trained using full-reference objective speech quality and speech intelligibility values as well as subjective scores.

The .pt model files in ./wawenets/weights are plain PyTorch model files, suitable for creating new traced JIT files for C++ or ONNX in the future.
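
As a rough sketch of how one of these weights files might be turned into a traced JIT file (the filename and input shape below are assumptions, and if the .pt file holds a state_dict rather than a whole model you would first instantiate the network class and call load_state_dict):

import torch

# Load a plain PyTorch model file (filename is a placeholder).
model = torch.load("wawenets/weights/example_model.pt")
model.eval()

# Trace with a dummy 3-second, 16k samp/sec input; the (batch, channel, samples)
# shape is an assumption and may not match the network's real input layout.
example = torch.zeros(1, 1, 48000)
traced = torch.jit.trace(model, example)
traced.save("wawenet_traced.pt")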

Details can be found in the ICASSP 2020 WAWEnets paper [1] and the follow-up article [6].

If you need to cite our work, please use the following:

@INPROCEEDINGS{9054204,
  author={A. A. {Catellier} and S. D. {Voran}},
  booktitle={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Wawenets: A No-Reference Convolutional Waveform-Based Approach to Estimating Narrowband and Wideband Speech Quality},
  year={2020},
  volume={},
  number={},
  pages={331-335},
}

Inference Setup

In order to run the WAWEnets Python code, some initial setup is required. Please follow the instructions below to prepare your machine and environment.

SoX

SoX is an audio processing library and CLI tool useful for format conversions, padding, and trimming, among other things.

To install SoX on a Debian-based Linux, use the apt package manager:

apt install sox

On macOS the easiest way to install SoX is by using brew. Follow the instructions to install brew, then use brew to install SoX:

brew install sox

In order to install SoX on Windows, follow the instructions on the SoX SourceForge page.

ITU-T Software Tool Library (STL)

The Python WAWEnets implementation relies on ITU-T STL executables to resample audio files and measure speech levels. We use a few STL utilities (actlev, filter, and sv56demo) for functions that are also available in torchaudio because this lets us be reasonably certain that the audio processing steps are identical across all WAWEnets implementations (C++, MATLAB, etc.).

First we must compile the STL executables. To do this, clone the STL repo and then follow the build procedure.

After the build procedure is complete, return to the WAWEnets Python implementation. Create a copy of config.yaml.template named config.yaml:

cp wawenets/config/config.yaml.template wawenets/config/config.yaml

Edit config.yaml to point to the bin dir where the STL tools have been compiled, e.g. /path/to/STL/bin.
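
After editing, config.yaml should contain an entry pointing at that directory. A minimal sketch, assuming the template uses a single path key (the key name below is a placeholder; keep whatever key the template actually defines):

stl_bin_path: /path/to/STL/bin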

Python Conda Environment

One way to install the Python libraries required to run the Python version of WAWEnets is using Anaconda (or Miniconda). Once Anaconda or Miniconda is installed, use the following commands to set up and activate a new conda env:

conda env create -f wenets_env.yaml
conda activate wenets_dist

After the Anaconda environment has been created and activated, execute the following code to install and test the wawenets package:

cd wawenets
poetry install
pytest

Usage

After successfully completing the above steps, it should be possible to run the following command:

python wawenets_cli.py --help

and see its output:

Usage: wawenets_cli.py [OPTIONS]

  the CLI interface Python WAWEnets produces quality or intelligibility
  estimates for specified speech files.

Options:
  -m, --mode INTEGER     specifies a WAWEnet mode, default is 1
  -i, --infile TEXT      either a .wav file or a .txt file where each line
                         specifies a suitable .wav file. if the latter, files
                         will be processed in sequence.  [required]
  -l, --level BOOLEAN    whether or not contents of a given .wav file should
                         be normalized. default is True.
  -s, --stride INTEGER   stride (in samples @16k samp/sec) on which to make
                         predictions. default is 48,000, meaning if a .wav
                         file is longer than 3 seconds, the model will
                         generate a prediction for neighboring 3-second
                         segments.
  -c, --channel INTEGER  specifies a channel to use if .wav file has multiple
                         channels. default is 1 using indices starting at 1
  -o, --output TEXT      path where a CSV file containing predictions should
                         be written. default is None, and results are printed
                         to stdout
  --help                 Show this message and exit.

Arguments

infile is either a .wav file or a .txt file where each line specifies the path to a suitable .wav file. In this second case, the listed .wav files will be processed in sequence. NOTE: when using a .txt file to specify which .wav files to process, the software will always process the first channel of each file.
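
For example, a .txt input file might contain the following (paths are placeholders):

/data/speech/clip_0001.wav
/data/speech/clip_0002.wav
/data/speech/clip_0003.wav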

A suitable .wav file must:

  • be uncompressed
  • have sample rate 8, 16, 24, 32, or 48k smp/sec.
  • contain at least 3 seconds of speech

To best match the designed scope of WAWEnets, the .wav file should have a speech activity factor of roughly 0.5 or greater and an active speech level near 26 dB below the clipping points of +/- 1.0 (see the level normalization feature below). The native sample rate for WAWEnets is 16k smp/sec, so files with rates of 8, 24, 32, or 48k smp/sec are converted internally before processing.
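
As a quick sanity check, file suitability can be verified with the Python standard library; a minimal sketch (the filename is a placeholder, and this checks only rate and duration, not speech activity):

import wave

# Inspect an uncompressed .wav file's sample rate and duration.
with wave.open("speech.wav", "rb") as f:
    rate = f.getframerate()
    duration = f.getnframes() / rate

suitable = rate in (8000, 16000, 24000, 32000, 48000) and duration >= 3.0
print(f"{rate} samp/sec, {duration:.2f} s, suitable: {suitable}")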

-m M specifies a WAWEnet mode. The integer M specifies the WAWEnet trained using a specific full-reference target.

  • -m 1: WAWEnet trained using WB-PESQ [2] target values (Default)
  • -m 2: WAWEnet trained using POLQA [3] target values
  • -m 3: WAWEnet trained using PEMO [4] target values
  • -m 4: WAWEnet trained using STOI [5] target values
  • -m 5: WAWEnet trained using seven objective targets: WB-PESQ, POLQA, STOI, PEMO, ViSQOL3 (c310), ESTOI, and SIIBGauss [6]
  • -m 6: WAWEnet trained using four subjective targets (mos, noi, col, dis) and seven objective targets (WB-PESQ, POLQA, STOI, PEMO, ViSQOL3 (c310), ESTOI, and SIIBGauss) [6]

-l L specifies internal level normalization of .wav file contents to 26 dB below clipping.

  • -l 0: normalization off
  • -l 1: normalization on (Default)

-s S specifies the segment step (stride) and is an integer with value 1 or greater. Default is -s 48000. WAWEnet requires a full 3 seconds of signal to generate a result, so if a .wav file is longer than 3 seconds, multiple results may be produced. S specifies the number of samples to move ahead in the speech file when extracting the next segment. The default value of 48,000 gives zero overlap between segments. With this default, any input shorter than 6 seconds will produce one result, based on just the first 3 seconds, and a 6-second input will produce two results. With -s 24000, for example, segment overlap is 50%: a 4.5-second input produces 2 results and a 6-second input produces 3 results.
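
The number of results follows directly from this arithmetic; a small illustrative helper (not part of the package), assuming 16k samp/sec and a 48,000-sample (3-second) window:

def num_segments(num_samples: int, stride: int = 48000, window: int = 48000) -> int:
    # One result per full window that fits when stepping by `stride`.
    if num_samples < window:
        return 0
    return (num_samples - window) // stride + 1

# 6 s @ 16k samp/sec = 96,000 samples
print(num_segments(96000))                # 2 results with the default stride
print(num_segments(96000, stride=24000))  # 3 results with 50% overlap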

-c C specifies a channel number to use when the input speech is in a multi-channel .wav file. Default is -c 1. NOTE: when using a .txt file to specify which .wav files to process, the software will always process the first channel of each file.

-o 'myFile.csv' specifies a file that captures WAWEnet results on a new line for each speech input processed; per the --output help text above, results are written as a CSV. If the file exists it will be appended to. Default is None, meaning no file is generated and results are printed to stdout.
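
Putting the options together, a typical invocation might look like this (the filenames are placeholders):

python wawenets_cli.py --mode 1 --infile speech.wav --stride 24000 --output results.csv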

Outputs

The output for each of the N speech signals processed is in the format:

[row] [wavfile] [channel] [sample_rate] [duration] [level_normalization] [segment_step_size] [WAWEnet_mode] [segment_number] [start_time] [stop_time] [active_level] [speech_activity] [model_prediction]

where:

  • row is an identifier for the current row of output
  • wavfile is the filename that has been processed
  • channel is the channel of wavfile that has been processed
  • sample_rate is the native sample rate of wavfile
  • duration is the duration of wavfile in seconds
  • level_normalization reflects whether wavfile was normalized during processing
  • segment_step_size reflects the segment step (stride) used to process wavfile
  • WAWEnet_mode is the mode wavfile has been processed with
  • segment_number is a zero-based index that indicates which segment of wavfile was processed
  • start_time is the time in seconds where the current segment begins within wavfile
  • stop_time is the time in seconds where the current segment ends within wavfile
  • active_level is the active speech level of the specified segment of wavfile in dB below overload
  • speech_activity is the speech activity factor of the specified segment of wavfile
  • model_prediction is the output value produced by WAWEnet for the specified segment of wavfile

Internally, pandas is used to generate the text output. If the -o option is specified, pandas generates a CSV and writes it to the given file path.
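
Because the -o output is a CSV, results can be loaded back into pandas for further analysis. A small sketch (the file path is a placeholder; the column names follow the output format above):

import pandas as pd

results = pd.read_csv("results.csv")
print(results[["wavfile", "segment_number", "model_prediction"]])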

Training Setup

Inside wawenets/wawenet_trainer you will find code that trains WAWEnets. Unfortunately, we are not able to share any data to train on, but you can build your own dataset and use this code to train a WAWEnet customized for your application.

Audio Data

WAWEnets accept .wav files that have a sample rate of 16,000 samples/second and are exactly 3 seconds long. Put your .wav files in a specific location, and pass that location to the train.py argument --data_root_path.
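
As a rough sketch of how a training clip could be prepared with torchaudio (this project references torchaudio elsewhere, but this exact recipe is an assumption and the filenames are placeholders):

import torch
import torchaudio

# Load a source clip; torchaudio returns (waveform, sample_rate).
wav, sr = torchaudio.load("raw_clip.wav")
wav = wav[:1]  # keep the first channel only

# Resample to the 16k samp/sec the training code expects.
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)

# Trim or zero-pad to exactly 3 seconds (48,000 samples).
target = 3 * 16000
wav = wav[:, :target]
if wav.shape[1] < target:
    wav = torch.nn.functional.pad(wav, (0, target - wav.shape[1]))

torchaudio.save("train_clip.wav", wav, 16000)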

Audio Metadata/Target Definition

The training code will read either pandas dataframe-style JSON files or CSVs. These dataframes should have the following columns:

  • filename: the name of the file described by this row
  • split: either TRAIN, TEST, VAL, or UNSEEN. For which part of the training process should this file be used?
  • impairment: what speech processing impairment does this file exhibit?
  • datasetLanguage: what language are the talkers in this dataset speaking?
  • [TARGET_NAME]: include any target values you'd like to imitate
  • [FILE_METADATA]: (optional) any metadata you might want to act on later

Any dataframe in this format can be used as the argument --csv_path (see the sketch below).
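
A minimal sketch of building such a dataframe and saving it as a CSV; the filenames, impairment labels, and the wb_pesq target column are all hypothetical placeholders:

import pandas as pd

metadata = pd.DataFrame(
    {
        "filename": ["clip_0001.wav", "clip_0002.wav"],
        "split": ["TRAIN", "VAL"],
        "impairment": ["amr_nb", "babble_noise"],
        "datasetLanguage": ["en", "en"],
        "wb_pesq": [3.72, 2.15],  # hypothetical target values to imitate
    }
)
metadata.to_csv("metadata.csv", index=False)  # pass this path via --csv_path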

Python Conda Environment

One way to install the Python libraries required to run the Python version of WAWEnets is using Anaconda (or Miniconda). Once Anaconda or Miniconda is installed, use the following commands to set up and activate a new conda env:

conda env create -f wenets_train_env.yaml
conda activate wenets_train

After the Anaconda environment has been created and activated, execute the following code to install and test the wawenets package:

cd wawenets
poetry install

Training

The training entrypoint is train.py. It has extensive options, all exposed as command-line arguments:

python train.py [ARGS]

There are preset configurations that will define most of these options for you. Using generic_regime for the --training_regime argument is a good start.

Train your net!

python train.py --training_regime generic_regime --csv_path /path/to/csv --data_root_path /path/to/data

By default, results will be logged to ~/wenets_training_artifacts and they will include dataframe result summaries as well as 2D-histograms showing predictions vs. actual values.


[1] Andrew A. Catellier & Stephen D. Voran, "WAWEnets: A No-Reference Convolutional Waveform-Based Approach to Estimating Narrowband and Wideband Speech Quality," ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 331-335.

[2] ITU-T Recommendation P.862, "Perceptual evaluation of speech quality (PESQ)," Geneva, 2001.

[3] ITU-T Recommendation P.863, "Perceptual objective listening quality analysis," Geneva, 2018.

[4] R. Huber and B. Kollmeier, "PEMO-Q — A new method for objective audio quality assessment using a model of auditory perception," IEEE Trans. ASLP, vol. 14, no. 6, pp. 1902-1911, Nov. 2006.

[5] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Trans. ASLP, vol. 19, no. 7, pp. 2125-2136, Sep. 2011.

[6] Andrew Catellier & Stephen Voran, "Wideband Audio Waveform Evaluation Networks: Efficient, Accurate Estimation of Speech Qualities," arXiv preprint, Jun. 2022.