This code allows you to reproduce the results presented in the paper "Discrete representations in neural models of spoken language" (https://aclanthology.org/2021.blackboxnlp-1.11).
Clone this repo and cd into it:
git clone https://github.com/bhigy/discrete-repr.git
cd discrete-repr
To create a conda environment with all dependencies (assuming conda is already installed), run:
conda env create -f environment.yml
conda activate discrete-repr
Finally, download the metadata and extract the content of the archive into the current folder.
Instructions to train the models can be found in docs/VISUALLY_SUPERVISED.md for visually-supervised models and in docs/SELF_SUPERVISED.md for self-supervised models.
Alternatively, pretrained models can be obtained from here. For the visually-supervised models, simply extract the content of the archive named experiments.tar.gz under the root of this repository. For the self-supervised models, follow the installation instructions and extract the archive named checkpoints.tar.gz under the root of the bshall-zerospeech folder.
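Assuming the archives were downloaded to the roots of the respective repositories, the extraction steps above can be sketched as follows (run each command from the folder indicated in the comment):

```shell
# From the root of this repository (discrete-repr):
tar -xzf experiments.tar.gz

# From the root of the bshall-zerospeech folder:
tar -xzf checkpoints.tar.gz
```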
Our evaluation relies on the Flickr8K validation set. If you haven't already done so, follow the instructions to download and prepare the data.
The phoneme triplets used in the ABX evaluation need to be generated by running:
python -c "import prepare_flickr8k as F8; F8.prepare_abx(k=1000, overlap=False)"
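For intuition, an ABX trial asks whether the representation of X is closer to A (which belongs to the same phoneme category as X) than to B. The following is a minimal sketch of that decision rule, assuming single-vector representations and cosine distance; the actual evaluation in this repository operates on sequences of frame-wise activations, not single vectors:

```python
import numpy as np

def cos_dist(u, v):
    # Cosine distance between two vectors.
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def abx_correct(a, b, x):
    # One ABX trial: scored correct if x (same category as a)
    # is closer to a than to b.
    return cos_dist(a, x) < cos_dist(b, x)

# Toy example: x points roughly in a's direction, so the trial is correct.
a = np.array([1.0, 0.2])
b = np.array([0.1, 1.0])
x = np.array([0.9, 0.3])
print(abx_correct(a, b, x))  # True
```

The ABX score over a set of triplets is simply the fraction of trials scored correct.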
While the activations from visually-supervised models are extracted automatically during evaluation, extraction of the activations from the self-supervised models is done in a separate step (for technical reasons). Assuming that the self-supervised model's repository was cloned under the same folder as the current repository, this can be done by running:
cd ../bshall-zerospeech
# Preprocessing data
python preprocess_val.py in_dir=~/corpora/flickr8k dataset=flickr8k/english
python preprocess_val.py in_dir=../discrete-repr/data/flickr8k_abx_wav/ dataset=flickr8k/english_triplets
# Extraction
./vq_analyze.sh
cd ../discrete-repr
The 4 subplots of figure 1 (correspondence of codes to phonemes according to the 4 metrics) and figure 2 (recall@10) can be generated by running main.py:
python main.py
The figures are generated under fig/joined_{abx,diag,rsa,vmeasure}.pdf and fig/recall_size.pdf.
Figure 3 (speaker identification) can be generated by running:
python speaker.py
The figure can then be found under fig/speaker_combined.pdf.
Results presented in table 3 (correlation scores) can be generated by running:
python -c "import main; main.metric_correlation()"
The results can then be found in data/metric_correlation.{csv,tex}.
Results presented in table 4 (skew and kurtosis) can be generated by running:
python -c "import main; print(main.sim_var())"
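For reference, skew and kurtosis are the third and fourth standardized moments of a distribution. A quick way to compute them, illustrative only and not the repository's sim_var implementation (which lives in main.py), is via scipy:

```python
import numpy as np
from scipy.stats import skew, kurtosis

# Toy data: values drawn from a normal distribution, for which skew is ~0
# and excess kurtosis (scipy's default, Fisher definition) is ~0.
rng = np.random.default_rng(0)
sims = rng.normal(size=10_000)
print(f"skew: {skew(sims):.3f}, kurtosis: {kurtosis(sims):.3f}")
```

Note that scipy reports excess kurtosis by default (0 for a normal distribution); pass fisher=False for the raw fourth moment (3 for a normal distribution).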