
Discrete representations in neural models of spoken language

This code can be used to reproduce the results presented in the paper "Discrete representations in neural models of spoken language" (https://aclanthology.org/2021.blackboxnlp-1.11).

Installation

Clone this repo and cd into it:

git clone https://github.com/bhigy/discrete-repr.git
cd discrete-repr

Assuming conda is already installed, create an environment with all dependencies by running:

conda env create -f environment.yml
conda activate discrete-repr

Finally, download the metadata and extract the content of the archive into the current folder.
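
If the metadata is distributed as a gzipped tar archive, it can be extracted as follows (the archive name metadata.tar.gz is an assumption; use the name of the file you actually downloaded):

import tarfile

# Hypothetical archive name -- replace it with the file actually downloaded.
with tarfile.open("metadata.tar.gz", "r:gz") as archive:
    archive.extractall(".")  # extract into the current folder (the repository root)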

Training the models

Instructions to train the models can be found in docs/VISUALLY_SUPERVISED.md for visually-supervised models and in docs/SELF_SUPERVISED.md for self-supervised models.

Alternatively, pretrained models can be obtained from here. For the visually-supervised models, simply extract the content of the archive named experiments.tar.gz under the root of this repository. For the self-supervised models, follow the installation instructions and extract the archive named checkpoints.tar.gz under the root of the bshall-zerospeech folder.

Evaluation

Dataset

Our evaluation relies on the Flickr8K validation set. If you haven't already done so, follow the instructions to download and prepare the data.

Preparing stimuli for ABX

The phoneme triplets used in the ABX evaluation need to be generated by running:

python -c "import prepare_flickr8k as F8; F8.prepare_abx(k=1000, overlap=False)"

Extracting activations from the self-supervised models

While the activations of the visually-supervised models are extracted automatically during evaluation, the activations of the self-supervised models must, for technical reasons, be extracted in a separate step. Assuming the self-supervised model's repository was cloned under the same parent folder as the current repository, this can be done by running:

cd ../bshall-zerospeech

# Preprocessing data
python preprocess_val.py in_dir=~/corpora/flickr8k dataset=flickr8k/english
python preprocess_val.py in_dir=../discrete-repr/data/flickr8k_abx_wav/ dataset=flickr8k/english_triplets

# Extraction
./vq_analyze.sh

cd ../discrete-repr

Generating results

Figures 1 and 2

The four subplots of Figure 1 (correspondence of codes to phonemes according to the four metrics) and Figure 2 (recall@10) can be generated by running main.py:

python main.py

The figures are generated under fig/joined_{abx,diag,rsa,vmeasure}.pdf and fig/recall_size.pdf.

Figure 3

Figure 3 (speaker identification) can be generated by running:

python speaker.py

The figure can then be found under fig/speaker_combined.pdf.

Table 3

Results presented in Table 3 (correlation scores) can be generated by running:

python -c "import main; main.metric_correlation()"

The results can then be found in data/metric_correlation.{csv,tex}.
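
The CSV output can also be inspected programmatically, for instance with pandas (a minimal sketch; the exact column layout of the file is not described here):

import pandas as pd

# Load the correlation scores generated by main.metric_correlation().
scores = pd.read_csv("data/metric_correlation.csv")
print(scores)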

Table 4

Results presented in Table 4 (skewness and kurtosis) can be generated by running:

python -c "import main; print(main.sim_var())"