Interspeech-2020-Perceptimatic

This repository contains the dataset Perceptimatic presented in the paper Perceptimatic: A human speech perception benchmark for unsupervised subword modelling (Juliette Millet and Ewan Dunbar, submitted to Interspeech 2020), along with all the code needed to reproduce the analyses presented in the paper.

General environment required

  • python 3.6/7
  • numpy
  • scipy
  • pandas
  • statsmodels

Dataset

Cleaned stimuli

We provide the cleaned stimuli in the form of onsets and offsets into the 2017 ZeroSpeech one-second French and English stimuli. The onsets, offsets and labels of the cleaned French triphones are in DATA/french/all_aligned_clean_french.csv, and the English ones are in DATA/english/all_aligned_clean_english.csv. The files have the following columns:

index #file onset offset #phone prev-phone next-phone speaker

index is how, together with the language, we refer to each triphone in the rest of the files, #file is the original 2017 ZeroSpeech wav file, onset and offset are the beginning and end of the triphone in seconds, #phone is the centre phone, prev-phone and next-phone are the surrounding phones, and speaker is the reference number of the speaker.
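
For example, a single cleaned triphone can be cut out of its original wav file as follows (a minimal sketch only; it assumes the csv is comma-separated, that #file does not include the .wav extension, and a hypothetical local layout zerospeech2017/french/1s/ for the downloaded ZeroSpeech audio):

import pandas as pd
from scipy.io import wavfile

aligned = pd.read_csv("DATA/french/all_aligned_clean_french.csv")
row = aligned.iloc[0]
# hypothetical path layout: point this at your local copy of the ZeroSpeech wav files
rate, signal = wavfile.read("zerospeech2017/french/1s/" + str(row["#file"]) + ".wav")
start, end = int(row["onset"] * rate), int(row["offset"] * rate)
triphone = signal[start:end]
print(row["index"], row["prev-phone"], row["#phone"], row["next-phone"], len(triphone) / rate)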

We provide the list of triplets used for the human and model experiments in the file DATA/all_triplets.csv:

filename TGT OTH prev_phone next_phone TGT_item OTH_item X_item TGT_first speaker_tgt_oth speaker_x

filename is the name of the file containing the triplet (used for the human experiment); it can be seen as the id of the triplet. If we consider the triplet as A, B and X stimuli, with A and X in the same category, then TGT is the centre phone of A and X, OTH is the centre phone of B, and prev_phone and next_phone are the surrounding phones. TGT_item, OTH_item and X_item refer to the indexes of the stimuli used as A, B and X. TGT_first indicates whether A comes first in the file or not. Note that each set of three extracted triphones appears in four distinct items, corresponding to the orders AB--A (that is, X is another instance of the three-phone sequence A), BA--B, AB--B, and BA--A.
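
For example, the three stimuli of a (French) triplet can be looked up in the cleaned-stimuli table as follows (a minimal sketch, assuming both files are comma-separated and that the *_item values match the index column of the corresponding language's file):

import pandas as pd

triplets = pd.read_csv("DATA/all_triplets.csv")
french = pd.read_csv("DATA/french/all_aligned_clean_french.csv").set_index("index")
trip = triplets.iloc[0]                      # one triplet, assumed here to be a French one
for role in ["TGT_item", "OTH_item", "X_item"]:
    stim = french.loc[trip[role]]
    print(role, stim["prev-phone"], stim["#phone"], stim["next-phone"], stim["speaker"])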

Human test

We give all the code needed to perform the human experiments, and we provide the triplets used on demand (contact juliette.millet@cri-paris.org). For each triplet, the delay between the first and second stimuli is 500 milliseconds, and between the second and third 650 milliseconds, as pilot subjects reported having difficulty recalling the reference stimuli when the delays were exactly equal.

Human results

Human results are in DATA/humans_and_models.csv; this file also contains the delta values for the models we evaluate in the paper. Each line corresponds to a (triplet, participant) pair (a short loading example is given after the column list). The file has the following columns:

  • individual: code of the individual (unique within a language group)
  • language : language group of the participant (French speaking or English speaking)
  • filename: triplet id (see section on the cleaned stimuli)
  • TGT: same as in DATA/all_triplets.csv
  • OTH: same as in DATA/all_triplets.csv
  • prev_phone: same as in DATA/all_triplets.csv
  • next_phone: same as in DATA/all_triplets.csv
  • TGT_item: same as in DATA/all_triplets.csv
  • OTH_item: same as in DATA/all_triplets.csv
  • X_item: same as in DATA/all_triplets.csv
  • TGT_first: same as in DATA/all_triplets.csv (True or False)
  • speaker_tgt_oth: same as in DATA/all_triplets.csv
  • speaker_x: same as in DATA/all_triplets.csv
  • correct_answer: human answer, either -3, -2, -1, 1, 2 or 3. If it is negative, the participant chose the OTH item instead of the (correct) TGT item.
  • binarized_answer: binarized version of correct_answer: -1 if correct_answer < 0, 1 otherwise
  • nb_stimuli: number of triplets the participant has heard so far, this one included (between 1 and ~190)
  • TGT_first_code: 1 if TGT_first is True, 0 otherwise
  • language_code: 1 for French participants, 0 for English participants
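
For example, the overall proportion of correct human answers per participant language group can be read off directly (a minimal pandas sketch, assuming the file is comma-separated):

import pandas as pd

data = pd.read_csv("DATA/humans_and_models.csv")
# proportion of correct answers (binarized_answer == 1) per participant language group
print((data["binarized_answer"] == 1).groupby(data["language"]).mean())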

Analysis code

In this section we describe all the steps to evaluate any model with our methods.

Extracting features from your model

First of all, you need to extract your model's representations of the 2017 ZeroSpeech one-second stimuli. The original wav files can be downloaded here: https://download.zerospeech.com/.

Our evaluation system requires that your system output a vector of feature values for each frame. For each utterance in the set (e.g. s2801a.wav), an ASCII feature file with the same name as the utterance (e.g. s2801a.fea) should be generated, with the following format (separator = ' '):

time1 val1 ... valN
time2 val1 ... valN

example:

0.0125 12.3 428.8 -92.3 0.021 43.23
0.0225 19.0 392.9 -43.1 10.29 40.02

Note

The time is in seconds. It corresponds to the center of the frame of each feature. In this example, there are frames every 10ms and the first frame spans a duration of 25ms starting at the beginning of the file, hence, the first frame is centered at .0125 seconds and the second 10ms later. It is not required that the frames be regularly spaced.
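
A minimal sketch of a writer for this format, assuming 25 ms frames with a 10 ms shift as in the example above (write_fea is a hypothetical helper, not part of this repository):

import numpy as np

def write_fea(features, out_path, frame_shift=0.010, first_center=0.0125):
    # features: (n_frames, n_dims) array; frame i is centred at first_center + i * frame_shift
    times = first_center + frame_shift * np.arange(len(features))
    with open(out_path, "w") as out:
        for t, frame in zip(times, features):
            out.write(" ".join([format(t, ".4f")] + [format(v, ".6f") for v in frame]) + "\n")

write_fea(np.random.randn(98, 39), "M/english/1s/s2801a.fea")  # the target folder must exist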

Extracting delta values from features

Once your features are in the right format, you need to put them in a global folder (called M here), with your English features in M/english/1s/ and your French features in M/french/1s/. Then do:

python script_get_file_distance.py M/ DATA/all_triplets.csv $file_delta.csv$ $distance$ DATA/english/all_aligned_clean_english.csv DATA/french/all_aligned_clean_french.csv False False

If your features are in h5f format, replace the 'False False' at the end with 'True False'.

$file_delta.csv$ is the file created by the script: it contains the delta values for each triplet in DATA/all_triplets.csv.

$distance$ can be 'euclidean', 'kl' or 'cosine': it is the distance you want to use for the DTW. This can be adapted if your representations are not numerical.

The script also prints the ABX error over the Native Perceptimatic dataset (for English and French).
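
For intuition only, here is a minimal sketch of the kind of computation involved: an unnormalised DTW over frame-wise distances, and a delta value taken as DTW(X, OTH) - DTW(X, TGT), the usual ABX convention under which a positive delta means X is closer to the target. script_get_file_distance.py is the reference implementation and may differ in normalisation and other details.

import numpy as np
from scipy.spatial.distance import cdist

def dtw(a, b, metric="cosine"):
    # a, b: (n_frames, n_dims) feature matrices; returns the cumulative cost of the best path
    d = cdist(a, b, metric=metric)
    cost = np.full((len(a) + 1, len(b) + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost[i, j] = d[i - 1, j - 1] + min(cost[i - 1, j],
                                               cost[i, j - 1],
                                               cost[i - 1, j - 1])
    return cost[len(a), len(b)]

def delta(x, tgt, oth):
    # positive when X is closer (under DTW) to the target than to the other stimulus
    return dtw(x, oth) - dtw(x, tgt)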

In order to perform the rest of the analysis easily, you can add your model's delta values to our existing file containing the human results and the delta values of all the models we evaluated. To do that, run:

python concatenate_results.py $file_delta.csv$ $name_model$ DATA/humans_and_models.csv $file_all.csv$

$name_model$ is the name of the new column added to the original file containing the human results. You obtain a file ($file_all.csv$) containing all the data in humans_and_models.csv plus the delta values you computed.

Computing ABX accuracies

Once $file_all.csv$ is created, you can compute the normal ABX accuracies over the Native Perceptimatic dataset:

python compute_results_unweighted.py $file_all.csv$ $name_model$

It will print the ABX accuracy over the Perceptimatic dataset (for French and English).
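
As a rough illustration of what this accuracy amounts to (a sketch only, assuming that a positive delta value means the model favoured the target; compute_results_unweighted.py is the reference implementation and reports French and English separately):

import pandas as pd

data = pd.read_csv("file_all.csv")      # the $file_all.csv$ created above
name_model = "my_model"                 # hypothetical name of your delta column
# one delta value per triplet, then the fraction of triplets with a positive delta
per_triplet = data.drop_duplicates("filename").set_index("filename")[name_model]
print((per_triplet > 0).mean())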

You can also compute the ABX accuracy reweighted by the human results:

python compute_results_weighted_human.py $file_all.csv$ binarized_answer $name_model$

It will print the ABX accuracy reweighted by the human results (for French and English).

Comparing humans' and models' results

In our paper, to compare the human perceptual space with the models' representational spaces, we study how well the delta values obtained from the models' features predict individual human results. To do that, we fit one probit regression per model, using different inputs (see the paper for details); instead of fitting it on all human results at once, we resample them multiple times (in order to obtain confidence intervals on the differences, see the last section of this README). To do the same thing with your new model and compare it to the models we used, you need the $file_all.csv$ created above. Do:

python probit_model_bootstrap.py $file_all.csv$ $file_log.csv$ $nb_it$

$nb_it$ is the number of resamplings you want to perform. $file_log.csv$ will contain one column per model (with an index column at the beginning); each row represents a sample.
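
For intuition, here is a hedged sketch of a single probit fit with statsmodels, predicting whether a participant answered correctly from a model's delta values plus a couple of the nuisance predictors available in the file. See the paper and probit_model_bootstrap.py for the exact specification, which this does not claim to reproduce (my_model is a hypothetical column name):

import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv("file_all.csv")
data["correct"] = (data["binarized_answer"] == 1).astype(int)   # 1 if the human was right
data["delta"] = data["my_model"]                                # hypothetical delta column

result = smf.probit("correct ~ delta + TGT_first_code + language_code", data=data).fit()
print(result.llf / len(data))                                   # average log-likelihood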

To obtain average log-likelihoods along with 95% intervals, do:

python compute_log_interval.py $file_log.csv$ $final_average_log.csv$

$final_average_log.csv$ will contain one row per model, each row with the values: name of the model, average log-likelihood, and the lower and upper bounds of the 95% interval.

To obtain average log-likelihood differences along with 95% intervals, do:

python compute_log_interval.py $file_log.csv$ $final_average_diff_log.csv$

$final_average_diff_log.csv$ will contain one row per pair of models, each row with the values: name of the first model, name of the second model, average log-likelihood difference, and the lower and upper bounds of the 95% interval.

The log-likelihood difference is loglik(first model) - loglik(second model).
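
The same kind of summary can be sketched directly from $file_log.csv$ with pandas (model_A and model_B are hypothetical column names; compute_log_interval.py is the reference implementation):

import pandas as pd

logs = pd.read_csv("file_log.csv", index_col=0)   # one column per model, one row per resample
summary = pd.DataFrame({"mean": logs.mean(),
                        "low_95": logs.quantile(0.025),
                        "high_95": logs.quantile(0.975)})
diff = logs["model_A"] - logs["model_B"]          # hypothetical pair of model columns
print(summary)
print(diff.mean(), diff.quantile(0.025), diff.quantile(0.975))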

Extracting features used in the paper

The delta values obtained for the different models can be found in the file DATA/humans_and_models.csv, one column per model, with the codenames given in the paper. If you want to extract the features yourself in order to recompute the delta values, you can follow these instructions:

ZeroSpeech 2017 models

The features used for the ZS2017 models are the ones submitted to the 2017 ZeroSpeech challenge. They can be found online, via the Zenodo links listed here: https://zerospeech.com/2017/results.html

Here is the list of the models we evaluated and their corresponding Zenodo numbers (we also indicate whether the downloaded features are already .fea, h5f, or need to be modified):

| Model | Zenodo number | Type       | Distance used |
|-------|---------------|------------|---------------|
| S2    | 823695        | fea        | cosine        |
| S1    | 815089        | fea        | cosine        |
| H     | 821246        | fea        | kl            |
| P1    | 819892        | h5f        | cosine        |
| P2    | 820409        | h5f        | cosine        |
| A3    | 823546        | h5f        | cosine        |
| Y1    | 814335        | need modif | cosine        |
| Y2    | 814579        | fea        | cosine        |
| Y3    | 814566        | fea        | cosine        |
| C1    | 822737        | fea        | cosine        |
| C2    | 808915        | need modif | cosine        |

Topline

The topline used is a supervised GMM-HMM model with a bigram language model, trained with a Kaldi recipe. We used exactly the same models as for the 2017 ZeroSpeech challenge. We cannot provide the models themselves (one trained on the 2017 ZeroSpeech French training set, the other on the English training set), but we provide the extracted posteriorgrams on demand (contact juliette.millet@cri-paris.org).

MFCCs

The MFCCs used in the paper were extracted with the Kaldi toolkit, using the default parameters, adding the first and second derivatives for a total of 39 dimensions, and applying mean-variance normalization over a moving 300-millisecond window. We provide the extracted MFCCs on demand (contact juliette.millet@cri-paris.org).
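
For reference, a hedged sketch of what mean-variance normalization over a moving 300 ms window amounts to, assuming a 10 ms frame shift (about 30 frames per window); the features used in the paper were produced by Kaldi, not by this code:

import numpy as np

def moving_cmvn(feats, window_frames=30, eps=1e-8):
    # feats: (n_frames, n_dims) float array; normalise each frame by the statistics
    # of a centred window of roughly window_frames frames around it
    normed = np.empty_like(feats)
    half = window_frames // 2
    for i in range(len(feats)):
        chunk = feats[max(0, i - half):i + half + 1]
        normed[i] = (feats[i] - chunk.mean(axis=0)) / (chunk.std(axis=0) + eps)
    return normed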

Multilingual bottleneck

We used the Shennong package (https://github.com/bootphon/shennong) to extract the multilingual bottleneck features described in [2].

To extract these features you need the Shennong package installed (in addition to the requirements listed above). To extract features from the wav files in a folder F to a folder G, do:

python extract_from_shennong_bottleneck.py F G BabelMulti

DPGMM

We use the Kaldi toolkit to extract MFCCs and apply the same VTLN as in [1] (the VTLN-MFCCs can be provided on demand, contact juliette.millet@cri-paris.org). We then extract the posteriorgrams from the French and English models of [1], following the instructions at https://github.com/geomphon/CogSci-2019-Unsupervised-speech-and-human-perception

Other notes

Detailed results of the resampling

To study how well each model is able to predict human results, we fit, for each model, a probit regression on the human results, using the model's delta values and a set of other parameters as predictors. Instead of fitting it on all human results, we fit it on multiple subsamples (N=13682: for each stimulus, we draw three observations --human binary answers-- without replacement). In the paper, we provide only the mean log-likelihood obtained by this resampling, but here we also present the mean differences between models and indicate whether each difference is significant. The following table contains the means of the differences. The values are in bold if the difference is significant (i.e. if the 95% interval is above zero).

|         | MFCC | P2   | P1   | Y2    | Y1    | C2    | Y3    | C1    | A3    | H     | S1    | Bot   | DP    | S2    |
|---------|------|------|------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| topline | 63.3 | 91.1 | 85.9 | 106.8 | 100.0 | 107.1 | 110.8 | 124.3 | 160.5 | 177.4 | 212.0 | 216.3 | 236.3 | 252.1 |
| MFCC    | -    | 27.8 | 22.5 | 43.5  | 36.7  | 43.8  | 47.5  | 61.0  | 97.2  | 114.1 | 148.7 | 153.0 | 173.0 | 188.8 |
| P2      |      | -    | -5.2 | 15.7  | 8.9   | 16.0  | 19.7  | 33.1  | 69.4  | 86.3  | 120.9 | 125.2 | 145.1 | 160.9 |
| P1      |      |      | -    | 20.9  | 14.1  | 21.2  | 24.9  | 38.3  | 74.6  | 91.5  | 126.1 | 130.4 | 150.4 | 166.2 |
| Y2      |      |      |      | -     | -6.7  | 0.3   | 4.0   | 17.5  | 53.7  | 70.6  | 105.2 | 109.6 | 129.5 | 145.2 |
| Y1      |      |      |      |       | -     | 7.1   | 10.8  | 24.2  | 60.4  | 77.4  | 112.0 | 116.3 | 136.2 | 152.0 |
| C2      |      |      |      |       |       | -     | 3.7   | 17.1  | 53.4  | 70.3  | 104.9 | 109.2 | 129.1 | 144.9 |
| Y3      |      |      |      |       |       |       | -     | 13.4  | 49.7  | 66.5  | 101.2 | 105.5 | 125.4 | 141.2 |
| C1      |      |      |      |       |       |       |       | -     | 36.2  | 53.2  | 87.7  | 92.1  | 112.0 | 127.8 |
| A3      |      |      |      |       |       |       |       |       | -     | 16.8  | 51.5  | 55.8  | 75.7  | 91.5  |
| H       |      |      |      |       |       |       |       |       |       | -     | 34.6  | 38.9  | 58.8  | 74.6  |
| S1      |      |      |      |       |       |       |       |       |       |       | -     | 4.4   | 24.3  | 40.1  |
| Bot     |      |      |      |       |       |       |       |       |       |       |       | -     | 19.9  | 35.7  |
| DP      |      |      |      |       |       |       |       |       |       |       |       |       | -     | 15.8  |
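
For concreteness, a hedged sketch of one resample as described above (drawing three human observations per stimulus without replacement before refitting the probit models):

import pandas as pd

data = pd.read_csv("DATA/humans_and_models.csv")
sample = (data.groupby("filename", group_keys=False)
              .apply(lambda g: g.sample(n=min(3, len(g)), replace=False)))
print(len(sample))   # should be close to the N=13682 used in the paper's setting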

References:

[1] Millet, J., Jurov, N., & Dunbar, E. (2019, July). Comparing unsupervised speech learning directly to human performance in speech perception.

[2] Fer, R., Matějka, P., Grézl, F., Plchot, O., Veselý, K., & Černocký, J. H. (2017). Multilingually trained bottleneck features in spoken language recognition. Computer Speech & Language, 46, 252-267.
