Each of the four metrics is briefly described at the beginning of its section. Note that some approximations are made in favour of clarity; please refer to the Zero Resource Speech paper for more details.

Checkpoints of the baseline can be obtained by running:

```bash
curl https://download.zerospeech.com/2021/baseline_checkpoints_VG.zip | jar xv
```

## ABX error rate: Phoneme discriminability

In this task, the model receives triplets (A, B, X) of triphones, where A and X are two different occurrences of the same triphone, and B differs from A/X only in its central phone (e.g. /big/ vs /bug/). Under a distance function d, we expect d(A, X) < d(B, X), since A and X correspond to the same triphone. The metric is computed under two conditions: within speakers, when the three triphones are pronounced by the same speaker, and across speakers, when they are pronounced by different speakers. Lower is better.
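
As an illustration, here is a minimal sketch of a single ABX decision. It mean-pools frame-level features before taking a cosine distance; the official evaluation instead averages an angular distance along a DTW alignment path, so this is a simplification, and all names are hypothetical.

```python
import numpy as np

def cosine_distance(u, v):
    # 1 - cosine similarity between two pooled embeddings
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def abx_error(a_frames, b_frames, x_frames):
    """Return 1 if B is judged closer to X than A is (an error), else 0.

    Each *_frames argument is an (n_frames, dim) array of representations
    for one triphone occurrence.
    """
    a, b, x = (f.mean(axis=0) for f in (a_frames, b_frames, x_frames))
    # A and X are occurrences of the same triphone, so we expect
    # d(A, X) < d(B, X); the ABX error rate averages this over triplets.
    return int(cosine_distance(a, x) >= cosine_distance(b, x))
```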

LibriSpeech dev (ABX error rate, %):

| Model | Layer | dev-clean within | dev-clean across | dev-other within | dev-other across |
| --- | --- | --- | --- | --- | --- |
| MFCCs+VG | rnn1 | 8.70 | 10.69 | 9.86 | 14.71 |
| CPC small+VG | rnn1 | 5.36 | 6.68 | 7.41 | 11.30 |

Layer refers to the layer used to extract representations ('rnn1' corresponds to the second recurrent layer).

## sSIMI: Semantic similarity with human judgments

In this task, the model receives two words, say /abduct/ and /kidnap/. The distance between the embeddings of these two words is computed. A Spearman's rank correlation coefficient is then computed between these distances and human semantic similarity judgments. Higher is better.
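
A minimal sketch of the scoring step, assuming a hypothetical `embed(word)` function that returns a pooled representation for a spoken word:

```python
from scipy.spatial.distance import cosine
from scipy.stats import spearmanr

def ssimi_score(word_pairs, human_judgments, embed):
    """Spearman correlation between model similarities and human judgments."""
    # Convert the cosine distance between the two embeddings into a similarity.
    model_sims = [1.0 - cosine(embed(w1), embed(w2)) for w1, w2 in word_pairs]
    rho, _ = spearmanr(model_sims, human_judgments)
    return rho  # higher is better
```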

| Model | Layer | sSIMI dev (librispeech) | sSIMI dev (synthetic) |
| --- | --- | --- | --- |
| MFCCs+VG | att | 13.0894 | 9.4661 |
| CPC small+VG | att | 11.8885 | 6.3074 |

Layer refers to the layer used to extract representations ('att' corresponds to the attention layer).

## sBLIMP: Syntax acceptability judgment

In this task, the model receives two sentences, one of which is syntactically wrong, say /dogs eat meat/ vs /dogs eats meat/. The model is asked to return a pseudo-probability for each of these sentences. The pseudo-probability of the syntactically correct sentence is expected to be higher than that of the syntactically wrong one. Note that the hyperparameters used to extract the pseudo-probabilities from the model can greatly impact performance. Higher is better.
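
For illustration, here is a hedged sketch of masked-LM pseudo-probability scoring. It assumes a HuggingFace-style BERT over discrete speech units and masks one position at a time; the baseline's actual extraction additionally uses the decoding span size M_d and temporal sliding size Delta_t reported in the table below.

```python
import torch
import torch.nn.functional as F

def pseudo_log_prob(model, units, mask_id):
    """Sum of log P(unit_i | rest) with each position masked in turn.

    `units` is a list of discrete unit ids; `model` is assumed to be a
    HuggingFace-style masked LM returning an object with `.logits`.
    """
    total = 0.0
    for i, unit in enumerate(units):
        masked = list(units)
        masked[i] = mask_id  # hide position i and ask the model to predict it
        with torch.no_grad():
            logits = model(torch.tensor([masked])).logits  # (1, T, vocab)
        total += F.log_softmax(logits[0, i], dim=-1)[unit].item()
    return total

# The syntactically correct sequence should score higher, e.g.:
# pseudo_log_prob(bert, units_good, MASK_ID) > pseudo_log_prob(bert, units_bad, MASK_ID)
```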

| Model | K | M_d | Delta_t | sBLIMP dev |
| --- | --- | --- | --- | --- |
| MFCCs+VG+KMEANS+BERT small | 50 | 10 | 1 | 53.19 |
| CPC small+VG+KMEANS+BERT large | 500 | 10 | 1 | 54.68 |

K refers to the number of clusters used in K-means. M_d and Delta_t are, respectively, the decoding span size and the temporal sliding size used to extract pseudo-probabilities.
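
To make the pipeline concrete, here is a minimal sketch of the discretization step that turns continuous VG/CPC features into the unit sequences BERT is trained on. The variable names and the random stand-in data are hypothetical; the baseline provides its own clustering scripts.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-ins for real (n_frames, dim) CPC small + VG representations.
rng = np.random.default_rng(0)
train_features = rng.standard_normal((10_000, 256))
utterance_features = rng.standard_normal((120, 256))

K = 500  # number of clusters, as in the "CPC small+VG+KMEANS+BERT large" row
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(train_features)

# Each frame is replaced by its cluster index; the resulting sequence of
# discrete units is the "pseudo-text" that BERT models.
units = kmeans.predict(utterance_features)  # shape: (120,)
```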

## sWUGGY: Spot-the-word task

In this task, the model receives a word and a non-word, say /brick/ and /blick/. It is asked to return a pseudo-probability for each of these. The pseudo-probability of the word is expected to be higher than that of the non-word. Higher is better.
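
The decision rule is the same pseudo-probability comparison as for sBLIMP; reusing the hypothetical `pseudo_log_prob` sketch above, with `units_brick` and `units_blick` standing for the discretized unit sequences of each item:

```python
# Spot-the-word decision: the real word should get the higher
# pseudo-probability (all names reused from the sketches above).
correct = pseudo_log_prob(bert, units_brick, MASK_ID) > pseudo_log_prob(bert, units_blick, MASK_ID)
```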

| Model | K | M_d | Delta_t | sWUGGY dev |
| --- | --- | --- | --- | --- |
| MFCCs+VG+KMEANS+BERT small | 50 | 10 | 1 | 52.53 |
| CPC small+VG+KMEANS+BERT large | 500 | 10 | 1 | 67.16 |

K refers to the number of clusters used in K-means. M_d and Delta_t are, respectively, the decoding span size and the temporal sliding size used to extract pseudo-probabilities.

## Test metrics

| Model | ABX test-clean within | ABX test-clean across | ABX test-other within | ABX test-other across | sSIMI librispeech | sSIMI synthetic | sWUGGY | sBLIMP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MFCCs+VG+KMEANS+BERT small | 8.39 | 10.59 | 10.66 | 15.03 | -0.10 | 9.99 | 52.86 | 53.02 |
| CPC small+VG+KMEANS+BERT large | 5.36 | 6.71 | 7.35 | 11.92 | 0.16 | 9.71 | 67.20 | 54.53 |

## Findings

Overall, using the visual modality to learn speech representations in an unsupervised way seems on par with audio-only models.

- The ABX error rate obtained by CPC small is further improved by the VG model (high-budget baseline): it drops from 6.24% to 5.36% (LibriSpeech dev-clean, within speakers).

- The clearest gain is on the semantic similarity task, where our best VG model reaches a score of 13.09 compared to 8.72 for the audio-only baseline (computed on the sSIMI LibriSpeech dev set). However, the sSIMI score decreases on the test set, probably because the dev and test sets for this metric were not drawn from the same distribution, and because the test metric is averaged across multiple datasets.

- No improvement is observed on the syntax acceptability judgment task (sBLIMP). A small decrease is observed on the spot-the-word task: 70.69% for the audio-only baseline compared to 67.16% for the multimodal baseline.