Word Discovery in Visually Grounded, Self-Supervised Speech Models

This is the official codebase for the paper Word Discovery in Visually Grounded, Self-Supervised Speech Models.

@inproceedings{peng2022word,
  title={Word Discovery in Visually Grounded, Self-Supervised Speech Models},
  author={Peng, Puyuan and Harwath, David},
  booktitle={Interspeech},
  year={2022}
}

Table of Contents

  1. Environment
  2. Apply VG-HuBERT on Speech Segmentation
  3. Speech Segmentation and Word Detection on SpokenCOCO
  4. Apply VG-HuBERT on Buckeye Segmentation and ZeroSpeech2020
  5. Training
  6. Speech-Image Retrieval on SpokenCOCO

1. Environment

It is recommended to create a new conda environment for this project with conda create -n wd python=3.9. The requirement on the Python version is not rigid, as long as you can install the packages listed in ./requirements.txt. The requirements on package versions are not rigid either: the listed versions were tested, but higher/lower versions might also work.

If you want to get the attention weights of the different attention heads (which is required for all word and boundary detection experiments), you need to modify the output of the multi_head_attention_forward function in the PyTorch package, in torch/nn/functional.py. If you installed PyTorch using conda in environment wd, the path of the file should be path_to_conda/envs/wd/lib/python3.9/site-packages/torch/nn/functional.py. Go to the function multi_head_attention_forward and change the output as follows:

    # if need_weights:
    #     # average attention weights over heads
    #     attn_output_weights = attn_output_weights.view(bsz, num_heads, tgt_len, src_len)
    #     return attn_output, attn_output_weights.sum(dim=1) / num_heads
    # else:
    #     return attn_output, None
    attn_output_weights = attn_output_weights.view(bsz, num_heads, tgt_len, src_len)
    return attn_output, attn_output_weights

Simply put, the original code averages attn_output_weights over the attention heads before returning it; we skip that averaging so that the attention weights of each individual head are preserved.
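
If you are not sure which functional.py your wd environment actually uses, you can print the module path from inside Python (a quick check, not part of this repo's scripts):

    # run inside the wd environment to locate the file to edit
    import torch
    import torch.nn.functional as F
    print(torch.__version__)
    print(F.__file__)  # e.g. path_to_conda/envs/wd/lib/python3.9/site-packages/torch/nn/functional.py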

Note that since PyTorch 1.11, multi_head_attention_forward accepts an argument average_attn_weights, which controls whether the returned attention is averaged over heads. However, for minimal code change, we recommend ignoring this argument and changing the code in multi_head_attention_forward as follows:

    # if need_weights:
    #     # optionally average attention weights over heads
    #     attn_output_weights = attn_output_weights.view(bsz, num_heads, tgt_len, src_len)
    #     if average_attn_weights:
    #         attn_output_weights = attn_output_weights.sum(dim=1) / num_heads

    #     if not is_batched:
    #         # squeeze the output if input was unbatched
    #         attn_output = attn_output.squeeze(1)
    #         attn_output_weights = attn_output_weights.squeeze(0)
    #     return attn_output, attn_output_weights
    # else:
    #     if not is_batched:
    #         # squeeze the output if input was unbatched
    #         attn_output = attn_output.squeeze(1)
    #     return attn_output, None
    attn_output_weights = attn_output_weights.view(bsz, num_heads, tgt_len, src_len)
    return attn_output, attn_output_weights
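
For reference, on an unmodified PyTorch 1.11 or newer you can also get per-head weights at call time by passing average_attn_weights=False to nn.MultiheadAttention. The snippet below is only a sanity check of that behavior; this codebase still relies on the source edit above.

    import torch
    import torch.nn as nn

    # PyTorch >= 1.11: per-head attention weights without editing functional.py
    mha = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)
    x = torch.randn(1, 50, 768)
    out, attn = mha(x, x, x, need_weights=True, average_attn_weights=False)
    print(attn.shape)  # torch.Size([1, 12, 50, 50]), i.e. one T x T weight matrix per head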

2. Apply VG-HuBERT on Speech Segmentation

To make it easy to quickly apply VG-HuBERT to speech segmentation, we provide the following standalone script. You need to provide four arguments to make it run:

  1. model_path. It should be the directory where the .pth and args.pkl files are located. We provide two trained models, VG-HuBERT_3 and VG-HuBERT_4. VG-HuBERT_3 should perform better on speech segmentation. Please untar the file after downloading.

  2. wav_file. The speech file you want to segment. We recommend keeping the input between 1 and 8 seconds, although in our experience the segmentation performance of VG-HuBERT is robust to the length of the input. The file should be readable by SoundFile, e.g. .wav, .flac, etc.

  3. tgt_layer and threshold. The former is the layer whose attention you want to use; the latter is the threshold on the attention weights (applied as a per-head quantile in the script below), below which weights are discarded. For VG-HuBERT_3, we recommend (tgt_layer, threshold) = (9, 0.7) and for VG-HuBERT_4, we recommend (tgt_layer, threshold) = (9, 0.8). Note that layers are 0-indexed, i.e. layer numbering starts from 0.

model_path = # TODO
wav_file = # TODO
tgt_layer = # TODO
threshold = # TODO

import os
import pickle
import torch
import soundfile as sf
from models import audio_encoder
from itertools import groupby
from operator import itemgetter

def cls_attn_seg(cls_attn_weights, threshold, spf, audio_len_in_sec):
    # keep, for each head, the frames whose CLS attention is above the per-head quantile threshold
    threshold_value = torch.quantile(cls_attn_weights, threshold, dim=-1, keepdim=True) # [n_h, T]
    boundary_idx = torch.where((cls_attn_weights >= threshold_value).float().sum(0) > 0)[0].cpu().numpy()
    attn_boundary_intervals = []
    word_boundary_intervals = []
    # group consecutive frame indices into attention segments
    for k, g in groupby(enumerate(boundary_idx), lambda ix : ix[0] - ix[1]):
        seg = list(map(itemgetter(1), g))
        t_s, t_e = seg[0], min(seg[-1]+1, cls_attn_weights.shape[-1])
        if len(seg) > 1:
            attn_boundary_intervals.append([spf*t_s, spf*t_e])
    # take the mid point of adjacent attn boundaries as the predicted word boundaries
    word_boundaries_list = [attn_boundary_intervals[0][0]/2] # first boundary
    for left, right in zip(attn_boundary_intervals[:-1], attn_boundary_intervals[1:]):
        word_boundaries_list.append((left[1]+right[0])/2.)
    word_boundaries_list.append((attn_boundary_intervals[-1][1]+audio_len_in_sec)/2) # last boundary
    for i in range(len(word_boundaries_list)-1):
        word_boundary_intervals.append([word_boundaries_list[i], word_boundaries_list[i+1]])
    return {"attn_boundary_intervals": attn_boundary_intervals, "word_boundary_intervals": word_boundary_intervals}

# setup model
with open(os.path.join(model_path, "args.pkl"), "rb") as f:
    model_args = pickle.load(f)
model = audio_encoder.AudioEncoder(model_args)
bundle = torch.load(os.path.join(model_path, "best_bundle.pth"))
model.carefully_load_state_dict(bundle['dual_encoder'], load_all=True)
model.eval()
model = model.cuda()

# load waveform (do not layer normalize the waveform!)
audio, sr = sf.read(wav_file, dtype = 'float32')
assert sr == 16000
audio_len_in_sec = len(audio) / sr
audio = torch.from_numpy(audio).unsqueeze(0).cuda() # [T] -> [1, T]

# model forward
with torch.no_grad():
    model_out = model(audio, padding_mask=None, mask=False, need_attention_weights=True, tgt_layer=tgt_layer)
feats = model_out['features'].squeeze(0)[1:] # [1, T+1, D] -> [T, D]
spf = audio.shape[-1]/sr/feats.shape[-2]
attn_weights = model_out['attn_weights'].squeeze(0) # [1, num_heads, T+1, T+1] -> [num_heads, T+1, T+1] (for the two T+1, first is target length then the source)
cls_attn_weights = attn_weights[:, 0, 1:] # [num_heads, T+1, T+1] -> [num_heads, T]
out = cls_attn_seg(cls_attn_weights, threshold, spf, audio_len_in_sec) # out contains attn boundaries and word boundaries in intervals
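
The returned dictionary can be used directly, e.g. to print the predicted word segments (a small usage sketch):

# print the predicted word segments in seconds
for i, (s, e) in enumerate(out["word_boundary_intervals"]):
    print(f"word segment {i}: {s:.2f}s - {e:.2f}s")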

3. Speech Segmentation and Word Detection on SpokenCOCO

This section illustrates how to apply the VG-HuBERT model to segment speech and detect words in SpokenCOCO. Please first download the SpokenCOCO audio and the MSCOCO images:

coco_root=/path/to/coco/
wget https://data.csail.mit.edu/placesaudio/SpokenCOCO.tar.gz -P ${coco_root} # 64G
wget http://images.cocodataset.org/zips/train2014.zip -P ${coco_root}
wget http://images.cocodataset.org/zips/val2014.zip -P ${coco_root}

Please untar/unzip the compressed files after downloading them.

Then download the Karpathy split json files with word alignments:

wget https://www.cs.utexas.edu/~harwath/model_checkpoints/vg_hubert/karpathy_json.tar -P ${coco_root}/SpokenCOCO/

Please also untar it.

Then you are all set, just run

cd ./scripts
bash run_spokencoco.sh vg-hubert_3 9 4096 0.7 max clsAttn 1 

After this, you should find vg-hubert_3_spokencoco_max_0.7_9_clsAttn.log in ./logs/ with the following content:

AScore: 0.6591
IoU: 0.6151
IoT: 0.6603
Percentage that the segment falls within the word interval: 65.79
Average distance (in seconds) between segment center and word center: 0.0647
Percentage that word centers fall into the code segment: 85.47%
code coverage (average over all *words*): 0.7098
code coverage (average over all *word types*): 0.9623
coverage per sentence: 0.7095
boundary precision: 0.3619
boundary recall: 0.2721
boundary F1: 0.3107
boundary over-segmentation: -0.2480
boundary R value: 0.4459
number of total codes occurance: 179179
purity (using most occured word as code assignment): 0.759
purity (using highest f1 word as code assignment): 0.681
nmi: 0.115
"a big table on discovered words, you can comment the use of `print_code_stats_by_f1` on around line 857 to disable that"
1014 / 6085 words with an F1 score >= 0.5
avg F1 = 89.30% for top 250 words; 23.81% for all 6001 words
done
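
As a sanity check, the boundary R value in the log can be reproduced from the reported precision and recall with the standard R-value formula (this snippet is only an illustration, not the repo's evaluation code):

import math

# boundary metrics copied from the log above
precision, recall = 0.3619, 0.2721
over_seg = recall / precision - 1                      # over-segmentation
r1 = math.sqrt((1 - recall) ** 2 + over_seg ** 2)
r2 = (-over_seg + recall - 1) / math.sqrt(2)
r_value = 1 - (abs(r1) + abs(r2)) / 2
print(f"OS = {over_seg:.4f}, R-value = {r_value:.4f}")  # matches the log up to rounding: -0.2480, 0.4459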

Note that we use different layers and thresholds to get the best Area group, Boundary group, and Word group results (for the definitions of the metrics in these groups, please refer to the paper). The hyperparameters and results on the SpokenCOCO test set for VG-HuBERT, VG-HuBERT_3, and VG-HuBERT_4 are:

Area Group

| Model | threshold | layer | A-score | WC | IoU | IoT | CD (s) |
|---|---|---|---|---|---|---|---|
| VG-HuBERT | 0.9 | 9 | 54.97 | 73.16 | 44.02 | 47.25 | 0.0890 |
| VG-HuBERT_3 | 0.7 | 9 | 65.77 | 70.94 | 61.29 | 65.85 | 0.0646 |
| VG-HuBERT_4 | 0.6 | 10 | 63.97 | 73.97 | 56.35 | 60.97 | 0.0789 |

Boundary Group

| Model | threshold | layer | precision | recall | F1 | OS | R-val |
|---|---|---|---|---|---|---|---|
| VG-HuBERT | 0.9 | 9 | 18.31 | 18.90 | 18.60 | 0.0326 | 0.2960 |
| VG-HuBERT_3 | 0.7 | 9 | 35.90 | 27.03 | 30.84 | -0.2472 | 0.4442 |
| VG-HuBERT_4 | 0.8 | 9 | 28.39 | 25.64 | 26.94 | -0.0970 | 0.3964 |

Word Group

| Model | threshold | layer | WD | WD std | purity | Pur. std |
|---|---|---|---|---|---|---|
| VG-HuBERT | 0.93 | 9 | 823 | 20 | 63.2 | 0.4 |
| VG-HuBERT_3 | 0.95 | 10 | 1167 | 26 | 75.3 | 0.2 |
| VG-HuBERT_4 | 0.99 | 9 | 1230 | 18 | 75.2 | 0.2 |

4. Apply VG-HuBERT on Buckeye Segmentation and ZeroSpeech2020

This section illustrates how to evaluate VG-HuBERT on Buckeye and ZeroSpeech2020 (i.e. to get the results in Tables 4 and 5 of our word discovery paper).

4.1 Buckeye

The Buckeye evaluation largely relies on Herman's repo, which accompanies his recent paper. Thanks to Herman for publishing the paper and codebase at the same time!

First of all, please create a folder for all the data you will download later; let's name it /Buckeye/. Please follow the original website to obtain the Buckeye dataset and put all s* folders in /Buckeye/. After that, download the VAD files from Herman's repo and put them in the /Buckeye/vad/ folder. Lastly, download the ground truth word alignment files from this link, extract all three folders, and put them in /Buckeye/buckeye_segment_alignment.

After the downloads are done, change model_root and data_root in ./scripts/run_buckeye.sh, then

cd scripts
bash run_buckeye.sh vg-hubert_3 9 0.7 mean clsAttn buckeyetest

You should get:

---------------------------------------------------------------------------
Word boundaries:
Precision: 36.19%
Recall: 32.22%
F-score: 34.09%
OS: -10.98%
R-value: 45.58%
---------------------------------------------------------------------------
Word token boundaries:
Precision: 25.07%
Recall: 23.17%
F-score: 24.08%
OS: -7.58%

Please note that I previously posted the results below, which are wrong because I set the tolerance to 30ms by mistake. Thanks to Chung-Ming for pointing this out:

---------------------------------------------------------------------------
Word boundaries:
Precision: 47.56%
Recall: 42.34%
F-score: 44.80%
OS: -10.98%
R-value: 54.15%
----------------------------
Word token boundaries:
Precision: 32.26%
Recall: 29.81%
F-score: 30.99%
OS: -7.58%
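
To make the role of the tolerance concrete, the sketch below scores hypothesized boundaries against reference boundaries with a configurable tolerance. It is only an illustration (the function and the toy numbers are made up here); the results above come from Herman's evaluation code.

import numpy as np

# toy tolerance-based boundary scoring (times in seconds); a boundary counts as a
# hit if it falls within tol of a boundary on the other side
def boundary_scores(ref, hyp, tol=0.02):
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    hits_p = sum(np.any(np.abs(ref - h) <= tol) for h in hyp)
    hits_r = sum(np.any(np.abs(hyp - r) <= tol) for r in ref)
    precision, recall = hits_p / max(len(hyp), 1), hits_r / max(len(ref), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return precision, recall, f1

ref = [0.31, 0.62, 1.05]   # made-up ground-truth word boundaries
hyp = [0.30, 0.645, 1.20]  # made-up predicted boundaries
print(boundary_scores(ref, hyp, tol=0.02))  # only 0.30 is a hit at 20 ms
print(boundary_scores(ref, hyp, tol=0.03))  # 0.645 also counts at 30 ms, inflating the scores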

4.2 ZeroSpeech2020

We'll do the ZS2020 Spoken Term Discovery track (just English).

First follow the ZeroSpeech 2020 section on the ZeroSpeech website to download the data and ground truth labels (remember to also download 2017_vads.zip). This should be free and easy, like the rest of the steps :). Suppose you have put the 2020 folder at /zs20/. Remember to put ENGLISH_VAD.csv from 2017_vads.zip in the 2020 folder, i.e. at /zs20/2020/ENGLISH_VAD.csv.

Then install the ZeroSpeech 2020 evaluation toolkit following the official repo. We assume you have cloned the repo at ~/zerospeech2020.

Now you should be ready to test the models on this task. Similarly, change model_root and data_root in ./scripts/run_zs20.sh to the parent folders of your model folder and data folder (data_root should be /zs20 if you followed the above).

Then run

cd ./scripts
bash run_zs20.sh vg-hubert_3 9 16384 0.7 max clsAttn 1

The above run takes a long time, so it is better to run it as an sbatch job or in a tmux window. After the job finishes, you should be able to find the output file ~/zerospeech2020/english_vg-hubert_3_max_0.7_9_clsAttn_16384.json with content:

{
    "2017-track2": {
        "english": {
            "scores": {
                "ned": 0.42968707380581783,
                "coverage": 0.9552742717879052,
                "words": 93769
            },
            "details": {
                "boundary_precision": 0.4477182844906101,
                "boundary_recall": 0.5690987635775963,
                "boundary_fscore": 0.5011637494055812,
                "grouping_precision": "NA",
                "grouping_recall": "NA",
                "grouping_fscore": "NA",
                "token_precision": 0.1728015507130922,
                "type_precision": 0.06441361217459927,
                "token_recall": 0.16639803706534623,
                "type_recall": 0.28718143780905286,
                "token_fscore": 0.16953935014383406,
                "type_fscore": 0.1052255642372453,
                "words": 93769,
                "coverage": 0.9552742717879052,
                "ned": 0.42968707380581783,
                "pairs": 6122486
            }
        }
    }
}
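
If you just want the headline numbers, they can be read out of this file programmatically (a small sketch; adjust the path to wherever your output json lives):

import json

# read the evaluation output shown above and print the headline metrics
with open("english_vg-hubert_3_max_0.7_9_clsAttn_16384.json") as f:
    result = json.load(f)

scores = result["2017-track2"]["english"]["scores"]
print(f"NED: {scores['ned']:.4f}  coverage: {scores['coverage']:.4f}  words: {scores['words']}")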

5. Training

Feel free to skip this section if you don't want to train the model.

If you do want to train a VG-HuBERT model, please first download the weights of pretrained HuBERT and DINO ViT. We use HuBERT Base trained on LibriSpeech 960h unlabeled speech and DINO ViT-Small 8x8.

We can download the weights via

pretrained_root=/path/to/pretrained/hubertAndDINO
wget https://dl.fbaipublicfiles.com/hubert/hubert_base_ls960.pt -P ${pretrained_root}
wget https://dl.fbaipublicfiles.com/dino/dino_deitsmall8_pretrain/dino_deitsmall8_pretrain_full_checkpoint.pth -P ${pretrained_root}

Then, please follow Section 3 to download the SpokenCOCO dataset (don't forget the json files).

After that, change pretrained_root, model_root, and data_root to the desired directories, and you are good to go:

cd ./scripts
bash training.sh

If you want to train the model on Places, please follow

# Images
# please follow http://places.csail.mit.edu/downloadData.html

# spoken captions (85G)
places_root=${data_root}/places
wget https://data.csail.mit.edu/placesaudio/placesaudio_2020_splits.tar.gz -P ${places_root}
cd ${places_root}
tar -xf placesaudio_2020_splits.tar.gz

6. Speech-Image Retrieval on SpokenCOCO

This section tests VG-HuBERT_x on the validation set of SpokenCOCO. Make sure model_root and data_root point to the right directories in ./scripts/validate.sh.

run

cd ./scripts
bash validate.sh

If you don't change other settings in validate.sh, the above runs VG-HuBERT_3 on the val split of SpokenCOCO.
