This is a recipe for training a model on images paired with untranscribed speech, and using this model for semantic keyword spotting. The model and this new task are described in the following publications:
- H. Kamper, G. Shakhnarovich, and K. Livescu, "Semantic speech retrieval with a visually grounded model of untranscribed speech," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 27, no. 1, pp. 89-98, 2019. [arXiv]
- H. Kamper, S. Settle, G. Shakhnarovich, and K. Livescu, "Visually grounded learning of keyword prediction from untranscribed speech," in Proc. Interspeech, 2017. [arXiv]
Please cite these papers if you use the code.
A related recipe is also available, but this one is the most recent.
The semantic labels used here are also available separately in the semantic_flickraudio repository. Here we directly use processed versions of this dataset: all the pickled files in data/ starting with 06-16-23h59 were obtained directly from the semantic annotations.
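These pickled files can be loaded with Python's standard pickle module. A minimal sketch; the dictionary layout shown (utterance IDs mapped to keyword lists) is an assumption for illustration, and the actual files may be structured differently:

```python
import pickle

# Hypothetical structure: utterance ID -> list of semantic keyword labels.
# The real pickled files in data/ may use a different layout.
annotations = {"utt_001": ["dog", "grass"], "utt_002": ["beach", "ocean"]}

# Write with protocol 2, which is readable from Python 2.7.
with open("example_annotations.pkl", "wb") as f:
    pickle.dump(annotations, f, protocol=2)

with open("example_annotations.pkl", "rb") as f:
    loaded = pickle.load(f)

print(loaded["utt_001"])  # -> ['dog', 'grass']
```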
The output of the multilabel visual classifier described below (also see vision_nn_1k/readme.md) can be downloaded directly here. We released these visual tags as part of the JSALT Rosetta project.
The code provided here is not pretty. But I believe research should be reproducible, and I hope that this repository is sufficient to make this possible for the above paper. I provide no guarantees with the code, but please let me know if you have any problems, find bugs or have general comments.
The following datasets need to be obtained:
MSCOCO and Flickr30k are used for training a vision tagging system. The Flickr8k audio and image datasets give paired images with spoken captions; we do not use the labels from either of these. The Flickr8k text corpus is purely for reference. The Flickr8k dataset can also be browsed directly here.
data/ - Contains permanent data (file lists, annotations) that are used elsewhere.
speech_nn/ - Speech systems trained on the Flickr Audio Captions Corpus.
vision_nn_1k/ - Vision systems trained on Flickr30k, MSCOCO and Flickr30k+MSCOCO, but with the vocabulary given by the 1k most common words in Flickr30k+MSCOCO. Evaluation is also only for those 1k words.
Install all the standalone dependencies (below). Then clone the required GitHub repositories into ../src/ as follows:
mkdir ../src/
git clone https://github.com/kamperh/tflego.git ../src/tflego/
Download all the required datasets (above), and then update paths.py to point to the corresponding directories.
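The variable names below are illustrative guesses, not the actual contents of paths.py; check the file in this repository for the names it really defines and set them to your local dataset locations:

```python
# paths.py (sketch): local dataset locations.
# All variable names here are hypothetical; use whatever names the
# actual paths.py in this repository expects.
flickr8k_audio_dir = "/path/to/flickr_audio"
flickr8k_images_dir = "/path/to/Flicker8k_Dataset"
flickr30k_dir = "/path/to/flickr30k"
mscoco_dir = "/path/to/mscoco"
```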
Extract filterbank and MFCC features by running the steps in kaldi_features/readme.md.
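The Kaldi recipe handles the actual feature extraction; purely as a rough illustration of what a log-mel filterbank representation involves (this is not Kaldi's exact pipeline, and the parameter values are common defaults, not necessarily the ones used here), a NumPy sketch:

```python
import numpy as np

def log_mel_filterbank(signal, sr=16000, n_fft=512, frame_len=400,
                       frame_shift=160, n_mels=40):
    """Rough log-mel filterbank sketch (not Kaldi's implementation)."""
    # Frame the signal with a Hamming window: 25 ms frames, 10 ms shift.
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    window = np.hamming(frame_len)
    frames = np.stack([
        signal[i * frame_shift:i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Triangular filters spaced evenly on the mel scale.
    mel_max = 2595 * np.log10(1 + (sr / 2) / 700.0)
    mel_points = np.linspace(0, mel_max, n_mels + 2)
    hz_points = 700 * (10 ** (mel_points / 2595.0) - 1)
    bins = np.floor((n_fft + 1) * hz_points / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):
            fbank[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[m - 1, k] = (right - k) / max(right - centre, 1)
    return np.log(power @ fbank.T + 1e-10)

feats = log_mel_filterbank(np.random.randn(16000))  # 1 s of fake audio
print(feats.shape)  # (98, 40): frames x mel bands
```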
Train the multi-label visual classifier by running the steps in vision_nn_1k/readme.md. Note the final model directory.
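The visual classifier is multi-label (an image can carry several of the 1k tags at once), so it uses a per-tag sigmoid rather than a softmax over the vocabulary. A NumPy sketch of that loss, for illustration only (not the repository's exact implementation):

```python
import numpy as np

def multilabel_loss(logits, targets):
    """Per-tag sigmoid cross-entropy, summed over tags, averaged over batch."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    per_tag = -(targets * np.log(probs + 1e-10)
                + (1 - targets) * np.log(1 - probs + 1e-10))
    return per_tag.sum(axis=1).mean()

# Toy batch of 2 images with a 2-tag vocabulary; targets are multi-hot.
logits = np.array([[2.0, -3.0], [-1.0, 4.0]])
targets = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = multilabel_loss(logits, targets)
print(loss)
```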
Train the various visually grounded speech models by running the steps in speech_nn/readme.md.
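The central idea of these models, as described in the papers above, is that the visual classifier's soft tag probabilities act as targets for a speech network that sees only the audio, so no transcriptions are required. An illustrative NumPy sketch of this training signal (not the repository's TensorFlow implementation; all values here are toy numbers):

```python
import numpy as np

rng = np.random.default_rng(0)

# Soft targets: the visual tagger's sigmoid outputs for an image,
# e.g. high probability for tags matching the picture's content.
vision_probs = np.array([0.9, 0.8, 0.05, 0.1])

# Speech network output for the paired spoken caption (random init here,
# standing in for a real network's sigmoid outputs).
speech_logits = rng.normal(size=4)
speech_probs = 1.0 / (1.0 + np.exp(-speech_logits))

# Cross-entropy against the soft visual targets: this is the only
# supervision; no word transcriptions are used.
loss = -np.mean(vision_probs * np.log(speech_probs + 1e-10)
                + (1 - vision_probs) * np.log(1 - speech_probs + 1e-10))
print(loss)
```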
Standalone packages:
- Python: I used Python 2.7.
- NumPy and SciPy.
- TensorFlow: Required by the tflego repository below. I used TensorFlow v0.10.
- Kaldi: Used for feature extraction.
Repositories from GitHub:
- tflego: A wrapper for building neural networks. Should be cloned into the directory ../src/tflego/.
The code is distributed under the Creative Commons Attribution-ShareAlike license (CC BY-SA 4.0).