
Multimodal One-Shot Learning of Speech and Images

Overview

This repository contains the full code recipe for building models that can acquire novel concepts from only one paired audio-visual example per class, without receiving any hard labels. These models can then be used to match new continuous speech input to the correct visual instance (e.g. the spoken word "lego" is matched to the visual signal of lego, without receiving any textual labels, and after seeing only a single paired speech-image example of a different lego instance). This is multimodal one-shot learning, a new task which we formalise in the following paper:

  • R. Eloff, H. A. Engelbrecht, H. Kamper, "Multimodal One-Shot Learning of Speech and Images," arXiv preprint arXiv:1811.03875, 2018. [arXiv]

Please cite this paper if you use the code.

Datasets

The following datasets are required for these experiments:

  • TIDigits (spoken digits corpus)
  • Flickr Audio Caption Corpus
  • Flickr8k text corpus

Note that the Flickr8k text corpus is used purely for obtaining train/validation/test splits. The instructions that follow assume that you have obtained these datasets and placed them somewhere sensible (e.g. ../data/tidigits).
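As a rough illustration, one possible layout looks like the following (the directory names here are only an example; any location works as long as the paths passed to the scripts below point at the right datasets):

../data/tidigits        # TIDigits
../data/flickr_audio    # Flickr Audio Caption Corpus
../data/Flickr8k_text   # Flickr8k text corpus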

Pre-requisites

The following steps need to be completed before running the experiment scripts:

  1. Install Docker (I also recommend following the Linux post-install steps to manage Docker as a non-root user)

  2. Install nvidia-docker (version 2.0) for NVIDIA GPU access in Docker containers (a quick GPU sanity check is sketched after the image list below)

  3. Pull required images from Docker Hub:

  • Kaldi (for extracting speech features): docker pull reloff/kaldi:5.4
  • TensorFlow (used as base for the research environment): docker pull reloff/tensorflow-base:1.11.0-py36-cuda90
  • Multimodal one-shot research environment: docker pull reloff/multimodal-one-shot
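To confirm that containers can access the GPU (step 2 above), a quick sanity check along the following lines should work with nvidia-docker 2.0 (the nvidia/cuda:9.0-base image is an assumption here, chosen to match the CUDA 9.0 TensorFlow image above):

docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi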

Alternatively, you can build these images locally from their Dockerfiles.
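As a rough sketch of the local build (the Dockerfile locations below are placeholders, not actual paths in this repository):

docker build -t reloff/kaldi:5.4 -f <path to Kaldi Dockerfile> .
docker build -t reloff/tensorflow-base:1.11.0-py36-cuda90 -f <path to TensorFlow base Dockerfile> .
docker build -t reloff/multimodal-one-shot -f <path to research environment Dockerfile> .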

Kaldi feature extraction

Extract speech features by running ./run_feature_extraction.sh [OPTIONS] (use the --help flag for more information):

./run_feature_extraction.sh \
    --tidigits=<path to TIDigits> \
    --flickr-audio=<path to Flickr audio> \
    --flickr-text=<path to Flickr8k text> \
    --n-cpu-cores=<number of CPU cores>

Replace each path with the full path to the corresponding dataset. The --n-cpu-cores flag sets the number of CPU cores used for feature extraction (default: 8); more cores may speed up the process, so set it according to the cores available on your machine. For example:

./run_feature_extraction.sh \
    --tidigits=/home/rpeloff/datasets/speech/tidigits \
    --flickr-audio=/home/rpeloff/datasets/speech/flickr_audio \
    --flickr-text=/home/rpeloff/datasets/text/Flickr8k_text \
    --n-cpu-cores=8

Train and test multimodal models

The multimodal one-shot models are demonstrated in two separate Jupyter notebooks:

  1. experiments/nb1_unimodal_train_test.ipynb trains and tests unimodal models for one-shot speech or image classification

  2. experiments/nb2_multimodal_test.ipynb extends unimodal models to the multimodal one-shot case, testing on one-shot cross-modal speech-image digit matching

To run these notebooks and reproduce the results in the paper, execute the run_notebooks.sh [OPTIONS] script (use the --help flag for more information),

./run_notebooks.sh --port=8888

and navigate to http://127.0.0.1:8888/. Follow the experiment notebooks and execute the code cells to train, test, and summarise the unimodal and multimodal one-shot models.
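If port 8888 is already in use on your machine, the --port option shown above should let you pick another one (the port number below is arbitrary), e.g.

./run_notebooks.sh --port=8890

and then browse to http://127.0.0.1:8890/ instead.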

Note

All code used for the paper is present in this repo, and the experiment notebooks should reproduce all results. If you find any mistakes in the code or notebooks, please let us know by raising an issue! Also feel free to raise issues if you have general comments! 😄
