DeepRNN/visual_question_answering

Introduction

This neural system for visual question answering is roughly based on the paper "Dynamic Memory Networks for Visual and Textual Question Answering" by Xiong et al. (ICML 2016). The input is an image and a question about the image, and the output is a one-word answer to the question. The model uses a convolutional neural network to extract visual features from the image and a bi-directional GRU recurrent neural network to fuse these regional features. Meanwhile, it encodes the question with either a GRU recurrent neural network or a positional encoding scheme. It then uses a dynamic memory network with an attention mechanism to generate the answer from this information. The project is implemented with the Tensorflow library and allows end-to-end training of both the CNN and RNN parts.
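
To make the data flow concrete, here is a minimal, simplified sketch of this kind of model written with the Tensorflow Keras API. It is not the repository's actual code: the layer sizes, vocabulary size, number of memory hops, and the plain soft-attention episode used here are illustrative assumptions (the paper uses a more elaborate attention-based GRU for the episodic memory).

import tensorflow as tf

# Hypothetical sizes; the repo's real values live in config.py.
NUM_REGIONS = 196    # e.g. a 14x14 CNN feature map flattened into regions (assumption)
FEATURE_DIM = 512    # CNN channel dimension (assumption)
HIDDEN_DIM = 512     # GRU / memory size (assumption)
VOCAB_SIZE = 10000   # question vocabulary size (assumption)
NUM_ANSWERS = 1000   # candidate one-word answers (assumption)
MEMORY_HOPS = 3      # number of episodic-memory updates (assumption)

# Modules.
fuse_facts = tf.keras.layers.Bidirectional(
    tf.keras.layers.GRU(HIDDEN_DIM, return_sequences=True), merge_mode="sum")
embed = tf.keras.layers.Embedding(VOCAB_SIZE, HIDDEN_DIM)
encode_question = tf.keras.layers.GRU(HIDDEN_DIM)
score_fact = tf.keras.layers.Dense(1)                         # attention score per region
update_memory = tf.keras.layers.Dense(HIDDEN_DIM, activation="relu")
predict_answer = tf.keras.layers.Dense(NUM_ANSWERS)           # logits over candidate answers

# Dummy batch: regional CNN features and a tokenized question.
image_feats = tf.random.normal([2, NUM_REGIONS, FEATURE_DIM])
question = tf.random.uniform([2, 12], maxval=VOCAB_SIZE, dtype=tf.int32)

# Input module: fuse the regional CNN features into "facts" with a bi-directional GRU.
facts = fuse_facts(image_feats)                                # (batch, regions, hidden)

# Question module: encode the question with a GRU.
q = encode_question(embed(question))                           # (batch, hidden)

# Episodic memory module: repeatedly attend over the facts and update the memory.
memory = q
for _ in range(MEMORY_HOPS):
    q_tiled = tf.tile(q[:, None, :], [1, NUM_REGIONS, 1])
    m_tiled = tf.tile(memory[:, None, :], [1, NUM_REGIONS, 1])
    scores = score_fact(tf.concat([facts, q_tiled, m_tiled], axis=-1))  # (batch, regions, 1)
    attention = tf.nn.softmax(scores, axis=1)
    episode = tf.reduce_sum(attention * facts, axis=1)         # soft-attention summary of the facts
    memory = update_memory(tf.concat([memory, episode, q], axis=-1))

# Answer module: score the candidate one-word answers from the final memory.
answer_logits = predict_answer(memory)                         # (batch, NUM_ANSWERS)
print(answer_logits.shape)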

Prerequisites

  • Python
  • Tensorflow
Usage

  • Preparation: Download the COCO train2014 and val2014 images here. Put the COCO train2014 images in the folder train/images, and put the COCO val2014 images in the folder val/images. Download the VQA v1 training and validation questions and annotations here. Put the files mscoco_train2014_annotations.json and OpenEnded_mscoco_train2014_questions.json in the folder train. Similarly, put the files mscoco_val2014_annotations.json and OpenEnded_mscoco_val2014_questions.json in the folder val. Furthermore, download the pretrained VGG16 net here or the ResNet50 net here if you want to use one of them to initialize the CNN part. The expected layout is shown below.
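
For reference, with the default folder names above, the data would be laid out like this (only items named in this README are shown; the .npy file name matches the training command below):

    train/
        images/                                      # COCO train2014 images
        mscoco_train2014_annotations.json
        OpenEnded_mscoco_train2014_questions.json
    val/
        images/                                      # COCO val2014 images
        mscoco_val2014_annotations.json
        OpenEnded_mscoco_val2014_questions.json
    vgg16_no_fc.npy                                  # optional pretrained VGG16 weights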

  • Training: To train a model using the VQA v1 training data, first set up the various parameters in the file config.py, and then run a command like this:

python main.py --phase=train \
    --load_cnn \
    --cnn_model_file='./vgg16_no_fc.npy' \
    [--train_cnn]

Turn on --train_cnn if you want to jointly train the CNN and RNN parts. Otherwise, only the RNN part is trained. The checkpoints will be saved in the folder models. If you want to resume the training from a checkpoint, run a command like this:

python main.py --phase=train \
    --load \
    --model_file='./models/xxxxxx.npy' \
    [--train_cnn]

To monitor the progress of training, run the following command:

tensorboard --logdir='./summary/'

  • Evaluation: To evaluate a trained model using the VQA v1 validation data, run a command like this:

python main.py --phase=eval --model_file='./models/xxxxxx.npy'

The evaluation result will be printed to stdout. Furthermore, the generated answers will be saved in the file val/results.json.

  • Inference: You can use the trained model to answer questions about arbitrary JPEG images! Put such images in the folder test/images. Also, create a CSV file containing your questions (this file should have three fields: image, question, question_id) and put it in the folder test; an example CSV is shown below. Then run a command like this:

python main.py --phase=test --model_file='./models/xxxxxx.npy'

The generated answers will be saved in the folder test/results.
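
For illustration, the questions CSV could look like this (the image names and question IDs below are placeholders):

    image,question,question_id
    dog.jpg,What animal is in the picture?,1
    kitchen.jpg,How many chairs are there?,2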

Results

A model pretrained with the default configuration can be downloaded here. This model was trained solely on the VQA v1 training data and achieves an accuracy of 60.35% on the VQA v1 validation data. Some successful example images are included in the repository.

References

  • Caiming Xiong, Stephen Merity, Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016.