
Image Captioning

PyTorch re-implementation of some image captioning models.

 

Supported Models

  • show_tell

    Show and Tell: A Neural Image Caption Generator. Oriol Vinyals, et al. CVPR 2015. [Paper] [Code]

  • att2all

    Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Kelvin Xu, et al. ICML 2015. [Paper] [Code]

  • adaptive_att & spatial_att

    Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning. Jiasen Lu, et al. CVPR 2017. [Paper] [Code]

You can switch between these models by editing the caption_model item in config.py.
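For example, to train the adaptive attention model, the relevant line in config.py would look roughly like this (the item name comes from the text above; the comment lists the supported values):

# config.py (excerpt)
# one of: 'show_tell', 'att2all', 'spatial_att', 'adaptive_att'
caption_model = 'adaptive_att'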

 

Requirements

First, make sure your environment has:

  • Python >= 3.5
  • Java 1.8.0 (for computing METEOR)

Then install requirements:

pip install -r requirements.txt

 

Dataset

For the dataset, I use Flickr30k with Karpathy's split. It is also fine to use Flickr8k or MSCOCO 2014 (their splits and captions are also contained in Karpathy's split). If you want to use another dataset, you may have to create a JSON file that follows the structure of Karpathy's JSON, as sketched below.
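A minimal sketch of that structure in Python, mirroring the fields in Karpathy's split files (the concrete values are made up, and id fields such as imgid and sentid are omitted):

import json

# Rough shape of a Karpathy-style split file; values are illustrative.
example = {
    "dataset": "flickr30k",
    "images": [
        {
            "filename": "1000092795.jpg",
            "split": "train",  # "train", "val" or "test"
            "sentences": [
                {"raw": "Two dogs play in the grass.",
                 "tokens": ["two", "dogs", "play", "in", "the", "grass"]},
                # ... typically five captions per image
            ],
        },
        # ... one entry per image
    ],
}

with open("dataset_custom.json", "w") as f:
    json.dump(example, f)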

 

Usage

Configuration

Edit the options and hyperparameters in config.py. Refer to that file for more information about each item.

Preprocess

First of all, you should preprocess the images along with their captions and store them locally:

python preprocess.py
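Conceptually, this step boils down to building a word map from the training captions and encoding each caption as a fixed-length index sequence. A rough sketch of the idea (not the actual code in preprocess.py; names like min_word_freq and max_len are illustrative):

from collections import Counter

def build_word_map(all_tokenized_captions, min_word_freq=5):
    """Map every frequent word to an index; reserve special tokens."""
    freq = Counter(w for caption in all_tokenized_captions for w in caption)
    words = [w for w, c in freq.items() if c >= min_word_freq]
    word_map = {w: i + 1 for i, w in enumerate(words)}  # 0 is reserved for <pad>
    word_map["<unk>"] = len(word_map) + 1
    word_map["<start>"] = len(word_map) + 1
    word_map["<end>"] = len(word_map) + 1
    word_map["<pad>"] = 0
    return word_map

def encode_caption(tokens, word_map, max_len=50):
    """<start> w1 ... wn <end>, padded to a fixed length."""
    ids = ([word_map["<start>"]]
           + [word_map.get(w, word_map["<unk>"]) for w in tokens[:max_len]]
           + [word_map["<end>"]])
    return ids + [word_map["<pad>"]] * (max_len + 2 - len(ids))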

Pre-trained Word Embeddings

If you would like to use pre-trained word embeddings (like GloVe), just set embed_pretrain to True and specify the path to the pre-trained vectors (embed_path) in config.py. You can also choose whether or not to fine-tune the word embeddings by editing the fine_tune_embeddings item.

If you would rather initialize the embedding layer's weights randomly, set embed_pretrain to False and specify the embedding size (embed_dim).
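Putting the two cases together, the corresponding part of config.py would look roughly like this (the item names come from the text above; the path is a placeholder):

# config.py (excerpt) -- word embedding options

# Case 1: load pre-trained vectors such as GloVe
embed_pretrain = True
embed_path = 'path/to/glove.6B.300d.txt'  # placeholder path
fine_tune_embeddings = True               # whether to keep training the embedding layer

# Case 2: random initialization (instead of the block above)
# embed_pretrain = False
# embed_dim = 512                         # embedding size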

Train

To train a model, just run:

python train.py

If you have enabled TensorBoard (tensorboard=True in config.py), you can visualize the losses and accuracies during training by running:

tensorboard --logdir=<your_log_dir>
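For reference, the switch is just a config item plus a log directory to point --logdir at (only tensorboard is confirmed by the text above; log_dir is an assumed name):

# config.py (excerpt) -- illustrative logging options
tensorboard = True   # enable TensorBoard logging
log_dir = 'logs/'    # assumed item name; pass this path to --logdir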

Test

To test a checkpoint on the test set and compute evaluation metrics:

python test.py

BLEU, CIDEr, METEOR, and ROUGE-L are currently supported. Implementations of these metrics are under the metrics folder.

During training, BLEU-4 and CIDEr scores on the validation set are computed after each epoch's validation pass. However, since the decoder's input at each timestep is the ground-truth word rather than the word it generated at the previous timestep (teacher forcing), these scores do not reflect the model's real performance. You may therefore want to use this script to compute proper scores for a specific trained model on the validation set.
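The distinction in code terms, as a minimal sketch (NLTK's corpus_bleu stands in here for the BLEU implementation under metrics; this is not the repository's actual evaluation loop):

from nltk.translate.bleu_score import corpus_bleu

# references[i]: the list of tokenized ground-truth captions for image i
# hypotheses[i]: the caption the model generated for image i *without* teacher
#                forcing, i.e. by greedy or beam search decoding from its own
#                previous predictions
def bleu4(references, hypotheses):
    return corpus_bleu(references, hypotheses)  # default weights give BLEU-4

# Scores computed on teacher-forced outputs are optimistic, because at every
# timestep the decoder was fed the correct previous word; decoding freely and
# then scoring gives the number that reflects real performance.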

Inference

To generate a caption (and visualize the attention weights if the model uses an attention module) for a specific image:

First edit the following items in inference.py:

model_path = 'path_to_trained_model'  # path to a checkpoint saved during training
wordmap_path = 'path_to_word_map'     # word map for converting between words and indices
img = 'path_to_image'                 # image to generate a caption for
beam_size = 5  # beam size for beam search

Then run:

python inference.py

 

Notes

  • The load_embeddings method (in utils/embedding.py) will try to cache the loaded embeddings under the dataset_output_path folder, which dramatically speeds up loading the next time.
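The caching pattern is roughly the following (a sketch of the idea, not the exact code in utils/embedding.py; the cache file name is illustrative):

import os
import torch

def load_embeddings_cached(embed_path, word_map, cache_dir):
    """Parse the pre-trained vectors once, then reuse a serialized cache."""
    cache_file = os.path.join(cache_dir, 'embeddings_cache.pt')  # illustrative name
    if os.path.isfile(cache_file):
        return torch.load(cache_file)  # fast path on later runs

    # Slow path: parse a GloVe-style text file (one "word v1 v2 ..." per line).
    with open(embed_path, 'r', encoding='utf-8') as f:
        embed_dim = len(f.readline().split()) - 1
    embeddings = torch.randn(len(word_map), embed_dim) * 0.1  # init for words not in the file
    with open(embed_path, 'r', encoding='utf-8') as f:
        for line in f:
            word, *values = line.rstrip().split(' ')
            if word in word_map:
                embeddings[word_map[word]] = torch.tensor([float(v) for v in values])

    torch.save(embeddings, cache_file)
    return embeddings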

 

Results

Here are some examples of captions generated for images in the test set.

I haven't fine-tuned the CNN. You may want to try fine-tuning it to get better results.

Adaptive Attention

Good Results

[image: adaptive-2]

Okay Results

[image: adaptive-1]

Errors: two boys, not a chair...

Bad Results

[image: adaptive-2]

Error: not crying...

Attention

Good Results

[images: attention-1, attention-2]

Okay Results

[image: attention-3]

Bad Results

[image: attention-3]

Errors: not a woman, and the model seems to recognize the sleeves as jeans...

 

License

MIT

 

Acknowledgements
