
Image Captioning

PyTorch re-implementation of some image captioning models.

 

Supported Models

  • show_tell

    Show and Tell: A Neural Image Caption Generator. Oriol Vinyals, et al. CVPR 2015. [Paper] [Code]

  • att2all

    Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Kelvin Xu, et al. ICML 2015. [Paper] [Code]

  • adaptive_att & spatial_att

    Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning. Jiasen Lu, et al. CVPR 2017. [Paper] [Code]

You can switch between these models by editing the caption_model item in config.py.
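For example, to train the adaptive attention model, the relevant line in config.py would look roughly like this (the item name comes from the text above; the comment lists the supported values):

# config.py (excerpt)
# one of: 'show_tell', 'att2all', 'spatial_att', 'adaptive_att'
caption_model = 'adaptive_att'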

 

Requirements

First, make sure your environment has:

  • Python >= 3.5
  • Java 1.8.0 (for computing METEOR)

Then install requirements:

pip install -r requirements.txt

 

Dataset

For the dataset, I use Flickr30k with Karpathy's split. It is also fine to use Flickr8k or MSCOCO 2014 (their splits and captions are also contained in Karpathy's split). If you want to use another dataset, you may have to create a JSON file that follows the structure of Karpathy's JSON, as sketched below.
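A minimal sketch of that structure in Python, mirroring the fields in Karpathy's split files (the concrete values are made up, and id fields such as imgid and sentid are omitted):

import json

# Rough shape of a Karpathy-style split file; values are illustrative.
example = {
    "dataset": "flickr30k",
    "images": [
        {
            "filename": "1000092795.jpg",
            "split": "train",  # "train", "val" or "test"
            "sentences": [
                {"raw": "Two dogs play in the grass.",
                 "tokens": ["two", "dogs", "play", "in", "the", "grass"]},
                # ... typically five captions per image
            ],
        },
        # ... one entry per image
    ],
}

with open("dataset_custom.json", "w") as f:
    json.dump(example, f)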

 

Usage

Configuration

Edit the options and hyperparameters in config.py. Refer to that file for more information about each item.

Preprocess

First of all, you should preprocess the images along with their captions and store them locally:

python preprocess.py
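Conceptually, this step boils down to building a word map from the training captions and encoding each caption as a fixed-length index sequence. A rough sketch of the idea (not the actual code in preprocess.py; names like min_word_freq and max_len are illustrative):

from collections import Counter

def build_word_map(all_tokenized_captions, min_word_freq=5):
    """Map every frequent word to an index; reserve special tokens."""
    freq = Counter(w for caption in all_tokenized_captions for w in caption)
    words = [w for w, c in freq.items() if c >= min_word_freq]
    word_map = {w: i + 1 for i, w in enumerate(words)}  # 0 is reserved for <pad>
    word_map["<unk>"] = len(word_map) + 1
    word_map["<start>"] = len(word_map) + 1
    word_map["<end>"] = len(word_map) + 1
    word_map["<pad>"] = 0
    return word_map

def encode_caption(tokens, word_map, max_len=50):
    """<start> w1 ... wn <end>, padded to a fixed length."""
    ids = ([word_map["<start>"]]
           + [word_map.get(w, word_map["<unk>"]) for w in tokens[:max_len]]
           + [word_map["<end>"]])
    return ids + [word_map["<pad>"]] * (max_len + 2 - len(ids))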

Pre-trained Word Embeddings

If you would like to use pre-trained word embeddings (like GloVe), just set embed_pretrain to True and specify the path to the pre-trained vectors (embed_path) in config.py. You can also choose whether or not to fine-tune the word embeddings by editing the fine_tune_embeddings item.

If you would rather initialize the embedding layer's weights randomly, set embed_pretrain to False and specify the embedding size (embed_dim).
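Putting the two cases together, the corresponding part of config.py would look roughly like this (the item names come from the text above; the path is a placeholder):

# config.py (excerpt) -- word embedding options

# Case 1: load pre-trained vectors such as GloVe
embed_pretrain = True
embed_path = 'path/to/glove.6B.300d.txt'  # placeholder path
fine_tune_embeddings = True               # whether to keep training the embedding layer

# Case 2: random initialization (instead of the block above)
# embed_pretrain = False
# embed_dim = 512                         # embedding size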

Train

To train a model, just run:

python train.py

If you have enabled TensorBoard (tensorboard=True in config.py), you can visualize the losses and accuracies during training by running:

tensorboard --logdir=<your_log_dir>
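For reference, the switch is just a config item plus a log directory to point --logdir at (only tensorboard is confirmed by the text above; log_dir is an assumed name):

# config.py (excerpt) -- illustrative logging options
tensorboard = True   # enable TensorBoard logging
log_dir = 'logs/'    # assumed item name; pass this path to --logdir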

Test

To test a checkpoint on the test set and compute evaluation metrics:

python test.py

BLEU, CIDEr, METEOR, and ROUGE-L are currently supported. Implementations of these metrics are under the metrics folder.

During training, BLEU-4 and CIDEr scores on the validation set are computed after each epoch's validation pass. However, since the decoder's input at each timestep is the ground-truth word rather than the word it generated at the previous timestep (teacher forcing), these scores do not reflect the model's real performance. You may therefore want to use this script to compute proper scores for a specific trained model on the validation set.
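The distinction in code terms, as a minimal sketch (NLTK's corpus_bleu stands in here for the BLEU implementation under metrics; this is not the repository's actual evaluation loop):

from nltk.translate.bleu_score import corpus_bleu

# references[i]: the list of tokenized ground-truth captions for image i
# hypotheses[i]: the caption the model generated for image i *without* teacher
#                forcing, i.e. by greedy or beam search decoding from its own
#                previous predictions
def bleu4(references, hypotheses):
    return corpus_bleu(references, hypotheses)  # default weights give BLEU-4

# Scores computed on teacher-forced outputs are optimistic, because at every
# timestep the decoder was fed the correct previous word; decoding freely and
# then scoring gives the number that reflects real performance.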

Inference

To generate a caption (and visualize the attention weights if the model uses an attention module) for a specific image:

First edit the following items in inference.py:

model_path = 'path_to_trained_model'  # path to a checkpoint saved during training
wordmap_path = 'path_to_word_map'     # word map for converting between words and indices
img = 'path_to_image'                 # image to generate a caption for
beam_size = 5  # beam size for beam search

Then run:

python inference.py

 

Notes

  • The load_embeddings method (in utils/embedding.py) will try to cache the loaded embeddings under the dataset_output_path folder, which dramatically speeds up loading the next time.
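The caching pattern is roughly the following (a sketch of the idea, not the exact code in utils/embedding.py; the cache file name is illustrative):

import os
import torch

def load_embeddings_cached(embed_path, word_map, cache_dir):
    """Parse the pre-trained vectors once, then reuse a serialized cache."""
    cache_file = os.path.join(cache_dir, 'embeddings_cache.pt')  # illustrative name
    if os.path.isfile(cache_file):
        return torch.load(cache_file)  # fast path on later runs

    # Slow path: parse a GloVe-style text file (one "word v1 v2 ..." per line).
    with open(embed_path, 'r', encoding='utf-8') as f:
        embed_dim = len(f.readline().split()) - 1
    embeddings = torch.randn(len(word_map), embed_dim) * 0.1  # init for words not in the file
    with open(embed_path, 'r', encoding='utf-8') as f:
        for line in f:
            word, *values = line.rstrip().split(' ')
            if word in word_map:
                embeddings[word_map[word]] = torch.tensor([float(v) for v in values])

    torch.save(embeddings, cache_file)
    return embeddings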

 

Results

Here are some examples of captions generated for images in the test set.

I haven't fine-tuned the CNN. You may want to try fine-tuning it to get better results.

Adaptive Attention

Good Results

[image: adaptive-2]

Okay Results

[image: adaptive-1]

Errors: two boys, not a chair...

Bad Results

[image: adaptive-2]

Error: not crying...

Attention

Good Results

[images: attention-1, attention-2]

Okay Results

[image: attention-3]

Bad Results

[image: attention-3]

Errors: not a woman, and the model seems to recognize the sleeves as jeans...

 

License

MIT

 

Acknowledgements
