
Python 3 Version of Show, Attend and Tell using Tensorflow

This repo is a Python 3 version of DeepRNN/image_captioning, which implements "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention" by Xu et al. (ICML 2015). Many thanks to salaniz's COCO evaluation tool for Python 3.

This version includes the following revisions:

  • Python 3.6

  • Tensorflow 1.7.0

  • Support for the Windows platform

  • Support for distributed computing on the Clusterone platform (not yet finished)

  • Add a ./data folder to collect all data-related folders.

    • The following folders are moved under ./data:
      1. models
      2. train
      3. val
      4. test
      5. pre-trained models (vgg16 & resnet50)
  • Fix the COCO library to support Python 3.6 and the new parameters below

  • Add new parameters in config.py:

      # Size of the COCO dataset subset used for quick experiments.
      # Remove the two settings below to train/evaluate on the whole COCO dataset.
      # Comment out this line to train on the whole training data.
      self.max_train_ann_num = 1000

      # Comment out this line to evaluate on the whole validation data.
      self.max_eval_ann_num = 20
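
These two caps are meant to shrink the COCO annotation set for quick experiments. As a hypothetical illustration (the repo's actual data loader may differ), applying such a cap when reading a captions file could look like this:

    # Hypothetical example of capping the COCO annotations; not the repo's exact loader.
    import json

    def load_capped_annotations(caption_file, max_ann_num=None):
        with open(caption_file) as f:
            data = json.load(f)
        anns = data["annotations"]
        # Keep only the first max_ann_num annotations when a cap is configured.
        return anns if max_ann_num is None else anns[:max_ann_num]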
    

Original readme below

Introduction

This neural system for image captioning is roughly based on the paper "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention" by Xu et al. (ICML2015). The input is an image, and the output is a sentence describing the content of the image. It uses a convolutional neural network to extract visual features from the image, and uses an LSTM recurrent neural network to decode these features into a sentence. A soft attention mechanism is incorporated to improve the quality of the caption. This project is implemented using the Tensorflow library, and allows end-to-end training of both CNN and RNN parts.
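
The soft attention step itself is compact: at each decoding step, the LSTM state scores every image region, the scores are normalized with a softmax, and the region features are averaged with those weights. The sketch below is a minimal TF 1.x illustration of this idea (additive attention over CNN region features); it is not the exact code or layer naming used in this repo.

    # Minimal soft-attention sketch (illustrative only; not this repo's exact code).
    # `features` holds CNN features for L image regions, `h` is the LSTM hidden state.
    import tensorflow as tf

    def soft_attention(features, h, attention_dim=512):
        """features: [batch, L, D] region features; h: [batch, H] decoder state."""
        # In a real decoder these dense layers would be created once and reused per step.
        att_img = tf.layers.dense(features, attention_dim, use_bias=False)  # [batch, L, A]
        att_h = tf.layers.dense(h, attention_dim, use_bias=False)           # [batch, A]
        # Additive scores for each image region.
        scores = tf.layers.dense(
            tf.nn.tanh(att_img + tf.expand_dims(att_h, 1)), 1)              # [batch, L, 1]
        alpha = tf.nn.softmax(tf.squeeze(scores, axis=-1))                  # [batch, L]
        # Context vector: attention-weighted sum of the region features.
        context = tf.reduce_sum(tf.expand_dims(alpha, -1) * features, axis=1)  # [batch, D]
        return context, alpha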

Prerequisites

Usage

  • Preparation: Download the COCO train2014 and val2014 data here. Put the COCO train2014 images in the folder train/images, and put the file captions_train2014.json in the folder train. Similarly, put the COCO val2014 images in the folder val/images, and put the file captions_val2014.json in the folder val. Furthermore, download the pretrained VGG16 net here or ResNet50 net here if you want to use it to initialize the CNN part.
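
Before training, it can help to confirm that the layout described above is in place. The quick check below is illustrative only and assumes the ./data layout of this fork; adjust the paths if you keep the original train/ and val/ folders at the repo root.

    # Illustrative sanity check for the expected data layout (not part of the repo).
    import os

    required = [
        "./data/train/images",
        "./data/train/captions_train2014.json",
        "./data/val/images",
        "./data/val/captions_val2014.json",
        "./data/vgg16_no_fc.npy",  # or the pretrained ResNet50 weights instead
    ]

    for path in required:
        print(("OK      " if os.path.exists(path) else "MISSING ") + path)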

  • Training: To train a model using the COCO train2014 data, first setup various parameters in the file config.py and then run a command like this:

python main.py --phase=train \
    --load_cnn \
    --cnn_model_file=./data/vgg16_no_fc.npy \
    [--train_cnn]    

Turn on --train_cnn if you want to jointly train the CNN and RNN parts. Otherwise, only the RNN part is trained. The checkpoints will be saved in the folder models. If you want to resume the training from a checkpoint, run a command like this:

python main.py --phase=train \
    --load \
    --model_file=./data/models/xxxxxx.npy \
    [--train_cnn]

To monitor the progress of training, run the following command:

tensorboard --logdir=./summary/

  • Evaluation: To evaluate a trained model using the COCO val2014 data, run a command like this:

python main.py --phase=eval \
    --model_file=./data/models/xxxxxx.npy \
    --beam_size=3

The result will be shown in stdout. Furthermore, the generated captions will be saved in the file val/results.json.
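
If you want to inspect the generated captions programmatically, results.json can be read directly. The snippet below assumes the file follows the standard COCO results format (a list of {"image_id", "caption"} entries) and uses the ./data/val path of this fork.

    # Illustrative peek at the generated captions (format and path assumed as above).
    import json

    with open("./data/val/results.json") as f:
        results = json.load(f)

    for entry in results[:5]:
        print(entry["image_id"], "->", entry["caption"])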

  • Inference: You can use the trained model to generate captions for any JPEG images! Put such images in the folder test/images, and run a command like this:

python main.py --phase=test \
    --model_file=./data/models/xxxxxx.npy \
    --beam_size=3

The generated captions will be saved in the folder test/results.

Results

A pretrained model with default configuration can be downloaded here. This model was trained solely on the COCO train2014 data. It achieves the following BLEU scores on the COCO val2014 data (with beam size=3):

  • BLEU-1 = 70.3%
  • BLEU-2 = 53.6%
  • BLEU-3 = 39.8%
  • BLEU-4 = 29.5%
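
These metrics come from the COCO caption evaluation toolkit. A rough sketch of scoring a results file with the Python 3 port (salaniz's COCO evaluation tool, mentioned at the top) might look like the following; the paths are illustrative, and the repo's eval phase runs this step for you.

    # Sketch of scoring generated captions with the Python 3 coco-caption port.
    from pycocotools.coco import COCO
    from pycocoevalcap.eval import COCOEvalCap

    coco = COCO("./data/val/captions_val2014.json")      # ground-truth captions
    coco_res = coco.loadRes("./data/val/results.json")   # generated captions
    coco_eval = COCOEvalCap(coco, coco_res)
    coco_eval.evaluate()                                  # prints BLEU-1..4, METEOR, ROUGE_L, CIDEr

    for metric, score in coco_eval.eval.items():
        print(metric, score)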

Here are some captions generated by this model: examples

References

  • Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015.