Self-critical Sequence Training for Image Captioning (+ misc.)

This repository includes the unofficial implementation Self-critical Sequence Training for Image Captioning and Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering.

The author of SCST helped me a lot when I tried to replicate the result. Great thanks. The att2in2 model can achieve more than 1.20 Cider score on Karpathy's test split (with self-critical training, bottom-up feature, large rnn hidden size, without ensemble)

This is based on my ImageCaptioning.pytorch repository. The modifications is:

Self critical training.
Bottom up feature support from ref. (Evaluation on arbitrary images is not supported.)
Ensemble
Multi-GPU training
Add transformer (merged from Transformer_captioning)

Requirements

Python 2.7 (because there is no coco-caption version for python 3) PyTorch 0.4 (along with torchvision) cider (already been added as a submodule)

(Skip if you are using bottom-up feature): If you want to use resnet to extract image features, you need to download pretrained resnet model for both training and evaluation. The models can be downloaded from here, and should be placed in data/imagenet_weights.

Pretrained models (using resnet101 feature)

Pretrained models are provided here. And the performances of each model will be maintained in this issue.

If you want to do evaluation only, you can then follow this section after downloading the pretrained models (and also the pretrained resnet101).

Train your own network on COCO

Download COCO captions and preprocess them

Download preprocessed coco captions from link from Karpathy's homepage. Extract dataset_coco.json from the zip file and copy it in to data/. This file provides preprocessed captions and also standard train-val-test splits.

Then do:

$ python scripts/prepro_labels.py --input_json data/dataset_coco.json --output_json data/cocotalk.json --output_h5 data/cocotalk

prepro_labels.py will map all words that occur <= 5 times to a special UNK token, and create a vocabulary for all the remaining words. The image information and vocabulary are dumped into data/cocotalk.json and discretized caption data are dumped into data/cocotalk_label.h5.

Download COCO dataset and pre-extract the image features (Skip if you are using bottom-up feature)

Download the coco images from link. We need 2014 training images and 2014 val. images. You should put the train2014/ and val2014/ in the same directory, denoted as $IMAGE_ROOT.

Then:

$ python scripts/prepro_feats.py --input_json data/dataset_coco.json --output_dir data/cocotalk --images_root $IMAGE_ROOT

prepro_feats.py extract the resnet101 features (both fc feature and last conv feature) of each image. The features are saved in data/cocotalk_fc and data/cocotalk_att, and resulting files are about 200GB.

(Check the prepro scripts for more options, like other resnet models or other attention sizes.)

Warning: the prepro script will fail with the default MSCOCO data because one of their images is corrupted. See this issue for the fix, it involves manually replacing one image in the dataset.

Download Bottom-up features (Skip if you are using resnet features)

Download pre-extracted feature from link. You can either download adaptive one or fixed one.

For example:

mkdir data/bu_data; cd data/bu_data
wget https://storage.googleapis.com/bottom-up-attention/trainval.zip
unzip trainval.zip

Then:

python script/make_bu_data.py --output_dir data/cocobu

This will create data/cocobu_fc, data/cocobu_att and data/cocobu_box. If you want to use bottom-up feature, you can just follow the following steps and replace all cocotalk with cocobu.

Start training

$ python train.py --id fc --caption_model fc --input_json data/cocotalk.json --input_fc_dir data/cocotalk_fc --input_att_dir data/cocotalk_att --input_label_h5 data/cocotalk_label.h5 --batch_size 10 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 0 --checkpoint_path log_fc --save_checkpoint_every 6000 --val_images_use 5000 --max_epochs 30

The train script will dump checkpoints into the folder specified by --checkpoint_path (default = save/). We only save the best-performing checkpoint on validation and the latest checkpoint to save disk space.

To resume training, you can specify --start_from option to be the path saving infos.pkl and model.pth (usually you could just set --start_from and --checkpoint_path to be the same).

If you have tensorflow, the loss histories are automatically dumped into --checkpoint_path, and can be visualized using tensorboard.

The current command use scheduled sampling, you can also set scheduled_sampling_start to -1 to turn off scheduled sampling.

If you'd like to evaluate BLEU/METEOR/CIDEr scores during training in addition to validation cross entropy loss, use --language_eval 1 option, but don't forget to download the coco-caption code into coco-caption directory.

For more options, see opts.py.

A few notes on training. To give you an idea, with the default settings one epoch of MS COCO images is about 11000 iterations. After 1 epoch of training results in validation loss ~2.5 and CIDEr score of ~0.68. By iteration 60,000 CIDEr climbs up to about ~0.84 (validation loss at about 2.4 (under scheduled sampling)).

Train using self critical

First you should preprocess the dataset and get the cache for calculating cider score:

$ python scripts/prepro_ngrams.py --input_json .../dataset_coco.json --dict_json data/cocotalk.json --output_pkl data/coco-train --split train

Then, copy the model from the pretrained model using cross entropy. (It's not mandatory to copy the model, just for back-up)

$ bash scripts/copy_model.sh fc fc_rl

Then

$ python train.py --id fc_rl --caption_model fc --input_json data/cocotalk.json --input_fc_dir data/cocotalk_fc --input_att_dir data/cocotalk_att --input_label_h5 data/cocotalk_label.h5 --batch_size 10 --learning_rate 5e-5 --start_from log_fc_rl --checkpoint_path log_fc_rl --save_checkpoint_every 6000 --language_eval 1 --val_images_use 5000 --self_critical_after 30

You will see a huge boost on Cider score, : ).

A few notes on training. Starting self-critical training after 30 epochs, the CIDEr score goes up to 1.05 after 600k iterations (including the 30 epochs pertraining).

Caption images after training

Generate image captions

Evaluate on raw images

Now place all your images of interest into a folder, e.g. blah, and run the eval script:

$ python eval.py --model model.pth --infos_path infos.pkl --image_folder blah --num_images 10

This tells the eval script to run up to 10 images from the given folder. If you have a big GPU you can speed up the evaluation by increasing batch_size. Use --num_images -1 to process all images. The eval script will create an vis.json file inside the vis folder, which can then be visualized with the provided HTML interface:

$ cd vis
$ python -m SimpleHTTPServer

Now visit localhost:8000 in your browser and you should see your predicted captions.

Evaluate on Karpathy's test split

$ python eval.py --dump_images 0 --num_images 5000 --model model.pth --infos_path infos.pkl --language_eval 1

The defualt split to evaluate is test. The default inference method is greedy decoding (--sample_max 1), to sample from the posterior, set --sample_max 0.

Beam Search. Beam search can increase the performance of the search for greedy decoding sequence by ~5%. However, this is a little more expensive. To turn on the beam search, use --beam_size N, N should be greater than 1.

Miscellanea

Using cpu. The code is currently defaultly using gpu; there is even no option for switching. If someone highly needs a cpu model, please open an issue; I can potentially create a cpu checkpoint and modify the eval.py to run the model on cpu. However, there's no point using cpu to train the model.

Train on other dataset. It should be trivial to port if you can create a file like dataset_coco.json for your own dataset.

Live demo. Not supported now. Welcome pull request.

For more advanced features:

Checkout ADVANCED.md.

Reference

If you find this repo useful, please consider citing (no obligation at all):

@article{luo2018discriminability,
  title={Discriminability objective for training descriptive captions},
  author={Luo, Ruotian and Price, Brian and Cohen, Scott and Shakhnarovich, Gregory},
  journal={arXiv preprint arXiv:1803.04376},
  year={2018}
}

Of course, please cite the original paper of models you are using (You can find references in the model files).

Acknowledgements

Thanks the original neuraltalk2 and awesome PyTorch team.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Self-critical Sequence Training for Image Captioning (+ misc.)

Requirements

Pretrained models (using resnet101 feature)

Train your own network on COCO

Download COCO captions and preprocess them

Download COCO dataset and pre-extract the image features (Skip if you are using bottom-up feature)

Download Bottom-up features (Skip if you are using resnet features)

Start training

Train using self critical

Caption images after training

Generate image captions

Evaluate on raw images

Evaluate on Karpathy's test split

Miscellanea

For more advanced features:

Reference

Acknowledgements

Files

README.md

Latest commit

History

README.md

File metadata and controls

Self-critical Sequence Training for Image Captioning (+ misc.)

Requirements

Pretrained models (using resnet101 feature)

Train your own network on COCO

Download COCO captions and preprocess them

Download COCO dataset and pre-extract the image features (Skip if you are using bottom-up feature)

Download Bottom-up features (Skip if you are using resnet features)

Start training

Train using self critical

Caption images after training

Generate image captions

Evaluate on raw images

Evaluate on Karpathy's test split

Miscellanea

For more advanced features:

Reference

Acknowledgements