Image Captioning with RNN-based Attention (License: MIT)

We introduce an attention-based model that automatically learns to generate captions for images. Our model features a novel attention module built around an elegant modification of the GRU architecture. We validate our attention model on the MS COCO (2017) benchmark dataset and compare its performance with other state-of-the-art models. Our proposed model achieves a BLEU-1 score of 74.0.

Example captions generated by our model

Our Contributions and Proposed Model

Our proposed attention model for image captioning consists of a CNN, an attention module, and an LSTM. The attention module has two main components: an MLP that computes the attention weights, and an attention GRU module that provides a contextual representation enabling logical reasoning over the interesting regions.
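
As a rough illustration of the first component, the sketch below shows an MLP that scores each CNN region feature against the decoder state and normalizes the scores into attention weights. It is a minimal sketch, not the repository's code; layer sizes, shapes, and names are assumptions.

```python
import tensorflow as tf

# Minimal sketch (assumed shapes and layer sizes, not the repo's code):
# an MLP scores each CNN region feature against the current LSTM hidden
# state, and a softmax turns the scores into attention weights.
class AttentionMLP(tf.keras.layers.Layer):
    def __init__(self, hidden_units=512):
        super().__init__()
        self.W_region = tf.keras.layers.Dense(hidden_units)  # projects region features
        self.W_state = tf.keras.layers.Dense(hidden_units)   # projects decoder state
        self.score = tf.keras.layers.Dense(1)                 # scalar score per region

    def call(self, regions, state):
        # regions: (batch, num_regions, feat_dim) flattened CNN feature map
        # state:   (batch, state_dim) previous LSTM hidden state
        s = tf.nn.tanh(self.W_region(regions) + self.W_state(state)[:, None, :])
        logits = self.score(s)[..., 0]         # (batch, num_regions)
        return tf.nn.softmax(logits, axis=-1)  # attention weights
```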

Attention GRU Module

Inspired by C. Xiong et al., we want the attention mechanism to take into account both the position and the ordering of the input regions. An RNN would be advantageous in this situation, except that it cannot make use of the attention weights.

In the figure below, you can find the difference between (a) the traditional GRU and (b) the attention-based GRU proposed in this work:
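
Concretely, in the attention GRU of Xiong et al. the usual update gate is replaced by the scalar attention weight of the current region, so regions with low attention leave the hidden state essentially unchanged. Below is a minimal TensorFlow sketch of one such step, under assumed shapes and layer names; it is not the repository's implementation.

```python
import tensorflow as tf

# Sketch of an attention GRU step (Xiong et al. style): the attention weight
# g for the current region replaces the usual update gate z. Shapes and
# weight layouts are assumptions for illustration.
class AttnGRUCell(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.reset = tf.keras.layers.Dense(units, activation="sigmoid")
        self.candidate_x = tf.keras.layers.Dense(units, use_bias=False)
        self.candidate_h = tf.keras.layers.Dense(units, use_bias=False)

    def call(self, x, h_prev, g):
        # x: (batch, feat_dim) current region feature
        # h_prev: (batch, units) previous hidden state
        # g: (batch, 1) attention weight for this region (from the MLP above)
        r = self.reset(tf.concat([x, h_prev], axis=-1))                       # reset gate
        h_tilde = tf.nn.tanh(self.candidate_x(x) + self.candidate_h(r * h_prev))
        return g * h_tilde + (1.0 - g) * h_prev                               # g replaces the update gate
```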

Experiments

Dataset

We performed experiments on the MS COCO dataset, which contains complex day-to-day scenes of common objects in their natural context. The dataset contains 82,783 training images, 40,504 validation images, and 40,775 test images. Each image is annotated with 5 sentences collected via Amazon Mechanical Turk. Since there is an ongoing competition on this dataset, annotations for the test set are not available. To train our model, we used both the training and validation sets. To test the proposed model, we held out 5,000 samples from the validation set. The same split is used for all experiments. We did not use the test set for evaluation since only a limited number of submissions is allowed per day.
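
For reference, here is a hedged sketch of how such a split could be produced; the annotation file name and random seed are placeholders, not details from the repository.

```python
import json
import random

# Hold out 5,000 validation images for testing; the rest are folded into the
# training data. "annotations/captions_val.json" and seed 0 are assumptions.
random.seed(0)
with open("annotations/captions_val.json") as f:
    val = json.load(f)

val_ids = [img["id"] for img in val["images"]]
random.shuffle(val_ids)
heldout_test_ids = set(val_ids[:5000])   # fixed evaluation split
extra_train_ids = set(val_ids[5000:])    # merged with the official training set
```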

Metric

We use METEOR and BLEU as evaluation metrics; both are popular in the machine translation literature and are used in recent image caption generation papers. The BLEU score is based on the n-gram precision of the generated caption with respect to the references. METEOR is based on the harmonic mean of unigram precision and recall, and correlates well with human judgment.
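
As a toy illustration of how BLEU-n is computed, the snippet below uses NLTK with naive whitespace tokenization; the paper's reported numbers come from the standard COCO captioning evaluation, not from this snippet.

```python
from nltk.translate.bleu_score import corpus_bleu

# Toy example: each hypothesis caption is scored against its reference
# captions using uniform n-gram weights (BLEU-1 through BLEU-4).
references = [
    [["a", "man", "riding", "a", "horse"], ["a", "person", "on", "a", "horse"]],
]
hypotheses = [["a", "man", "rides", "a", "horse"]]

for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))
    print(f"BLEU-{n}: {corpus_bleu(references, hypotheses, weights=weights):.3f}")
```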

Implementation Details

The models are implemented in TensorFlow and trained with the RMSprop optimizer for 100 epochs, with a batch size of 40 and a learning rate of 0.0001. For a fair comparison, we re-implemented two baseline models and trained them and our proposed model with the same settings. The CNN used in all models is ResNet150, with weights initialized from a model pre-trained on ImageNet.
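
A minimal sketch of this training configuration in tf.keras is shown below. Note that tf.keras.applications does not ship a "ResNet150", so ResNet152 with ImageNet weights is used here as a stand-in feature extractor; the captioning decoder and data pipeline are omitted.

```python
import tensorflow as tf

# Training configuration from the section above. ResNet152 (ImageNet weights)
# stands in for the README's "ResNet150"; the decoder and data pipeline are
# not shown.
cnn = tf.keras.applications.ResNet152(include_top=False, weights="imagenet")
cnn.trainable = False  # freezing is an assumption; the text only says ImageNet weights initialize the CNN

optimizer = tf.keras.optimizers.RMSprop(learning_rate=1e-4)  # learning rate 0.0001
EPOCHS = 100
BATCH_SIZE = 40
```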

Results

BLEU-1/2/3/4 and METEOR scores compared to other methods on the MS COCO dataset. Models marked with * are trained on both the training and validation sets.

Conclusion

We have presented a new attention mechanism for image caption generation by introducing ATTN GRU, a modified version of the traditional GRU. Unlike the soft-attention mechanism, our attention model preserves spatial information as well as the order of the regions in the image. Experimental results on the MS COCO dataset show the effectiveness of our model on the image captioning task.

Contributors

Mehrdad Mokhtari; Akbar Rafiey; Hamid Homapour; Faezeh Bayat

More details can be found in the following file:

Image_Captioning_with_GRU_based_Attention.pdf
