
An unofficial Torch implementation of J. Lu, C. Xiong, et al., Knowing when to Look: Adaptive Attention via a Visual Sentinel for Image Captioning, 2017 with deformable adaptive attention


DiTo97/dense-image-captioning


dense image captioning

An unofficial Torch implementation of J. Lu, C. Xiong, et al., Knowing when to Look: Adaptive Attention via a Visual Sentinel for Image Captioning, 2017 trained on the COCO image captioning and Flickr30k datasets.

The implementation presents the following variations from the paper:

  • deformable adaptive attention;
  • larger visual sentinel size (128-dim);
  • model evaluation against the SPICE metric;
  • MCTS-based decoding.

Introduction

Dense image captioning plays a central role in enabling visual-language understanding of the outside world.

In this project we propose a deformable variant of the adaptive attention with a visual sentinel, introduced in the reference paper for estimating grounding probabilities. The variant allows larger networks to be constructed while running at a faster inference speed and training for almost half the epochs with equal performance.
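
For readers unfamiliar with the reference paper, the sentinel gate can be sketched in plain Python as follows. This is a minimal sketch of the paper's original (non-deformable) formulation, not this repository's code: the function names and the toy features and scores are illustrative assumptions, and the attention scores are taken as given rather than computed by a network.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_context(region_feats, region_scores, sentinel, sentinel_score):
    """Mix region features and the visual sentinel into one context vector.

    Returns (context, beta), where beta is the sentinel gate: the share of
    attention mass placed on the sentinel rather than on the image regions.
    All names here are hypothetical, chosen for illustration only.
    """
    # Softmax over the k region scores extended with the sentinel score.
    weights = softmax(list(region_scores) + [sentinel_score])
    beta = weights[-1]
    dim = len(sentinel)
    # Start from the sentinel's contribution, then add the weighted regions.
    context = [beta * sentinel[d] for d in range(dim)]
    for weight, feat in zip(weights[:-1], region_feats):
        for d in range(dim):
            context[d] += weight * feat[d]
    return context, beta


# Toy example: two image regions with 2-dim features (made-up numbers).
regions = [[1.0, 0.0], [0.0, 1.0]]
context, beta = adaptive_context(regions, [0.0, 0.0], [0.5, 0.5], 0.0)
grounding_prob = 1.0 - beta  # how much the current word relies on the image
```

A gate close to 1 means the decoder leans on its language model state (the sentinel); a gate close to 0 means it leans on the image regions. How the deformable variant changes the way region features are sampled is not shown here.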

This project is part of a larger venture for the development of visual-language aid tools for visually impaired people that combine speech recognition, speech synthesis, image captioning and familiar person identification.

For more information, see the attached in-depth report.

Training

The model was trained for 50 epochs on a multi-GPU HPC cluster courtesy of CERN.

Usage

The following files must be downloaded from Google Drive:

The former contains the dataset with COCO-like annotations and the corresponding vocabulary.
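
As a rough illustration of how COCO-like annotations relate to a vocabulary, the sketch below groups captions by image and derives a word index from them. It assumes the standard COCO captions layout ("images" entries with "id" and "file_name", "annotations" entries with "image_id" and "caption"); the actual layout of the provided files, the special tokens and the function names are assumptions, not taken from this repository.

```python
from collections import Counter, defaultdict


def group_captions(coco):
    """Map each image file name to the list of its captions."""
    names = {img["id"]: img["file_name"] for img in coco["images"]}
    captions = defaultdict(list)
    for ann in coco["annotations"]:
        captions[names[ann["image_id"]]].append(ann["caption"])
    return dict(captions)


def build_vocab(captions, min_count=1):
    """Build a token-to-index map from lowercased, whitespace-split captions."""
    counts = Counter(
        token
        for texts in captions.values()
        for text in texts
        for token in text.lower().split()
    )
    # Hypothetical special tokens; the provided vocabulary may use others.
    specials = ["<pad>", "<start>", "<end>", "<unk>"]
    words = sorted(w for w, c in counts.items() if c >= min_count)
    return {tok: idx for idx, tok in enumerate(specials + words)}


# Tiny in-memory stand-in for an annotation file (made-up data).
sample = {
    "images": [{"id": 1, "file_name": "a.jpg"}],
    "annotations": [
        {"image_id": 1, "id": 10, "caption": "A dog runs"},
        {"image_id": 1, "id": 11, "caption": "a dog"},
    ],
}
captions = group_captions(sample)
vocab = build_vocab(captions)
```

With a real annotation file, the dictionary would typically come from json.load on the downloaded file instead of the in-memory sample.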

The following files should be downloaded from Google Drive for display purposes:

N.B.: If the provided links are no longer available, contact the authors.

Authors
