
Image-captioning

Overview

In this project, a CNN-LSTM encoder-decoder model is used to generate captions for images automatically. The model comprises two components: a Convolutional Neural Network (CNN) that transforms an input image into a set of features, encoding the information in the image as a vector (an embedding, i.e. a featurized representation of the image), and a Recurrent Neural Network (RNN) that turns those features into descriptive language.

File description

Notebook 0: Initializes the COCO API (the "pycocotools" library), used to access data from the MS COCO (Common Objects in Context) dataset.
Notebook 1: Uses pycocotools, torchvision transforms, and NLTK to preprocess the images and captions for network training. It also explores the EncoderCNN and DecoderRNN.
model.py: Architecture and forward() functions for encoder and decoder.
Notebook 2: Selection of hyperparameter values and training. The hyperparameter selections are explained.
Notebook 3: Using the trained model to generate captions for images in the test dataset.
data_loader.py: Custom data loader for PyTorch combining the dataset and the sampler.
vocabulary.py: Vocabulary constructor built from the captions in the training dataset (see the sketch after this list).
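
As a rough illustration of what the NLTK preprocessing in Notebook 1 and the vocabulary.py constructor do, here is a minimal sketch; the class below is simplified, and its details (method names, special tokens) are assumptions rather than the actual repository code.

```python
import nltk  # requires the "punkt" tokenizer data: nltk.download("punkt")

class Vocabulary:
    """Toy vocabulary: maps caption tokens to integer ids and back."""
    def __init__(self):
        self.word2idx, self.idx2word, self.idx = {}, {}, 0
        for token in ("<start>", "<end>", "<unk>"):   # reserved special tokens
            self.add_word(token)

    def add_word(self, word):
        if word not in self.word2idx:
            self.word2idx[word] = self.idx
            self.idx2word[self.idx] = word
            self.idx += 1

    def __call__(self, word):
        # Unseen words fall back to the <unk> id.
        return self.word2idx.get(word, self.word2idx["<unk>"])

# Tokenize a caption with NLTK (as Notebook 1 does) and convert it to ids.
caption = "A dog plays with a ball."
tokens = nltk.tokenize.word_tokenize(caption.lower())
vocab = Vocabulary()
for t in tokens:
    vocab.add_word(t)
ids = [vocab("<start>")] + [vocab(t) for t in tokens] + [vocab("<end>")]
print(ids)
```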

Encoder-Decoder model


This is the complete architecture of the image captioning model. The CNN encoder finds patterns in the image and encodes them into a vector that is passed to the LSTM decoder, which outputs one word at each time step to best describe the image. Once the <end> token is produced, the caption is complete, and that is the output for that particular image.
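
To make the picture concrete, here is a hedged sketch of how the two pieces could fit together at inference time. It assumes the EncoderCNN and DecoderRNN classes from model.py, a greedy sample() method on the decoder, and a vocab object built from the training captions; the hyperparameter values are placeholders, not the ones actually chosen in Notebook 2.

```python
import torch
from PIL import Image
from torchvision import transforms
from model import EncoderCNN, DecoderRNN    # architectures defined in model.py

embed_size, hidden_size = 256, 512          # placeholder values; see Notebook 2
vocab_size = len(vocab)                     # vocab: Vocabulary built from the training captions

# Standard ImageNet-style preprocessing for the ResNet encoder.
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

encoder = EncoderCNN(embed_size).eval()
decoder = DecoderRNN(embed_size, hidden_size, vocab_size).eval()

image = transform(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    features = encoder(image)               # (1, embed_size) image embedding
    word_ids = decoder.sample(features)     # greedy decoding, one word id per time step

# Turn the ids back into words, stopping at the <end> token.
words = []
for idx in word_ids:
    word = vocab.idx2word[idx]
    if word == "<end>":
        break
    if word != "<start>":
        words.append(word)
print(" ".join(words))
```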

Encoder:


The encoder is based on a Convolutional Neural Network that encodes an image into a compact featurized representation (an embedding). The CNN encoder is a ResNet (Residual Network); residual connections help diminish the vanishing and exploding gradient problems. I used the pre-trained ResNet-152 model.
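
A minimal sketch of such an encoder, assuming a frozen pre-trained ResNet-152 from torchvision with its final classification layer replaced by a trainable linear embedding; the exact layers and sizes in model.py may differ.

```python
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    def __init__(self, embed_size):
        super().__init__()
        resnet = models.resnet152(pretrained=True)
        for param in resnet.parameters():
            param.requires_grad_(False)          # freeze the convolutional backbone
        modules = list(resnet.children())[:-1]   # drop the final classification layer
        self.resnet = nn.Sequential(*modules)
        self.embed = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):
        features = self.resnet(images)                 # (batch, 2048, 1, 1)
        features = features.view(features.size(0), -1) # flatten to (batch, 2048)
        return self.embed(features)                    # (batch, embed_size)
```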

Decoder:


The CNN encoder is followed by a recurrent neural network (an LSTM) that generates the corresponding sentence.
The feature vector is fed into the DecoderRNN, which is "unfolded" in time. Each word appearing as output at the top is fed back into the network as input (at the bottom) at the next time step, until the entire caption is generated. The arrow pointing to the right that connects the LSTM boxes represents the hidden state, the network's "memory", which is also fed back into the LSTM at each time step.
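
A minimal sketch of such a decoder, assuming a single-layer LSTM, teacher forcing in forward(), and greedy decoding in sample(); the actual model.py may differ in details such as layer sizes and the sample() signature.

```python
import torch
import torch.nn as nn

class DecoderRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Teacher forcing: the image embedding is the first input, followed by
        # the embedded caption tokens (dropping the last one to keep alignment).
        embeddings = self.embed(captions[:, :-1])
        inputs = torch.cat((features.unsqueeze(1), embeddings), dim=1)
        hiddens, _ = self.lstm(inputs)
        return self.linear(hiddens)              # (batch, seq_len, vocab_size)

    def sample(self, features, max_len=20):
        # Greedy decoding: pick the most likely word at each step and feed its
        # embedding back in as the next input (in practice one would also stop
        # early once the <end> id is produced).
        inputs, states, ids = features.unsqueeze(1), None, []
        for _ in range(max_len):
            hiddens, states = self.lstm(inputs, states)
            outputs = self.linear(hiddens.squeeze(1))
            predicted = outputs.argmax(dim=1)
            ids.append(predicted.item())
            inputs = self.embed(predicted).unsqueeze(1)
        return ids
```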

Results:




Funny fails:


