
Neural Image Caption Generator

This is a Deep Learning model for generating captions for images. It combines techniques from Computer Vision and Natural Language Processing. Some hand-picked examples of images from the test dataset, along with the captions generated by the model, are shown below.

Image Captions demo

TABLE OF CONTENTS

  1. Introduction
  2. Dataset
  3. Model
  4. Frameworks, Libraries & Languages
  5. Usage
  6. Acknowledgement

Introduction

Deep Learning and Neural Networks have found profound applications in both NLP and Computer Vision. Before the Deep Learning era, statistical and Machine Learning techniques were commonly used for these tasks, especially in NLP. Neural Networks, however, have now proven to be powerful techniques, especially for more complex tasks. With the increase in the size of available datasets and efficient computational tools, Deep Learning is being thoroughly researched and applied in an increasing number of areas.
In 2012, the results of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) showed that Convolutional Neural Networks (CNNs) can be an excellent choice for tasks involving visual imagery. Being translation invariant, after learning a pattern in one region of an image, CNNs can very easily recognize it in another region - a task which was quite computationally inefficient in vanilla feed-forward networks. When many convolutional layers are stacked together, they can efficiently learn to recognize patterns in a hierarchical manner - the initial layers learn to detect edges, lines etc. while the later layers make use of these to learn more complex features. In this project, we make use of a popular CNN architecture - ResNet50 - to process the input images and obtain the feature vectors.
For generating the captions, we make use of Long Short-Term Memory (LSTM) networks. LSTMs are a variant of Recurrent Neural Networks which are widely used in Natural Language Processing. Unlike a Dense layer, an RNN layer does not process an input in one go. Instead, it processes a sequence element-by-element, at each step incorporating new data with the information processed so far. This property of an RNN makes it a natural yet powerful architecture for processing sequential inputs.
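To make this concrete, here is a minimal, self-contained sketch (not taken from this repository) of a Keras LSTM layer consuming a sequence of word embeddings step by step and returning a single summary vector. The sequence length, embedding size and number of units are illustrative assumptions.

    # Minimal sketch: an LSTM reads a sequence of word embeddings one time
    # step at a time and returns a single summary vector.
    # The shapes (34 steps, 50-d embeddings, 256 units) are illustrative.
    import numpy as np
    from tensorflow.keras.layers import Input, LSTM
    from tensorflow.keras.models import Model

    seq_len, embed_dim = 34, 50
    inputs = Input(shape=(seq_len, embed_dim))
    summary = LSTM(256)(inputs)              # processes the 34 steps sequentially
    model = Model(inputs, summary)

    dummy = np.random.rand(1, seq_len, embed_dim).astype("float32")
    print(model.predict(dummy).shape)        # (1, 256)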

Dataset

This project uses the Flickr 8K dataset for training the model. This can be downloaded from here. It contains 8000 images, most of them featuring people and animals in a state of action. Each image is provided with five different captions describing the entities and events depicted in the image. Different captions of the same image tend to focus on different aspects of the scene, or use different linguistic constructions. This ensures that there is enough linguistic variety in the description of the images.
Some sample images from the dataset, along with their captions, are given below-

Dataset demo
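For reference, here is a short sketch (not from the repository) of how the five captions per image can be grouped, assuming the standard Flickr8k token file in which every line has the form "<image name>#<0-4><TAB><caption>":

    # Group the five captions of each image by image name.
    from collections import defaultdict

    captions = defaultdict(list)
    with open("Flickr8k.token.txt", encoding="utf-8") as f:   # assumed file name
        for line in f:
            if not line.strip():
                continue
            image_id, caption = line.strip().split("\t", 1)
            image_name = image_id.split("#")[0]               # drop the "#0".."#4" suffix
            captions[image_name].append(caption.lower())

    print(len(captions))                                      # ~8000 images, 5 captions each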

Model

This project uses the ResNet50 architecture for obtaining the image features. ResNets (short for Residual Networks) have been a classic choice for many Computer Vision tasks since this architecture won the 2015 ImageNet Challenge. ResNets showed how even very deep Neural Networks (the original ResNet was around 152 layers deep!) can be trained without worrying about the vanishing gradient problem. The strength of a ResNet lies in its use of Skip Connections - these mitigate the vanishing gradient problem by providing a shorter alternate path for the gradient to flow through.
ResNet50, the variant used in this project, is a smaller version of the original ResNet152. This architecture is so frequently used for Transfer Learning that it comes preloaded in the Keras framework, along with weights trained on the ImageNet dataset. Since we only need this network to obtain the image feature vectors, we remove the last layer (which in the original model classified the input image into one of 1000 classes). The encoded features for the training and test images are stored in "encoded_train_features.pkl" and "encoded_test_features.pkl" respectively.
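A hedged sketch of this encoding step is shown below; it uses Keras' bundled ResNet50 with ImageNet weights and global average pooling in place of the removed classifier, and the image path used here is purely illustrative.

    # Encode images into 2048-dimensional ResNet50 feature vectors.
    import pickle
    import numpy as np
    from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
    from tensorflow.keras.preprocessing import image

    # Dropping the final 1000-way classifier leaves a 2048-d feature per image.
    encoder = ResNet50(weights="imagenet", include_top=False, pooling="avg")

    def encode(img_path):
        img = image.load_img(img_path, target_size=(224, 224))
        x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
        return encoder.predict(x)[0]                          # shape: (2048,)

    features = {"example.jpg": encode("example.jpg")}         # hypothetical image
    with open("encoded_train_features.pkl", "wb") as f:
        pickle.dump(features, f)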

Model Plot: A plot of the architecture

GloVe vectors were used for creating the word embeddings for the captions. The version used in this project provides 50-dimensional embedding vectors trained on a corpus of 6 billion English tokens. It can be downloaded from here. These embeddings are kept frozen (not fine-tuned on the current data) during training.
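As an illustration, the GloVe file can be loaded into an embedding matrix and wrapped in a frozen Keras Embedding layer roughly as follows; the vocabulary size and word-to-index mapping below are placeholders, not the project's actual values.

    # Build a frozen Embedding layer from 50-d GloVe vectors.
    import numpy as np
    from tensorflow.keras.layers import Embedding

    embedding_dim = 50
    glove = {}
    with open("glove.6B.50d.txt", encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            glove[parts[0]] = np.asarray(parts[1:], dtype="float32")

    vocab_size = 1848                          # placeholder vocabulary size
    word_index = {"dog": 1, "running": 2}      # placeholder word -> index mapping
    embedding_matrix = np.zeros((vocab_size, embedding_dim))
    for word, idx in word_index.items():
        if word in glove:
            embedding_matrix[idx] = glove[word]

    # trainable=False keeps the embeddings frozen during training.
    embedding_layer = Embedding(vocab_size, embedding_dim,
                                weights=[embedding_matrix], trainable=False)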
The neural network for generating the captions has been built using the Keras Functional API. The feature vectors (obtained from the ResNet50 network) are processed and combined with the caption data (which, after being converted into embeddings, is passed through an LSTM layer). This combined information is passed through a Dense layer followed by a Softmax layer (over the vocabulary words). The model was trained for 20 epochs, and at the end of each epoch it was saved in the "/model_checkpoints" directory. This process took about half an hour.
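A rough sketch of such a two-branch model in the Keras Functional API is given below; the layer sizes (256 units, 2048-d image features, 34-word captions) are assumptions for illustration, not the repository's exact configuration.

    # Two-branch captioning model: image features + partial caption -> next word.
    from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
    from tensorflow.keras.models import Model

    vocab_size, max_len, embedding_dim = 1848, 34, 50         # illustrative values

    # Image branch: ResNet50 feature vector -> dense projection
    img_input = Input(shape=(2048,))
    img_feats = Dense(256, activation="relu")(Dropout(0.5)(img_input))

    # Caption branch: word indices -> embeddings -> LSTM
    cap_input = Input(shape=(max_len,))
    cap_embed = Embedding(vocab_size, embedding_dim, mask_zero=True)(cap_input)
    cap_feats = LSTM(256)(Dropout(0.5)(cap_embed))

    # Merge both branches, then predict the next word over the vocabulary
    merged = add([img_feats, cap_feats])
    hidden = Dense(256, activation="relu")(merged)
    output = Dense(vocab_size, activation="softmax")(hidden)

    model = Model(inputs=[img_input, cap_input], outputs=output)
    model.compile(loss="categorical_crossentropy", optimizer="adam")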

Frameworks, Libraries & Languages

  • Keras
  • TensorFlow
  • Python 3
  • NumPy
  • Matplotlib
  • pickle-mixin

Usage

On the terminal, run the following commands-

  1. Make sure Python 3 is installed on your system, then install all dependencies
    pip install numpy
    pip install matplotlib
    pip install pickle-mixin
    pip install tensorflow
    pip install keras
  2. Clone this repository on your system and head over to it
    git clone https://github.com/matakshay/Neural_Image_Caption_Generator
    cd Neural_Image_Caption_Generator
  3. To run the model over a random image from the test dataset and see the caption, execute the following command-
    python3 predict.py
    This command can be executed multiple times; each time, a random image and the caption generated for it by the model will be displayed.
  4. To run the model and generate caption for a custom image, move the image (in JPEG format) to the current directory and rename it to "input.jpg". Then type the following on the terminal-
    python3 generate.py
    This loads the model and runs it with the input image. The generated caption will be printed.

Acknowledgement

I referred to many articles and research papers while working on this project. Some of them are listed below-