Image Captioning with PyTorch and Keras


Nishant Prabhu, 25 July 2020

In this tutorial, we will learn to build a simple image captioning system - a model that can take in an image and generate a sentence to describe it in the best possible way. This is a slightly advanced tutorial and requires that you understand some basics of natural language processing, the working and usage of pre-trained CNNs, and recurrent network units like LSTMs. Don't worry if you don't; I'll provide brief explanations and links for further reading wherever I can.

System specifications

For this tutorial, most of the magic is done with PyTorch. A few helper functions will be implemented with Keras. The code should work on most versions of these libraries, but here are my specifications:

1. Tensorflow (for Keras)        tensorflow-gpu==2.12.0
2. PyTorch                       torch==1.5.1
3. Python                        3.6.9

Dataset downloads

For this tutorial, we are using the Flickr 8K dataset, which can be downloaded from Jason Brownlee's Github using the links on this page.

Special note

All file paths used below are specific to my system. Please change them accordingly if you are using the code as is.

# Dependencies

import torch
import torch.optim as optim
import torch.nn.functional as F
from torchvision import models
from torchvision import transforms
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from nltk.translate.bleu_score import corpus_bleu

import os
import string
import pickle
import numpy as np 
from PIL import Image 
from matplotlib import pyplot as plt
from tqdm.notebook import tqdm

Image feature generation

The first thing we do is create rich features from images, which should be representative of their contents. It is best to use pre-trained deep CNNs for this task, since they are (usually) trained on very large datasets like ImageNet and can generate excellent features. I am going to use VGG-16 from torchvision, which is VGGNet with 16 layers. The model's general architecture looks like the image below.

VGGNet architecture

The final layer in this model is a fully connected layer mapping to 1000 units. This is kept for classification purposes (the ImageNet dataset has 1000 classes) and is not necessary for us. We'll retain only up to the penultimate layer of the model, which will give us a feature vector of 4096 dimensions for every image.
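
If you want to see exactly which layer we are dropping, you can quickly print the classifier head of torchvision's VGG-16 (a quick inspection only, assuming the pre-trained weights can be downloaded on your machine):

# Quick inspection of VGG-16's classifier head (not part of the pipeline)
vgg = models.vgg16(pretrained=True)
print(vgg.classifier)
# Sequential(
#   (0): Linear(in_features=25088, out_features=4096, bias=True)
#   (1): ReLU(inplace=True)
#   (2): Dropout(p=0.5, inplace=False)
#   (3): Linear(in_features=4096, out_features=4096, bias=True)
#   (4): ReLU(inplace=True)
#   (5): Dropout(p=0.5, inplace=False)
#   (6): Linear(in_features=4096, out_features=1000, bias=True)   <- the layer we drop
# )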

All images that go through the network must have the same shape. To ensure this, we define a transformation pipeline using the transforms.Compose function from torchvision. This pipeline performs the following operations.

  1. Resizes the image so that its shorter side is 256 pixels.
  2. Crops the image to retain only its center, which is of size (224, 224, 3). This assumes that the edges of the image do not contain any valuable information, which is reasonable most of the time.
  3. Converts this PIL image into a torch tensor.
  4. Normalizes each of the three channels of the image (RGB) with means (0.485, 0.456, 0.406) and standard deviations (0.229, 0.224, 0.225) respectively. These are the standard ImageNet channel statistics used by torchvision's pre-trained models, so we'll use them here as well.
# Transforms, to get the image into the right format
img_transform = transforms.Compose([
    transforms.Resize(size=256),
    transforms.CenterCrop(size=224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

# Function to generate image features
def generate_image_features(image_dir):
    
    vgg = models.vgg16(pretrained=True)
    # Drop only the final classifier layer (the 1000-way FC), so the ...
    # ... model outputs the 4096-dimensional penultimate features
    vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])
    vgg.eval()
    
    features = {}
    
    with torch.no_grad():
        for f in os.listdir(image_dir):
            path = image_dir + '/' + f
            image = Image.open(path).convert('RGB')   # Some images may not be RGB
            image = img_transform(image)
            image = image.unsqueeze(0)
            x = vgg(image)
            features.update({f: x})
        
    return features
    
# Generate the image features

image_features = generate_image_features("../data/Flickr8k_Dataset/images")

I already generated the features beforehand and stored them as a pkl file. I will load them a few cells down the line, when we need them.
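
For completeness, here is a minimal sketch of how the features can be dumped to disk with pickle (the filename here is just an example; adjust it to your own layout):

# Save the extracted features so VGG-16 doesn't have to be re-run every time
# (example path - change it to match your own directory structure)
with open("../saved_data/image_features.pkl", "wb") as f:
    pickle.dump(image_features, f)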

Caption processing

Now that the image features are ready, let's shift our focus to the captions. Let's first have a look at the kind of data we have before we begin thinking about anything else.

# Open captions file and read all lines
# Split over newline tags to get a list of lines
with open("../data/captions.txt", "r") as f:
    captions = f.read().split("\n")
    
# Display the first 10 captions
captions[:10]
>>> ['1000268201_693b08cb0e.jpg#0\tA child in a pink dress is climbing up a set of stairs in an entry way .',
 '1000268201_693b08cb0e.jpg#1\tA girl going into a wooden building .',
 '1000268201_693b08cb0e.jpg#2\tA little girl climbing into a wooden playhouse .',
 '1000268201_693b08cb0e.jpg#3\tA little girl climbing the stairs to her playhouse .',
 '1000268201_693b08cb0e.jpg#4\tA little girl in a pink dress going into a wooden cabin .',
 '1001773457_577c3a7d70.jpg#0\tA black dog and a spotted dog are fighting',
 '1001773457_577c3a7d70.jpg#1\tA black dog and a tri-colored dog playing with each other on the road .',
 '1001773457_577c3a7d70.jpg#2\tA black dog and a white dog with brown spots are staring at each other in the street .',
 '1001773457_577c3a7d70.jpg#3\tTwo dogs of different breeds looking at each other on the road .',
 '1001773457_577c3a7d70.jpg#4\tTwo dogs on pavement moving toward each other .']

You can see that every image is associated with 5 captions (every image, because I checked). The first thing we have to do is organize this information in an easier-to-work-with manner. A lookup table whose keys are the filenames and whose values are the lists of 5 captions associated with each file should do. The Python equivalent of a lookup table is a dictionary, so we'll create one. To separate the keys from the sentences, we split each line on the tab delimiter "\t". We'll also remove the number tags (#0, #1, etc.) from the filenames.

# Create dictionary of captions 
captions_dict = {}

for line in captions:
    
    # Separate caption from filename
    contents = line.split("\t")
    
    # If there are no captions, then ignore
    if len(contents) < 2:
        continue
    
    filename, caption = contents[0], contents[1]
    
    # Exclude the number tag from filename
    filename = filename[:-2]
    
    # If the dictionary already has the filename in its keys ...
    # ... simply extend the list with current caption
    # Else, add the current caption in a list
    if filename in captions_dict.keys():
        captions_dict[filename].append(caption)
    else:
        captions_dict[filename] = [caption]

Let's split the image features and captions into training and test sets.

# Create lists of filenames that go into training and testing data
# We will do a 70-30 split

split_point = int(0.7 * len(captions_dict))

training_files = list(captions_dict.keys())[:split_point]
test_files = list(captions_dict.keys())[split_point:]

# Segregate image features and captions for train and test into separate lists
train_features, train_captions = {}, {}
test_features, test_captions = {}, {}

for name in training_files:
    train_features[name] = image_features[name]
    train_captions[name] = captions_dict[name]
    
for name in test_files:
    test_features[name] = image_features[name]
    test_captions[name] = captions_dict[name]
# I created these beforehand, so I'll just load them here

with open("../saved_data/train_features.pkl", "rb") as f:
    train_features = pickle.load(f)
    
with open("../saved_data/train_captions.pkl", "rb") as f:
    train_captions = pickle.load(f)
    
with open("../saved_data/test_features.pkl", "rb") as f:
    test_features = pickle.load(f)
    
with open("../saved_data/test_captions.pkl", "rb") as f:
    test_captions = pickle.load(f)

Now that we have it in the right form, we need to make some changes to the captions themselves. Machines do not understand textual information the way we do (similar to how we do not understand numbers like they do :P). We have to employ a method to convert each word into a number (or a set of numbers) which can indicate to the model the presence of certain words in a sentence.

Tokenization

A simple solution to this would be to generate a vocabulary of all the words we have (a bag of words) and assign an integer to each word. That way, every single word that can occur in our examples is identified by a number. This process is called tokenization. Integers, however, are not the best way to provide information to neural networks; we will see what we can do about that a little later.
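
Here is a tiny, self-contained illustration of what the Keras Tokenizer does (toy sentences only, not part of our pipeline):

# Toy example: fit a tokenizer on two sentences and look at the word index
toy = Tokenizer()
toy.fit_on_texts(["a dog runs", "a child smiles"])
print(toy.word_index)
# {'a': 1, 'dog': 2, 'runs': 3, 'child': 4, 'smiles': 5}
print(toy.texts_to_sequences(["a dog smiles"]))
# [[1, 2, 5]]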

Preprocessing

Some elements of caption sentences do not carry any meaning. Examples include uppercase letters, punctuation and numbers (at least for our purposes). First we will clean each sentence of these entities. Then, we will add two special tokens at the start and end of each sentence, to indicate the beginning and end of the sentence (startseq and endseq respectively). These help the model know when to start and stop predicting. Once that is done, we tokenize all the lines using the Tokenizer class from the keras.preprocessing.text submodule.

Text processing

Padding

Neural networks expect inputs to be of fixed size when provided in a batch. Since captions might be of different lengths, we must pad each sentence at the end with special tokens (usually zeros) so that all inputs have the same shape. This process is known as padding. We will use the pad_sequences function from keras.preprocessing.sequence to perform this operation.

Padding example
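
A quick illustration of what pad_sequences does with post-padding (toy token ids only):

# Two "captions" of different lengths, padded at the end with zeros
print(pad_sequences([[2, 42], [2, 42, 7, 253]], padding='post'))
# [[  2  42   0   0]
#  [  2  42   7 253]]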

def preprocess_line(line):
    line = line.split()                             # Convert to a list of words
    line = [w.lower() for w in line]                # Convert to lowercase
    line = [w for w in line if w.isalpha()]         # Remove numbers
    line = " ".join(line).translate(
        str.maketrans("", "", string.punctuation)   # Remove punctuation
    )
    line = "startseq " + line + " endseq"
    return line


def get_all_captions(captions_dict):
    """ Collects all captions in one list """
    
    captions = []
    for cpt_list in captions_dict.values():
        captions.extend(cpt_list)
    return captions


def tokenize_sentences(lines):
    tokenizer = Tokenizer(filters='')           # Initialize the tokenizer object
    tokenizer.fit_on_texts(lines)               # This generates the word index map
    vocab_size = len(tokenizer.word_index)+1    # The size of the vocabulary
    return tokenizer, vocab_size


def pad_tokens(tokens):
    return pad_sequences(tokens, padding='post') # Max length is auto-calculated

Let's call these functions on our corpus to get clean text. We will fit a tokenizer on the lines now, but we will perform the actual tokenization later.

# Clean all the captions 

for filename, cpt_list in captions_dict.items():
    for i in range(len(cpt_list)):
        # Clean the caption
        cleaned_caption = preprocess_line(cpt_list[i])
        # Replace it in the correct location in the list
        cpt_list[i] = cleaned_caption
        
        
# Get all captions in a list and fit tokenizer

all_captions = get_all_captions(captions_dict)
tokenizer, vocab_size = tokenize_sentences(all_captions)
# How many words do we have?
print("Size of the vocabulary:", vocab_size)
>>> Size of the vocabulary: 8372

Model definition

Here comes the heart of this tutorial. But before that, for those who are new, below is a short primer on recurrent neural networks. Also, we'll discuss the correct representation of words in this section. If you know what I'm talking about, skip to Model Structure below.

Recurrent neural networks

Simple fully connected networks assume that the examples provided to them are independent of each other. However, when we are dealing with sequences, this assumption falls apart. So it is not wise to use Dense layers to model such dependencies between consecutive sequence elements. This is where recurrent neural networks (RNNs) come to the rescue!

An RNN is capable of retaining some memory of the examples it has seen earlier by feeding back the activations of its hidden layer to itself (left image). We can unroll the RNN in time as shown in the image on the right. Each input layer + hidden layer + output in the right image is the same network at different times. Also note that the outputs taken from the RNN layer are usually its hidden layer activations.

RNN diagram

The network itself behaves as a fully connected network, with the output given by the expression below. It incorporates the effect of the current input as well as the previous hidden layer activations, as a weighted sum.
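
In the standard (vanilla) RNN formulation, this expression is

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h), \qquad y_t = W_{hy} h_t + b_y$$

where $x_t$ is the current input, $h_{t-1}$ is the previous hidden state, and the $W$ matrices and bias vectors are learned parameters.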

Gradient issues and LSTMs

RNNs suffer from two issues, namely vanishing and exploding gradients. This greatly affects their performance, making them rather unpopular in practice. Hochreiter and Schmidhuber came up with a novel solution to this with their Long Short-Term Memory (LSTM) cells, a replacement for the simple RNN cell. Its structure is shown below.

LSTM diagram

We will not go into the details of what this is and how it works (you can read more about it in Christopher Olah's blog). All we need to know is that these cells have the capability to selectively read, forget and output information from inputs and past hidden states. They effectively solve the vanishing gradients problem; the exploding gradients - not so much. Anyway, they are better at modelling long term dependencies than RNNs.

I/O specifications for recurrent layers

I think it's important that we know how a layer of RNNs or LSTMs accepts and outputs tensors. The input tensors to these layers have the shape (batch size, sequence length, features). They output tensors in the very same format, except that the number of features in the output equals the number of hidden units of the recurrent layer.

LSTM I/O

Note for PyTorch: PyTorch has the default convention (sequence length, batch size, features) for its recurrent layers. Passing the argument batch_first = True brings it to the format above.
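
A minimal shape check makes this concrete (the sizes mirror what we will use later: 200-dimensional embeddings, 512 hidden units, batch size 32, sequence length 15):

# Check input/output shapes of a batch-first LSTM layer
lstm = torch.nn.LSTM(input_size=200, hidden_size=512, batch_first=True)
x = torch.randn(32, 15, 200)       # (batch size, sequence length, features)
out, (h, c) = lstm(x)
print(out.shape)                   # torch.Size([32, 15, 512]) - hidden activations at every time step
print(h.shape)                     # torch.Size([1, 32, 512])  - final hidden state (num_layers, batch, hidden)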

Embeddings and GloVe

Remember how we represented each word in our "vocabulary" as an integer? Integers are not the best way to capture information about words and their meanings. Think about it:

  1. When you provide numbers to a network, it will treat its magnitude as a property of that word. Is there any basis for a word with index 1 to have smaller magnitude than a word with index 500?
  2. Where there are numbers, you can compute distances. Would it make sense to have word 10 closer to 1 than word 20?

Since the allotment of integers to words is arbitrary (first-come-first-served, actually), it doesn't make sense to use them as is. One way to tackle this issue would be to represent each word as a one-hot vector (a binary vector with 1 at the word's index and 0 everywhere else). This, unfortunately, again has problems.

  1. Our vocabulary is 8372 words strong. Each word would be an 8372-dimensional, extremely sparse vector. If you build tensors out of these with a float data type, that would be an astronomical waste of memory (assuming you can even store that much).
  2. The distance issue is now different. Earlier, we had arbitrary distances between words. Now, each word is equidistant from every other word! That's equally unacceptable. Why? Consider the words (NLP aficionados will cringe): king, queen, vase, Sunday. Don't you think "king" and "queen" should have less distance between them than with the other words? Capturing this is extremely important for understanding relations between words in a corpus.

What else can we do? Well, how about we leave it to the network? We'll give it 8372 randomly initialized vectors (of a smaller dimension, say a few hundred features) and let it learn the required spatial relations between words by modifying these vectors as it trains. Vectors generated through this process are called word embeddings. Enter GloVe!
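
Before bringing GloVe in, here's a tiny sketch of such a trainable lookup table in PyTorch - an nn.Embedding layer with a made-up 10-word vocabulary and 4-dimensional vectors:

# A trainable embedding lookup: token ids in, dense vectors out
emb = torch.nn.Embedding(num_embeddings=10, embedding_dim=4)   # randomly initialized
tokens = torch.LongTensor([[1, 5, 2]])                         # one "sentence" of three token ids
print(emb(tokens).shape)                                       # torch.Size([1, 3, 4])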

GloVe example, from the Stanford NLP GitHub

GloVe stands for Global Vectors. Think of these vectors as a lookup table for our words: we go to this "learned" table, ask for the vector corresponding to a word, and replace the word's integer token with this vector. GloVe is an excellent set of pre-trained vectors trained on very large corpora such as Common Crawl and Wikipedia. These vectors capture co-occurrences of words really well, and we'll be using them here. You can download pre-trained GloVe vectors here, generously made open-source by the Stanford NLP group (for this tutorial, we are using the 6B vocab set (822 MB), trained on Wikipedia 2014).

Model structure

Finally! Here's what our model will look like (overall).

Model architecture

The inputs to the model are the preprocessed image (224, 224, 3) and the integer tokens of the captions. The image's features are extracted (we have already done this) and reduced to 512 dimensions using a Linear layer with ReLU activation. 512 is an arbitrary choice; feel free to try other dimensions. We will use categorical cross-entropy loss (log softmax + negative log-likelihood loss in PyTorch) for updating the parameters. Also, we'll use an Adam optimizer with a constant learning rate.

Providing the caption tokens, however, is not that straightforward. Imagine how YOU, as a human, would go about this.

  1. Look at the image, extract some information and keep that in mind.
  2. Generate the first word.
  3. Look at the first word, remember what you saw in the image, and generate a second word so that the sentence so far is grammatically correct.
  4. Look at the first and second words, and generate a third word. Refer to the image again if needed.
  5. Continue this process until you think all of your words have been generated.

We will follow the same pattern for our network. So, our caption inputs will look like this.

Caption input format
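
Concretely, one tokenized caption gets unrolled into several (input, target) pairs - the same logic you will see inside the data generator below (the token ids here are made up for illustration):

# How one caption becomes several training examples
caption_seq = [2, 42, 7, 253, 3]          # startseq ... endseq, as integer tokens
for i in range(1, len(caption_seq)):
    print(caption_seq[:i], "->", caption_seq[i])
# [2] -> 42
# [2, 42] -> 7
# [2, 42, 7] -> 253
# [2, 42, 7, 253] -> 3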

Data Generator

I know this has gone on too long, but here's one last thing before we start coding. The datasets used for tasks like this are typically very large. Plus, for computing the cross-entropy, your predictions are vectors of size 8372 for every single target word. If you try to pass the entire dataset in this form to your machine, it might run out of memory soon (and hang). To prevent this, we use a data generator.

A data generator is defined the same way a Python function is. The only difference is that it continuously yields (returns) batches of examples rather than returning once and falling silent. You can iterate over a generator object using the next() built-in function in Python. Whenever it is called, the generator resumes from the point where it stopped last time - unlike a function, which runs from the start each time.
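
If generators are new to you, here is the smallest possible example of this yield/next behaviour (purely illustrative):

# A minimal generator: execution pauses at yield and resumes there on the next call
def count_up():
    i = 0
    while True:
        yield i
        i += 1

gen = count_up()
print(next(gen), next(gen), next(gen))   # 0 1 2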

We are first going to build the data generator for our model. Then, we will move on to building the model itself.

def data_generator(img_features, captions_dict, batch_size):
    
    # Initialize lists to store inputs and targets
    img_in, caption_in, caption_trg = [], [], []
    
    # Counter to check how many examples have been added
    count = 0
   
    # Run as and when called
    while True:
        for name, caption_list in captions_dict.items():
            
            # Get the relevant image features
            img_fs = img_features[name]
            
            for caption in caption_list:
                
                # Tokenize the sentence using the tokenizer
                # Remember to pass the string in a list 
                caption_seq = tokenizer.texts_to_sequences([caption])[0]
                
                for i in range(1, len(caption_seq)):
                    # Input is sequence until before i_th time step
                    # Target is the i_th time step
                    in_seq, trg_seq = caption_seq[:i], caption_seq[i]
                    
                    # Add these to the input/output lists
                    img_in.append(img_fs)
                    caption_in.append(in_seq)
                    caption_trg.append(trg_seq) # No need to one-hot encode this
                    count += 1
                
                    if count == batch_size:
                        # Pad the caption inputs since they all have different lengths
                        # Pre padding because if the sentence is too long, the LSTM ...
                        # ... might lose memory of important words that came long ago
                        caption_in = pad_sequences(caption_in, padding='pre')

                        # Yield these as torch tensors
                        yield (
                            torch.FloatTensor(img_in).squeeze(1),
                            torch.LongTensor(caption_in),
                            torch.LongTensor(caption_trg)
                        )

                        # I did .squeeze(1) for the image features because ...
                        # ... for me they have shape (batch size, 1, 4096) ...
                        # ... and I want to collapse the dimension at position 1
                        # So now it becomes (batch size, 4096)

                        # Reinitialize the lists
                        img_in, caption_in, caption_trg = [], [], []
                        
                        # Reset counter
                        count = 0

Let's test a sample of this generator and see the kind of data it outputs.

# Let's see if it's behaving how we want it to

gen = data_generator(train_features, train_captions, 32)

# Generate a batch by calling next() on the generator
img_in, caption_in, caption_trg = next(gen)

print("Image features:", img_in.shape)
print("Caption input:", caption_in.shape)
print("Caption target:", caption_trg.shape)
>>> Image features: torch.Size([32, 4096])
    Caption input: torch.Size([32, 15])
    Caption target: torch.Size([32])

Also, have a look at the contents of caption_in and caption_trg.

print(caption_in)
>>> tensor([[   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    2],
        [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    2,   42],
        [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            2,   42,    7],
        [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    2,
           42,    7,  253],
        [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    2,   42,
            7,  253,   10],
        [   0,    0,    0,    0,    0,    0,    0,    0,    0,    2,   42,    7,
          253,   10,  104],
        [   0,    0,    0,    0,    0,    0,    0,    0,    2,   42,    7,  253,
           10,  104,   35],
        [   0,    0,    0,    0,    0,    0,    0,    2,   42,    7,  253,   10,
          104,   35,  799],
        [   0,    0,    0,    0,    0,    0,    2,   42,    7,  253,   10,  104,
           35,  799,   10],
        [   0,    0,    0,    0,    0,    2,   42,    7,  253,   10,  104,   35,
          799,   10,  234],
        [   0,    0,    0,    0,    2,   42,    7,  253,   10,  104,   35,  799,
           10,  234,  750],
        [   0,    0,    0,    2,   42,    7,  253,   10,  104,   35,  799,   10,
          234,  750, 1952],
        [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    2],
        [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    2,   19],
        [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            2,   19,    4],
        [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    2,
           19,    4,   30],
        [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    2,   19,
            4,   30, 1787],
        [   0,    0,    0,    0,    0,    0,    0,    0,    0,    2,   19,    4,
           30, 1787,  817],
        [   0,    0,    0,    0,    0,    0,    0,    0,    2,   19,    4,   30,
         1787,  817, 1006],
        [   0,    0,    0,    0,    0,    0,    0,    2,   19,    4,   30, 1787,
          817, 1006,    6],
        [   0,    0,    0,    0,    0,    0,    2,   19,    4,   30, 1787,  817,
         1006,    6,  585],
        [   0,    0,    0,    0,    0,    2,   19,    4,   30, 1787,  817, 1006,
            6,  585,   10],
        [   0,    0,    0,    0,    2,   19,    4,   30, 1787,  817, 1006,    6,
          585,   10,  750],
        [   0,    0,    0,    2,   19,    4,   30, 1787,  817, 1006,    6,  585,
           10,  750, 3199],
        [   0,    0,    2,   19,    4,   30, 1787,  817, 1006,    6,  585,   10,
          750, 3199,    4],
        [   0,    2,   19,    4,   30, 1787,  817, 1006,    6,  585,   10,  750,
         3199,    4,    5],
        [   2,   19,    4,   30, 1787,  817, 1006,    6,  585,   10,  750, 3199,
            4,    5,  102],
        [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    2],
        [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    2,   40],
        [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            2,   40,   19],
        [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    2,
           40,   19,  227],
        [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    2,   40,
           19,  227,  817]])
print(caption_trg)
>>> tensor([  42,    7,  253,   10,  104,   35,  799,   10,  234,  750, 1952,    3,
          19,    4,   30, 1787,  817, 1006,    6,  585,   10,  750, 3199,    4,
           5,  102,    3,   40,   19,  227,  817,  680])

So it does provide us data in the staircase pre-padded format that we want. Perfect! Let's now start building our model. We will create a Network class that will inherit methods from torch.nn.Module. Loosely, this means all the updates, parameter initializations, backpropagation, etc. will be taken care of by torch itself. All we have to do is define what layers it will have, and how tensors will flow through them.

# Model definition

class Network(torch.nn.Module):

    def __init__(self, glove_weights):

        # Inherit methods from torch.nn.Module
        super(Network, self).__init__()

        # Define layers
        self.fc_img = torch.nn.Linear(4096, 512)               # To reduce image features to 512
        self.embedding = torch.nn.Embedding(vocab_size, 200)   # We will initialize this with GloVe
        self.lstm = torch.nn.LSTM(200, 512, batch_first=True)  # Generates hidden representations
        self.fc_wrapper = torch.nn.Linear(1024, 1024)          # For some more flexibility
        self.fc_output = torch.nn.Linear(1024, vocab_size)     # Gives the output probabilities

        # Initialize the embedding layer's weights with GloVe
        # Although they are pre-trained, we'll let them train again ...
        # ... so they can become more configured to our data
        self.embedding.weight = torch.nn.Parameter(glove_weights)


    def forward(self, img_in, caption_in):

        # Reduce the image features to 512 dimensions
        x1 = self.fc_img(img_in)              # Shape: (batch_size, 512)
        x1 = F.relu(x1)

        # Generate embeddings for caption tokens
        x2 = self.embedding(caption_in)

        # Pass them through the LSTM layer and ...
        # ... preserve only the last output
        x2, _ = self.lstm(x2)             # Ignore the LSTM's hidden state
        x2 = x2[:, -1, :].squeeze(1)      # Shape: (batch_size, 512)

        # Concatenate x1 and x2 along the ...
        # ... features dimension (-1)
        x3 = torch.cat((x1, x2), dim=-1)  # Shape: (batch_size, 1024)

        # Pass this through the output layer to get predictions
        x3 = self.fc_wrapper(x3)
        x3 = F.relu(x3)

        x3 = self.fc_output(x3)           # Shape: (batch_size, vocab_size)

        # Log softmax the outputs
        out = F.log_softmax(x3, dim=-1)

        return out

Processing GloVe embeddings

Before we test our model, let's process the GloVe embeddings so we can initialize the embedding layer with their weights. We have a text file where each line contains a word followed by its embedding values, with lines separated by newline characters "\n". We'll read them in and store them in a list to start with.

Note: This file is 690 MB+. Ensure you have enough RAM available when you're loading it.

# Read GloVe files

with open("../glove/glove.6B.200d.txt", "r") as f:
    glove = f.read().split("\n")
    
print(glove[0])
>>> 'the -0.071549 0.093459 0.023738 -0.090339 0.056123 0.32547 -0.39796 -0.092139 0.061181 -0.1895 0.13061 0.14349 0.011479 0.38158 0.5403 -0.14088 0.24315 0.23036 -0.55339 0.048154 0.45662 3.2338 0.020199 0.049019 -0.014132 0.076017 -0.11527 0.2006 -0.077657 0.24328 0.16368 -0.34118 -0.06607 0.10152 0.038232 -0.17668 -0.88153 -0.33895 -0.035481 -0.55095 -0.016899 -0.43982 0.039004 0.40447 -0.2588 0.64594 0.26641 0.28009 -0.024625 0.63302 -0.317 0.10271 0.30886 0.097792 -0.38227 0.086552 0.047075 0.23511 -0.32127 -0.28538 0.1667 -0.0049707 -0.62714 -0.24904 0.29713 0.14379 -0.12325 -0.058178 -0.001029 -0.082126 0.36935 -0.00058442 0.34286 0.28426 -0.068599 0.65747 -0.029087 0.16184 0.073672 -0.30343 0.095733 -0.5286 -0.22898 0.064079 0.015218 0.34921 -0.4396 -0.43983 0.77515 -0.87767 -0.087504 0.39598 0.62362 -0.26211 -0.30539 -0.022964 0.30567 0.06766 0.15383 -0.11211 -0.09154 0.082562 0.16897 -0.032952 -0.28775 -0.2232 -0.090426 1.2407 -0.18244 -0.0075219 -0.041388 -0.011083 0.078186 0.38511 0.23334 0.14414 -0.0009107 -0.26388 -0.20481 0.10099 0.14076 0.28834 -0.045429 0.37247 0.13645 -0.67457 0.22786 0.12599 0.029091 0.030428 -0.13028 0.19408 0.49014 -0.39121 -0.075952 0.074731 0.18902 -0.16922 -0.26019 -0.039771 -0.24153 0.10875 0.30434 0.036009 1.4264 0.12759 -0.073811 -0.20418 0.0080016 0.15381 0.20223 0.28274 0.096206 -0.33634 0.50983 0.32625 -0.26535 0.374 -0.30388 -0.40033 -0.04291 -0.067897 -0.29332 0.10978 -0.045365 0.23222 -0.31134 -0.28983 -0.66687 0.53097 0.19461 0.3667 0.26185 -0.65187 0.10266 0.11363 -0.12953 -0.68246 -0.18751 0.1476 1.0765 -0.22908 -0.0093435 -0.20651 -0.35225 -0.2672 -0.0034307 0.25906 0.21759 0.66158 0.1218 0.19957 -0.20303 0.34474 -0.24328 0.13139 -0.0088767 0.33617 0.030591 0.25577'

Each line is a string with the first element being the word and the remaining elements its 200-dimensional embedding values. We will split this string on whitespace, separate the word from the embeddings and convert the embeddings to float values. Let's store this in a dictionary so it feels more like a lookup table. Some of the lines have issues, so we wrap the parsing in a try/except block; whenever an error occurs, the loop simply moves on to the next line.

# Initialize the dictionary
glove_dict = {}

for line in glove:
    try:
        elements = line.split()
        word, vector = elements[0], np.array([float(i) for i in elements[1:]])
        glove_dict[word] = vector
    except:
        continue

Our vocabulary definitely does not have all the words that are in this dictionary. Also, there might be some obscure words in our vocabulary that aren't in the dictionary. Our goal now is to construct a tensor with one row per word in our vocabulary (8372 rows), where a word's row is set to the corresponding GloVe vector if the word is present there; otherwise it is left random.

# Generate the embedding matrix of shape (vocab_size, 200)
# Our vocabulary is accessible as a dictionary using tokenizer.word_index

# Initialize random weight tensor
glove_weights = np.random.uniform(0, 1, (vocab_size, 200))
found = 0

for word in tokenizer.word_index.keys():
    if word in glove_dict.keys():
        # If word is present, replace the vector at its location ...
        # ... with corresponding GloVe vector
        glove_weights[tokenizer.word_index[word]] = glove_dict[word]
        found += 1
    else:
        # Otherwise, let it stay random
        continue
        
print("Number of words found in GloVe: {} / {}".format(found, vocab_size))
>>> Number of words found in GloVe: 7715 / 8372

Great! 7715 words from our vocabulary of 8372 words were found in GloVe. Now let's initialize a sample model and check if its outputs are as we expect them to be.

# Initialize the model
model = Network(glove_weights=torch.FloatTensor(glove_weights))

# Pass the batch we generated earlier
preds = model(img_in, caption_in)

# Print shape of output
print("Output shape:", preds.shape)
>>> Output shape: torch.Size([32, 8372])

This seems to be working fine as well. Now we'll write a few helper functions that we'll need to train the model.

def compute_loss(predictions, target):
    """ 
        Computes the negative log-likelihood loss for 
        the predictions, given the targets 
    """
    # Expected shapes of inputs:
    # Prediction: (batch_size, vocab_size)
    # Targets: (batch_size)
    
    return F.nll_loss(predictions, target)
    
    
def learning_step(img_in, caption_in, caption_trg):
    """ 
        Given a batch of inputs and outputs, this function
        trains the model on it and updates its parameters
    """
    # Zero out optimizer gradients
    optimizer.zero_grad()
    
    # Generate predictions
    preds = model(img_in, caption_in)
    
    # Compute loss
    loss = compute_loss(preds, caption_trg)
    
    # Compute gradients by backpropagating loss 
    loss.backward()
    
    # Update model parameters through the optimizer
    optimizer.step()
    
    # Return the loss as a plain float so we don't keep the computation graph around
    return loss.item()

Training the model

Alright, let's train the model now. Here's our training strategy:

  1. For every epoch, initialize the data generator and loss counter.
  2. Perform len(all_captions) // batch_size number of iterations on the generator. This will ensure that we go over each image and all of its captions in the training dataset.
  3. Perform a learning step for the batch generated during an iteration and update relevant objects.
  4. Perform some console outputs so you can track its progress.
  5. Save the model.

It is recommended that you reproduce this code in a Python script and run the script as a whole, instead of training the model here. This is important for people using GPUs and CUDA for training the model.

epochs = 100
batch_size = 32
steps_per_epoch = len(all_captions) // batch_size

# Select the device, then initialize the model and optimizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Network(glove_weights=torch.FloatTensor(glove_weights)).to(device)
optimizer = optim.Adam(model.parameters(), lr=0.0003)


# Training loop
print("\n")


for epoch in range(epochs):
    print("Epoch {}".format(epoch+1))
    print("--------------------------------")

    # Initialize the data generator and loss counter
    d_gen = data_generator(train_features, train_captions, batch_size)
    total_loss = 0

    # Perform iterations over the generator
    for batch in range(steps_per_epoch):

        # Create a batch
        img_in, caption_in, caption_trg = next(d_gen)

        img_in = img_in.to(device)
        caption_in = caption_in.to(device)
        caption_trg = caption_trg.to(device)

        # Performing a learning step and record loss
        loss = learning_step(img_in, caption_in, caption_trg)

        # Add to total loss
        total_loss += loss

        # Provide a status update every 500 steps
        if batch % 500 == 0:
            print("Epoch {} - Batch {} - Loss {:.4f}".format(
                epoch+1, batch, loss
            ))

    # Average the loss for this epoch and print it
    epoch_loss = total_loss/steps_per_epoch

    print("\nEpoch {} - Average loss {:.4f}".format(
        epoch+1, epoch_loss
    ))

    # Save the model
    torch.save(model.state_dict(), "../saved_data/models/model_{}".format(epoch+1))

    print("\n======================================\n")

Evaluation

Now that we have a trained model, we'll see how well it does. We will write a translation function, and another function to quantify its performance using a performance metric called BLEU score.

Translation

We follow the same strategy for translation as we did while training. The translate function only needs the image, which we provide as its VGG-16 features. The startseq token starts off the prediction. Each time we predict a word, we add it to our sentence, tokenize the new sentence (and pad it), and continue the procedure. The loop is broken either when endseq is encountered or when the maximum length is reached.

BLEU scores

To compare model performance on seq2seq tasks (ones where the model generates a sequence and we have to match it against reference sequences to see how well it has done), we use BLEU scores. We will not discuss their computation in detail (you can read more about it here), but below is what we need to know.

The metric checks the correctness of the generated sentence at multiple levels, as specified by the user. That is, it checks uni-gram matches (single words, BLEU-1), bi-gram matches (pairs of consecutive words, BLEU-2), tri-grams, and so on. For our purposes, we will check the values of BLEU-1 to BLEU-4. A decent model should have its BLEU scores in the following ranges (taken from Where to put the Image in an Image Caption Generator, a 2017 paper).

  1. BLEU-1: 0.401 to 0.578
  2. BLEU-2: 0.176 to 0.390
  3. BLEU-3: 0.099 to 0.260
  4. BLEU-4: 0.059 to 0.170

To compute BLEU scores, we will use the corpus_bleu function from nltk.translate.bleu_score. It needs 3 parameters (a small sanity check follows the list below):

  1. List of references: List of documents (lists as well), each document being the set of possible correct translations.
  2. List of hypotheses: List of predictions.
  3. Weights: One weight per n-gram order; these determine the value of K in the BLEU-K score.
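
Here is a tiny sanity check of corpus_bleu on made-up token lists, mainly to show how the references and hypotheses are nested:

# One prediction, with two reference captions for it (all tokenized into word lists)
refs = [
    [["a", "dog", "runs", "on", "grass"], ["a", "dog", "runs", "outside"]],
]
hyps = [
    ["a", "dog", "runs", "on", "grass"],
]
print(corpus_bleu(refs, hyps, weights=(1.0, 0, 0, 0)))   # 1.0, since the prediction matches a reference exactly
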
# Translate function

def translate(features):
    
    # Convert features to a Float Tensor
    features = torch.FloatTensor(features)
    
    # String to which words keep getting added 
    result = "startseq "
    
    for t in range(1, max_length-1):
        
        # Tokenize the current sentence
        in_seq = tokenizer.texts_to_sequences([result])
        
        # Pad it to max_length and convert to torch LongTensor
        in_seq = pad_sequences(in_seq, maxlen=max_length, padding='pre')
        in_seq = torch.LongTensor(in_seq)
        
        # Generate predictions and pick out the predicted word ...
        # ... from tokenizer's index_word map
        preds = model(features, in_seq)
        pred_idx = preds.argmax(dim=-1).detach().numpy()[0]
        word = tokenizer.index_word.get(pred_idx)
        
        # If no word is returned or endseq is returned, stop the process
        if word is None or word == 'endseq':
            break
            
        # Otherwise add the predicted word to the sentence
        result += word + " "
        
    # Return the sentence minus the startseq
    return " ".join(result.split()[1:])


# Function to compute BLEU scores

def evaluate_model(feature_dict, caption_dict):
    
    # Initialize lists to store references and hypotheses
    references = []
    hypotheses = []
    
    for name in tqdm(feature_dict.keys()):
            
        # Generate prediction and append that to hypotheses
        prediction = translate(feature_dict[name])
        hypotheses.append(prediction.split())

        # Get all reference captions in a list and append ...
        # ... it to references
        refs = [caption.split() for caption in caption_dict[name]]
        references.append(refs)
            
    # Compute BLEU scores
    bleu_1 = corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0))
    bleu_2 = corpus_bleu(references, hypotheses, weights=(0.5, 0.5, 0, 0))
    bleu_3 = corpus_bleu(references, hypotheses, weights=(0.33, 0.33, 0.33, 0))
    bleu_4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))
    
    print("BLEU-1: {:.4f}".format(bleu_1))
    print("BLEU-2: {:.4f}".format(bleu_2))
    print("BLEU-3: {:.4f}".format(bleu_3))
    print("BLEU-4: {:.4f}".format(bleu_4))

Loading saved model

Let's now load the saved model and see how well it performs.

# We'll accept captions that are at most 20 words long
max_length = 20

# Initialize a new model
model = Network(glove_weights=torch.FloatTensor(glove_weights))

# Copy trained model's weights and other parameters to it.
# map_location for CUDA users only, it loads everything to your CPU
model.load_state_dict(torch.load("../saved_data/models/model_36", map_location=torch.device('cpu')))
# Train data evaluation
evaluate_model(train_features, train_captions)

# Test data evaluation
evaluate_model(test_features, test_captions)

Here are the BLEU scores I received for this model.

>>> Training evaluation
      BLEU-1: 0.4176
      BLEU-2: 0.2775
      BLEU-3: 0.2132
      BLEU-4: 0.1733
      
>>> Testing evaluation
      BLEU-1: 0.3485
      BLEU-2: 0.1750
      BLEU-3: 0.0933
      BLEU-4: 0.0448

The model performs surprisingly well on the training dataset, and decently on the test dataset. There could be multiple reasons for this.

  1. Overfitting. There might have been too many parameters to train for the amount of data we have. Given the gap between the training and test metric values (and the brilliant performance on training data), this is the most likely cause.
  2. There were elements in the test-set images that the model didn't encounter very often in the training data. It didn't learn the words describing them very well, and so it couldn't reproduce them during testing.
  3. The model needs to train more. This would depend on heuristics, however.

Examples

Let's now look at some examples and see how good or bad it really is. They're not perfect; considering how simple the model is, I think they're decent.

example 1

example 2

example 3

example 4

Endnote

Here you are! A decent image caption generator in no time (excluding the training time). The model I proposed here is very simplistic. There are definitely much better models out there that can do the job far better than this one. However, this was meant to be a primer and a fun project for people venturing into natural language processing and computer vision.

Experiment a little more and you might find something better than what I have here. Here's some food for thought.

  1. While generating captions, what does the model actually use? Does it use the image features that we attached, or the words generated so far? A combination of both? If so, what kind of combination? You might want to have a look at attention mechanisms for natural language processing.
  2. What do the features extracted by VGGNet represent? If there are three images, two with dogs and one with kids, will the vectors for the dog images be closer to each other than to the vector for the kids' image?

What deep learning is truly capable of is bound (or not) by your creativity and imagination.
