
COCO Context Collector - Multimodal Learning

PyTorch OpenCV CMake nVIDIA

It's a Contextualizer, trained on COCO! See what I did there?

This mixed vision-language model gets better by making mistakes

p1

Trained on COCO (50 GB, 2017 challenge)

git clone https://github.com/AndreiMoraru123/ContextCollector.git
cd ContextCollector
chmod +x make
./make

Via the Python API

pip install pycocotools

Click here to see some more examples

p2

Based on the original paper: Show, Attend and Tell

Frame goes in, caption comes out.

Note

Make sure to check the original implementation first, because this is the model that I am using.

p3

Motivation

The functional purpose of this project can be summed up as Instance Captioning: rather than captioning the whole frame, it captions only a part of it. This approach is not only faster (the model does not attempt to encode the information of the whole image), but it can also prove more reliable for video inference, through a very simple mechanism I will call "expansion".

The deeper motivation for working on this is, however, more profound.

For decades, language and vision were treated as completely different problems, and the engineering paths that emerged to solve them were naturally divergent from the start.

Neural networks, while perhaps the truce between the two (their application in deep learning considerably improved both language and vision), still rely today mostly on different techniques for each task, as if language and vision were disconnected from one another.

The latest show in town, the Transformer architecture, has brought great advances to the world of language models, following the original paper Attention is All You Need that paved the way to models like GPT-3. While that success has not been fully transferred to vision, some breakthroughs have been made: An Image is Worth 16x16 Words, SegFormer, DINO.

One of the very newest (time of writing: fall 2022) is Google's LM-Nav, a large vision + language model used for robotic navigation. What is thought-provoking about this project is the ability of a combined V+L model to "understand" the world better than a V or L model would on its own. Perhaps human intelligence itself is the sum of smaller combined intelligent models. The robot is presented with conflicting scenarios and can even "tell" whether a prompt makes sense as a navigational instruction or is impossible to fulfil.

p4

Vocabulary and Data

As the official dataset homepage states, "COCO is a large-scale object detection, segmentation, and captioning dataset".

For this particular model, I am concerned with detection and captioning.

Before the CocoDataset can be created in the cocodata.py file, a vocabulary instance of the Vocabulary class has to be constructed using the vocabulary.py file. This can be conveniently done using the tokenize function of the nltk module.

The Vocabulary is simply the collection of words that the model needs to learn. It also needs to convert said words into numbers, as the decoder can only process them as such. To be able to read the output of the model, they also need to be converted back. These two are done using two hash maps (dicts), word2idx and idx2word.

As with all sequence-to-sequence models, the vocab has to have a known <start> token, as well as an <end> one. An <unk> token stands in for unknown words, i.e. those not added to the file, and effectively acts as a selector for what gets in.

The vocabulary is, of course, built on the COCO annotations available for the images.

The important thing to know here is that each vocabulary generation can (and should) be customized. The instance will not simply add all the words that it can find in the annotations file, because a lot would be redundant.

For this reason, two vocabulary hyper-parameters can be tuned:

word_threshold = 6  # minimum word count threshold (if a word occurs less than 6 times, it is discarded)
vocab_from_file = False  # if True, load existing vocab file. If False, create vocab file from scratch

and, because the inference depends on the built vocabulary, the word_threshold can be set only while in training mode, and the vocab_from_file trigger can only be set to True while in testing mode.

Building the vocabulary will generate the vocab.pkl pickle file, which can then be later loaded for inference.
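
As an illustration (a minimal sketch with hypothetical names, not the exact vocabulary.py code), counting tokens with nltk and keeping only the words that clear word_threshold could look like this:

```python
import pickle
from collections import Counter

import nltk  # assumes the 'punkt' tokenizer data has been downloaded


def build_vocab(captions, word_threshold=6, vocab_file="vocab.pkl"):
    """Sketch: build word2idx / idx2word from a list of caption strings."""
    counter = Counter()
    for caption in captions:
        counter.update(nltk.tokenize.word_tokenize(caption.lower()))

    # special tokens first, then every word that clears the threshold
    words = ["<start>", "<end>", "<unk>"]
    words += [word for word, count in counter.items() if count >= word_threshold]

    word2idx = {word: idx for idx, word in enumerate(words)}
    idx2word = {idx: word for word, idx in word2idx.items()}

    with open(vocab_file, "wb") as f:
        pickle.dump((word2idx, idx2word), f)
    return word2idx, idx2word
```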

p5

Model description

$$I \to \text{Input ROI (region of interest)}$$

$$S = \{ S_0, S_1, \dots, S_n \} \to \text{Target sequence of words}, \; S_i \in \mathbb{R}^{K}$$

$$\text{where } K = \text{the size of the dictionary}$$

$$p(S \mid I) \to \text{likelihood}$$

$$\text{The goal is to tweak the params in order to max the probability of a generated sequence being correct given a frame:}$$

$$\theta^{*} = \arg \max_{\theta} \log p(S \mid I; \theta)$$

$$\log p(S \mid I) = \sum_{i=1}^{n} \underbrace{\log p(S_i \mid S_{1},\dots,S_{i-1},I)}_{\text{modeled with an RNN}}$$

Then the forward feed is as follows:

  1. The image is first (and only once) encoded into the annotation vectors
$$x_{-1} = \text{CNN}(I)$$
  1. The context vectors are calculated from both the encoder output, and the hidden state (initially a mean of the encoder output), using Bahdanau alignments.
$$x_t = W_e S_t, \; t \in \{0, \dots, N-1\} \to \text{ this is a joint embedding representation of the context vector}$$
  1. The model outputs the probability for the next word, given the current word (the first being the <start> token). It keeps on going until it reaches the <end> token.
$$p_{t+1} = \text{LSTM}(x_t), t \in \{0, \dots, N-1\}$$

The attention itself is the alignment between the encoder's output (vision) and the decoder hidden state (language):

$$e_t = f_{\text{att}}(a, h_{t-1}) \quad\text{(a miniature neural network with a non-linear activation of two linear combinations)}$$

$$h_{t-1} = \text{hidden state} \quad\text{and} \quad a = \text{annotation vectors}$$

$$a = \{a_1, a_2, \dots, a_L\}, \; a_i \in \mathbb{R}^D \quad (D = 2048,\; L = 14 \times 14 = 196)$$

$$\text{In this equation, $a$ represents the output feature map of the encoder, which is a collection of $L$ activations.}$$

$$\text{Each activation $a_i$ corresponds to a pixel of the encoded image and is a vector of dimension $D=2048$,}$$

$$\text{obtained by projecting the pixel features into a high-dimensional space.}$$

$$\text{Collectively, the feature map $a$ captures information about the contents of the input image.}$$

$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_k \exp(e_{t,k})} \quad\text{(probability of each pixel being worth attending to)}$$

$$\quad\text{(results in the instance segmentation-like effect seen in the paper)}$$

$$\text{awe} = f(\{a_i\}, \{\alpha_i\}) = \beta \sum_i \alpha_i a_i \quad\text{(attention weighted encoding)}$$

$$\quad\text{(element-wise multiplication of each pixel and its probability,}$$

$$\quad\text{which yields a weighted sum vector when added up across the pixels' dimension)}$$

$$\beta = \sigma(f_b(h_{t-1})) \quad\text{(gating scalar used in the paper to achieve better results)}$$

The expansion mechanism builds upon detection in the following way:

$$\text{If } S_i \neq \text{label} \;\; \forall i \in \{1, \dots, n\}, \text{ then } I \leftarrow I + \phi \cdot I, \text{ where } 0 \leq \phi \leq 1 \text{ and } I \leq I + \phi \cdot I \leq I_{\max}$$

Which means any time none of the output words match the prediction of the detector, the ROI in which the model looks is resized, therefore allowing the model to "collect more context". In this case, label is the category prediction of YOLO.
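
As a hypothetical illustration of the resize step (this helper and its names are mine, not from the repo), the ROI could be grown and clamped like this:

```python
def expand_roi(x, y, w, h, phi, frame_w, frame_h):
    """Sketch: grow a bounding box by a factor phi, clamped to the frame size."""
    new_w, new_h = w * (1 + phi), h * (1 + phi)
    # keep the box centered while it grows
    new_x = max(0.0, x - (new_w - w) / 2)
    new_y = max(0.0, y - (new_h - h) / 2)
    # never grow past the borders of the frame
    new_w = min(new_w, frame_w - new_x)
    new_h = min(new_h, frame_h - new_y)
    return int(new_x), int(new_y), int(new_w), int(new_h)
```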

As found in model.py

Encoder

The encoder is a beheaded, pretrained ResNet-152 model that outputs a feature map of size 2048 x W x H for each image, where W and H are both the encoded_image_size used in the last average pooling. The original paper proposed an encoded size of 14.

Since ResNet was originally designed as a classifier, the last layer is going to be the activation function Softmax.

However, since PyTorch deals with the probabilities implicitly through CrossEntropyLoss, that classifier will not be present. The only layers that need to be beheaded are the last fully connected layer and the average pooling layer, the latter being replaced by a custom average pooling layer whose pooling size you and I get to choose.

The freeze_grad function is there if you need to tailor how many (if any) of the encoder layers you want to train (optional, since the network is pretrained).

The purpose of the resulting feature map is to provide a latent space representation of each frame, from which the decoder can draw multiple conclusions.

Any ResNet architecture (of any depth) will work here, as will some of the earlier CNNs (the paper used VGG), but keep in mind memory constraints for inference.
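
A minimal sketch of such a beheaded encoder (assuming torchvision's resnet152 and the paper's encoded size of 14; the actual model.py may differ in details):

```python
import torch
import torch.nn as nn
import torchvision


class Encoder(nn.Module):
    """Sketch: beheaded ResNet-152 with a custom adaptive average pooling size."""

    def __init__(self, encoded_image_size: int = 14):
        super().__init__()
        resnet = torchvision.models.resnet152(pretrained=True)
        # drop the final average pooling and fully connected classifier head
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        # replace the pooling with one whose output size we choose
        self.pool = nn.AdaptiveAvgPool2d((encoded_image_size, encoded_image_size))

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        features = self.pool(self.backbone(images))  # (batch, 2048, 14, 14)
        return features.permute(0, 2, 3, 1)          # (batch, 14, 14, 2048)
```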

You can check how torchvision implements this below:

image

p6

Attention

Here is an interesting experiment on human perception conducted by Corbetta & Shulman to go along with this:

Why?

"One important property of human perception is that one does not tend to process a whole scene in its entirety at once. Instead humans focus attention selectively on parts of the visual space to acquire information when and where it is needed" -- Recurrent Models of Visual Attention

The great gain of using attention as a mechanism in the decoder is that the importance of the information contained in the encoded latent space is taken into account and weighted (across all pixels of the latent space). Namely, the attention lifts the burden of having a single dominant state taking guesses about the context of the information drawn from the features. The results are actually quite astounding when compared to an attention-less network (see previous project).

Where?

Since the encoder is already trained and can output a competent feature map (we know that ResNet can classify images), the mechanism of attention is used to augment the behaviour of the RNN decoder. During the training phase, the decoder learns which parts of the latent space make up the "context" of an image. The selling point of this approach is that the learning is not done in a simple, sequential manner; some non-linear interpolations can occur, in such a way that you could make a strong case that the model has actually "understood" the task.

What kind?

The original paper, as well as this implementation, use Additive / Bahdanau Attention

The formula for the Bahdanau Attention is essentially the following:

alpha = tanh((W1 * e) + (W2 * h))

where e is the output of the encoder, h is the previous hidden state of the decoder, and W1 and W2 are trainable weight matrices, producing a single number. (Note that the original paper also used tanh as a pre-activation before softmax. This implementation instead uses ReLU.)

Additive attention is a model in and of itself, because it is in essence just a feed forward neural network. This is why it is built as an nn.Module class and inherits a forward call.

But how does Attention actually work here?

The paper itself cites Bahdanau, but does not go in depth on the reasoning behind this architecture. Here is how to make sense of it:

The matrices W1 and W2 have the purpose of projecting the encoder features and the hidden state of the decoder into the same dimensionality, so that they can be added.

Adding them element-wise means the model is forced to minimize the loss for the features of the image as well as its captions, so it "must find" some connection between them.

As attention is going to be non-linear, we activate the sum using ReLU or tanh. The result is then squeezed into a single neuron, which, once softmax-ed, will hold the probability of each neuron being worth "attending to". Notice that the features of the encoder are expressed as a number of pixels, not W x H, as they were passed through a view before the attention call. This means that the single-neuron computation is done for all the pixels in the annotation vector.

Below is a gif from TensorFlow playground that serves as a simplified example:

tfplay

For the two features of the data, the X and Y coordinates, we can use 4 neurons to learn 4 lines, one line per neuron. This is what the projection into attention_dim is doing. The final neuron can then learn a linear combination of the previous 4 in the hidden layer. This is what the full_att layer is essentially doing by mapping the attention_dim neurons to a single one.

Therefore, after getting the probability of each neuron being attended to, we can multiply these probabilities with the pixel values themselves and sum across that dimension. This results in a weighted sum, and this is exactly the context vector the paper is talking about. (When you sum across a dimension, say 196 for the number of pixels, you lose that dimension as it becomes 1; this is how the vectors are turned into a single vector, which can then be passed to the LSTM for computation.)
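
Putting the pieces together, here is a sketch of the additive attention module described above (layer names such as encoder_att and decoder_att are illustrative; full_att and attention_dim mirror the names used in this section):

```python
import torch
import torch.nn as nn


class BahdanauAttention(nn.Module):
    """Sketch: additive attention over the encoder's pixels."""

    def __init__(self, encoder_dim=2048, decoder_dim=300, attention_dim=300):
        super().__init__()
        self.encoder_att = nn.Linear(encoder_dim, attention_dim)  # W1
        self.decoder_att = nn.Linear(decoder_dim, attention_dim)  # W2
        self.full_att = nn.Linear(attention_dim, 1)               # squeeze to one score per pixel
        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=1)

    def forward(self, encoder_out, hidden):
        # encoder_out: (batch, num_pixels, encoder_dim), hidden: (batch, decoder_dim)
        att1 = self.encoder_att(encoder_out)                  # (batch, num_pixels, attention_dim)
        att2 = self.decoder_att(hidden).unsqueeze(1)          # (batch, 1, attention_dim)
        e = self.full_att(self.relu(att1 + att2)).squeeze(2)  # (batch, num_pixels)
        alpha = self.softmax(e)                               # probability per pixel
        awe = (encoder_out * alpha.unsqueeze(2)).sum(dim=1)   # attention weighted encoding
        return awe, alpha
```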

Here is a gif so you can find the concepts of the paper in code easier:

attention

p7

Decoder

I am using pretty much the same implementation proposed in the greatly elaborated Image Captioning repo with some caveats. Precisely:

  1. I do not use padded sequences for the captions
  2. I tailored tensor dimensions and types for a different pipeline (and dataset as well, the repo uses COCO 2014), so you may see differences
  3. I am more lax with using incomplete captions in the beam search and I am also not concerned with visualizing the attention weights

The aforementioned implementation is self-sufficient, but I will further explain how the decoder works for the purpose of this particular project, as well as the statements above.

The main idea of the model workflow is that the Encoder passes a "context" feature to the decoder, which in turn produces an output. Since the decoder is an RNN, the outputs are given in sequences. The recurrent network can take into account the input features as well as its own hidden state.

The attention weighted encoding is gated through a sigmoid activation and the resulting values are concatenated with the embedding of the previous word. This concatenation is then passed as the input to an LSTMCell, along with the previous hidden state.

p8

The LSTM Cell

The embedded image captions are concatenated with gated attention encodings and passed as the input of the LSTMCell. If this were an attentionless mechanism, you would just pass the encoded features added to the embeddings.

Concatenation in code will look like this:

self.lstm = nn.LSTMCell(embeddings_size + encoded_features_size, decoded_hidden_size)  

The decoded dimension, i.e. the hidden size of the LSTMCell, is obtained by concatenating the hidden and cell states. This is called a joint embedding architecture, because, well, you are smashing them both into the same vectorized world representation.

hidden_state, cell_state = self.lstm( torch.cat([embeddings[:batch_size_t, t, :], attention_weighted_encoding], dim=1),  # input
                                      (hidden_state[:batch_size_t], cell_state[:batch_size_t]) )  # hidden

The cell outputs a tuple made out of the next hidden and cell states like in the picture down below.

The intuition and computation behind the mechanism of the long short-term memory unit are as follows:

The cell operates with a long term memory and a short term one. As their names intuitively convey, the former is concerned with a more general sense of state, while the latter is concentrated around what it has just seen.

In the picture up above as well as in this model, h represents the short term memory, or the hidden state, while c represents the long term memory, or the cell state.

  1. The long term memory is initially passed through a forget gate. The forget factor of this gate is computed using a sigmoid, which ideally behaves like a binary selector (something either gets forgotten [0] or not [1]). In practice, most values will not be saturated, so the information will be somewhat forgotten (somewhere in (0, 1)). The current hidden state or short term memory is passed through the sigmoid to produce this forget factor, which is then point-by-point multiplied with the long term memory or cell state.
  2. The short term memory will be joined by the input event, x (which represents what the cell has just seen/experienced) in the input gate, also called the learn gate. This computation is done by gating both the input and the hidden state through an ignore gate. The ignore factor of the gate is represented by a sigmoid to again ideally classify what has to be ignored [0] and what not [1]. How much is to be ignored is then decided by a tanh activation.
  3. The long term memory, joined by the newly acquired information in the input gate, is passed into the remember gate and becomes the new cell state and the new long term memory of the LSTM. The operation is a point-by-point addition of the two.
  4. The output gate takes in all of the information from the input, hidden and cell state and becomes the new hidden state and short term memory of the network. The long term memory is passed through a tanh while the short term memory is passed through a sigmoid, before being multiplied point-by-point in the final computation.
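
A single decoding step, as described above, can be sketched as follows (attribute names such as attention, f_beta, lstm and fc are illustrative stand-ins for the decoder's layers, not necessarily the names used in model.py):

```python
import torch


def decode_step(decoder, embedding_t, encoder_out, hidden_state, cell_state):
    """Sketch: one time step of the attention decoder."""
    # attention weighted encoding and per-pixel probabilities
    awe, alpha = decoder.attention(encoder_out, hidden_state)
    # gating scalar beta, as in the paper
    gate = torch.sigmoid(decoder.f_beta(hidden_state))
    awe = gate * awe
    # concatenate the previous word embedding with the gated context vector
    hidden_state, cell_state = decoder.lstm(
        torch.cat([embedding_t, awe], dim=1), (hidden_state, cell_state)
    )
    # scores over the vocabulary for the next word
    scores = decoder.fc(hidden_state)
    return scores, alpha, hidden_state, cell_state
```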

Teacher Forcing

You may notice in the gif below that, during training, we are decoding every time based on the embeddings, which are the training labels themselves, instead of using the embeddings only for the first computation and then sending in the output predictions, like they did in Show and Tell. This is called Teacher Forcing, and you can imagine that it definitely speeds up the learning process:

teacherforcing

Now we have a new problem. What this means is that the model is going to memorize the captions by heart for each image, because the only prediction that minimizes the loss word for word for a given caption is going to be the exact same sentence.

Then why are we doing this? Here is the fascinating part: the model is not learning semantics and compositionality during training, but you can notice it is learning the alphas, which means it will remember what each word is supposed to look like in an image representation. This is why we are not calling the forward function during inference; that would be useless. What the authors do instead is use a beam search algorithm to form sentences different from the training labels, and you can find that in the sample function. This is the function you would call during inference.
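
To make the contrast concrete, here is a hedged sketch (reusing the hypothetical decode_step from above) of teacher forcing versus feeding back the model's own predictions:

```python
import torch


def decode_training(decoder, embeddings, encoder_out, h, c, teacher_forcing=True):
    """Sketch: embeddings is (batch, seq_len, embed_size) of ground-truth caption embeddings."""
    outputs = []
    prev_prediction = None
    for t in range(embeddings.size(1)):
        if teacher_forcing or t == 0:
            # input is always the ground-truth word at step t (<start> at t = 0)
            step_input = embeddings[:, t, :]
        else:
            # Show and Tell style: input is the embedding of the previous best guess
            step_input = decoder.embedding(prev_prediction)
        scores, alpha, h, c = decode_step(decoder, step_input, encoder_out, h, c)
        prev_prediction = scores.argmax(dim=1)
        outputs.append(scores)
    return torch.stack(outputs, dim=1)  # (batch, seq_len, vocab_size)
```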

p9

Training the model

To train this model, run the train.py file with the parsed arguments tailored to your choice. My configuration so far has been something like this:

embed_size = 300  # this is the size of the embedding of a word, 
                  # i.e. exactly how many numbers will represent each word in the vocabulary.
                  # This is done using a look-up table through nn.Embedding 

attention_dim = 300  # this is the size of the full length attention dimension,
                     # i.e. exactly how many pixels are worth attending to. 
                     # The pixels themselves will be learned through training
                     # and this last linear dimension will be softmax-ed 
                     # such as to output probabilities in the forward pass.

decoder_dim = 300  # this is the dimension of the hidden size of the LSTM cell
                   # and it will be the last input of the last fully connected layer
                   # that maps the vectorized words to their scores 

Now, there is no reason to keep all three at the same size, but you can intuitively see that it makes sense to keep them around the same range. You can try larger dimensions, but again keep in mind hardware limitations, as these are held in memory.

The rest of the parsed arguments are:

dropout = 0.5  # the only dropout is at the last fully connected layer in the decoder,
               # the one that outputs the predictions based on the resulting hidden state of the LSTM cell
               
num_epochs = 5  # keep in mind that training an epoch may take several hours on most machines

batch_size = 22  # this one also depends on how many images your GPU can hold at once
                 # I cannot go much higher, so the training will take longer

word_threshold = 6  # the minimum number of occurrences for a word to be included in the vocabulary

vocab_from_file = False  # if this is your first time training / you do not have the pickle file,
                         # then you will have to generate the vocabulary first
                       
save_every = 1  # save every chosen epoch

print_every = 100  # log stats every chosen number of batches

The loss function is CrossEntropyLoss and should not be changed, as this is the only one that makes sense here: captioning is just word-by-word classification over the vocabulary.

The train_transform the images go through before being passed to the encoder is pretty standard, using the ImageNet mean and std values.
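
For reference, a standard ImageNet-style transform of that kind looks roughly like this (the exact resize and crop sizes used in the repo may differ):

```python
import torchvision.transforms as transforms

train_transform = transforms.Compose([
    transforms.Resize(256),                # resize the shorter side
    transforms.RandomCrop(224),            # crop to the encoder's expected input size
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])
```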

Since the input sizes here do not vary it may make sense to set:

torch.backends.cudnn.benchmark = True  # optimize hardware algorithm

p10

Beam Search

In the sample function of the decoder, there is an input parameter called k. It represents the number of captions kept under consideration for future exploration.

Beam search is a staple of machine translation, because you do not always want the next best word: the word that comes after it may not be the overall best for forming a meaningful sentence.

Always looking for the next best is called a greedy search, and you can achieve that by setting k = 1, such as to only hold one hypothesis every time.

Again, keep in mind that, provided you have one, this search will also be transferred to your graphics card, so you may run out of memory if you try to keep track of too many possibilities.

That means you may sometimes be forced to either use a greedy search, or break the sentences before they finish.

I'll leave you with this visual example on how beam search can select two nodes in a graph instead of only one.
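
To picture the difference in code, here is a minimal sketch of the per-step selection (not the actual sample function); greedy search is the special case k = 1:

```python
import torch


def select_next(cumulative_scores: torch.Tensor, k: int):
    """Sketch: pick the k best (hypothesis, word) continuations.

    cumulative_scores: (num_hypotheses, vocab_size) summed log-probabilities so far.
    """
    vocab_size = cumulative_scores.size(1)
    top_scores, top_positions = cumulative_scores.view(-1).topk(k)
    prev_hypothesis = top_positions // vocab_size  # which beam each pick extends
    next_word = top_positions % vocab_size         # which word extends it
    return top_scores, prev_hypothesis, next_word
```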

Here is a comparison of how the model behaves using a beam width of 1 (i.e. greedy search) vs one of 10:

k1

k10

You can definitely see that k=1 achieves a higher FPS rate, but at the cost of accuracy, while the k=10 beam is more accurate, but at a performance cost, as the k possibilities are held on the GPU.

p11

YOLO and the Perspective Expansion

Trying to output a caption for each frame of a video can be painful, even with attention. The model was trained on images from the COCO dataset, which are context rich scenarios, focused mainly on a single event, and thus will perform as such on the testing set.

But "real life" videos are different, each frame is related to the previous one and not all of them have much going on in one place, but rather many things happening at once.

  • For this reason, I use a tiny YOLOv4 model to get an initial object of interest in the frame.
  • A caption is then generated for the region of interest (ROI) bounded by the YOLO generated box
  • If the prediction is far off the truth (no word in the sentence matches the label output by the detector), the algorithm expands the ROI by a given factor until it does, or until a certain number of tries have been made, to avoid infinite loops (see the sketch after this list)
  • Using the newly expanded ROI, the model is able to get more context out of the frame
  • As you can see in the examples, the expansion factor usually finds its comfortable space before reaching a full sized image
  • That means there are significant gains in inference speeds and better predictions
  • Much like in Viola-Jones, this model expands its window, but not when it is correct.
  • Instead, it grows by making obvious mistakes, and in fact relies on them to give its best performance in terms of context understanding.
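
A hedged sketch of that retry loop (generate_caption is a hypothetical stand-in for the encoder + beam search pipeline, and expand_roi is the helper sketched earlier):

```python
def caption_with_expansion(frame, box, label, generate_caption, phi=0.3, max_tries=5):
    """Sketch: expand the YOLO ROI until the caption mentions the detected label."""
    x, y, w, h = box
    frame_h, frame_w = frame.shape[:2]
    caption = ""
    for _ in range(max_tries):
        roi = frame[y:y + h, x:x + w]
        caption = generate_caption(roi)  # encoder + decoder sample (beam search)
        if label in caption.split():
            break                        # the caption agrees with the detector
        # otherwise collect more context and try again
        x, y, w, h = expand_roi(x, y, w, h, phi, frame_w, frame_h)
    return caption
```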

p12

Inference Pipeline

I provided some model pruning functions in the pipeline.py file, both structured and unstructured (global and local), but I use neither and do not recommend them as they are now. You could achieve faster inference by cutting out neurons or connections, but you will also hinder the performance.

I strongly advise against structured pruning (both L1 and L2), as it will just wipe out most of the learned vocabulary, with no speed gains.

Example:

a man <unk> <unk> <unk> a <unk> <unk> <unk> <unk> .
a man <unk> <unk> <unk> a <unk> <unk> <unk> .
a <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> .
a <unk> <unk> <unk> <unk> <unk> <unk> <unk> .

While unstructured (both local and global) pruning is safer:

a man on a motorcycle in the grass .
a motorcycle parked on the side of the road .
a man on a skateboard in a park .
a person on a motorcycle in the woods .

But it is no more performant in terms of speed.

Local pruning works layer by layer, while global pruning prunes across all layers indiscriminately. But for the purpose of this model, they both produce no gain.

Unstructured pruning is always L1, because the weights are sorted one after the other.
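
For reference, the kind of pruning discussed here can be expressed with torch.nn.utils.prune (a sketch, not the exact functions in pipeline.py):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune


def prune_locally(model: nn.Module, amount: float = 0.2) -> None:
    """Sketch: L1 unstructured pruning, applied layer by layer to the linear weights."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")  # make the pruning permanent


def prune_globally(model: nn.Module, amount: float = 0.2) -> None:
    """Sketch: the same sparsity budget, spent across all linear layers at once."""
    parameters = [(m, "weight") for m in model.modules() if isinstance(m, nn.Linear)]
    prune.global_unstructured(parameters, pruning_method=prune.L1Unstructured, amount=amount)
```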

The JIT compiler can be used to increase performance using optimized_execution. However, this does not always result in a smaller model, and it could in fact make the network increase in size.

Neither torch.jit nor ONNX converters can be used on the decoder, because it is very customized: these operations currently require strong tensor typing and are not very permissive towards custom architectures. So I resorted to tracing only the ResNet encoder (which still cannot be run with onnxruntime, because of the custom average pooling layer).
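
Tracing just the encoder might look like this (a sketch, assuming encoder is the beheaded ResNet described earlier):

```python
import torch

encoder.eval()
dummy_input = torch.randn(1, 3, 224, 224)  # one frame-sized example input
with torch.jit.optimized_execution(True):
    traced_encoder = torch.jit.trace(encoder, dummy_input)
traced_encoder.save("encoder_traced.pt")
```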

As you can start to see, there are not really any out of the box solutions for these types of things yet.

The rest of the inference pipeline just loads the state_dicts of each model and runs the data stream through them using a pretty standard test_transform and dealing with the expansion of the ROI.

p13

Running the model

To test the model you can run the run.py file by parsing the needed arguments.

Since the prediction of the net relies on teacher forcing, i.e. using the whole caption for inference regardless of the last generated sequence, the whole vocabulary is needed to test the model, meaning that the vocab.pkl file has to be used, as well as the dataset.

I also cannot provide the encoder here as there are size constraints, but any pretrained resnet will work (do make sure to behead it first if you choose to try this out).

The options for running the model are as follows:

--video  # this is an mp4 video that will be used for inference, I provide one in the video folder
--expand  # this is the expanding ratio of the bounding box ROI after each mistake
--backend  # this is best set to 'cuda', but be wary of memory limitations
--k  # this is the number of nodes (captions) held for future consideration in the beam search
--conf  # this is the confidence threshold for YOLO
--nms  # this is the non-maximum suppression for the YOLO rendered bounding boxes

YOLO inference is done using the dnn module from OpenCV.
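
A minimal sketch of loading a tiny YOLOv4 through that module (the file names, input size and thresholds are assumptions):

```python
import cv2

net = cv2.dnn.readNetFromDarknet("yolov4-tiny.cfg", "yolov4-tiny.weights")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)  # requires OpenCV built with CUDA support
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1 / 255, swapRB=True)

cap = cv2.VideoCapture("video/example.mp4")  # hypothetical path to an mp4 file
ok, frame = cap.read()
if ok:
    class_ids, confidences, boxes = model.detect(frame, confThreshold=0.5, nmsThreshold=0.4)
```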

p14

Hardware and Limitations

My configuration is the following:

I am using:

  • a Turing GeForce GTX 1660 Ti with 6 GB of memory (CUDA arch bin of 7.5)
  • CUDA 11.7
  • cuDNN 8.5 (so that it works with OpenCV 4.5.2)

Be aware that when building OpenCV there will be no errors if you pick incompatible versions. However, unless everything clicks, the net will refuse to run on the GPU.

Using the computation FPS = 1 / inference_time, the model is able to average 5 frames per second.

p15

Future outlook and goals

What I am currently looking into is optimization.

The current model is working, but in a hindered state. With greater embeddings and a richer vocabulary the outputs can potentially be better. Training in larger batches will also finish faster.

For this reason, I am now currently working on Weight Quantization and Knowledge Distillation.

I am also currently looking into deployment tools using ONNX.

These are both not provided off the bat for artificial intelligence models, so there is really no go-to solution. I will keep updating the repository as I make progress.

I am also playing around with the Intel Neural Compute Stick and the OpenVINO API, offloading the inference of some of the networks to keep from running out of CUDA memory.

p16

Some more examples

Notice how in the motorcycle example the ROI expands until it notices there is not just one person, but a group of people riding motorcycles, something object detection by itself is incapable of accomplishing.

Shift In Perspective
p1m p2m p3m

p1

The Big Picture
p1 p2 p3

lambo

Multi Purpose
p1 p2

Context Collector

Based on the original work:

@misc{https://doi.org/10.48550/arxiv.1502.03044,
  doi = {10.48550/ARXIV.1502.03044},
  url = {https://arxiv.org/abs/1502.03044},
  author = {Xu, Kelvin and Ba, Jimmy and Kiros, Ryan and Cho, Kyunghyun and Courville, Aaron and Salakhutdinov, Ruslan and Zemel, Richard and Bengio, Yoshua},
  keywords = {Machine Learning (cs.LG), Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {Show, Attend and Tell: Neural Image Caption Generation with Visual Attention},
  publisher = {arXiv},
  year = {2015},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

and Repo

Bloopers

I think there is a big Ferrari in the middle of this scene, and it should be the center of attention. Not sure though.

blooper