Assignment completed as part of the online CS231n course.
Run the Test-time Sampling section in RNN_Captioning.ipynb. The first line is the generated caption; the second line is the actual caption given in the data.
The results are not very sophisticated, as you can see, but the model picks up vague details.
inb4 LSTM
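For readers wondering what test-time sampling does, here is a rough sketch of greedy decoding with a vanilla tanh RNN. It is not the notebook's code: the function name `greedy_sample`, the parameter dictionary keys, and the shapes are all hypothetical and chosen only for illustration. At each step the previous word is embedded, the hidden state is updated, and the highest-scoring word is picked.

```python
import numpy as np

def greedy_sample(features, params, start_id, end_id, max_length=16):
    """Greedy test-time sampling from a vanilla (tanh) RNN captioner.

    features: (N, D) image features.
    params: dict of weights (hypothetical names, not the assignment's API).
    Returns an (N, max_length) integer array of word ids.
    """
    W_proj, b_proj = params["W_proj"], params["b_proj"]      # (D, H), (H,)
    W_embed = params["W_embed"]                              # (V, W)
    Wx, Wh, b = params["Wx"], params["Wh"], params["b"]      # (W, H), (H, H), (H,)
    W_vocab, b_vocab = params["W_vocab"], params["b_vocab"]  # (H, V), (V,)

    N = features.shape[0]
    h = features @ W_proj + b_proj               # initial hidden state from the image
    word = np.full(N, start_id, dtype=np.int64)  # every caption starts with the start token
    captions = np.full((N, max_length), end_id, dtype=np.int64)  # pad with end token
    done = np.zeros(N, dtype=bool)

    for t in range(max_length):
        x = W_embed[word]                        # embed the previous word: (N, W)
        h = np.tanh(x @ Wx + h @ Wh + b)         # vanilla RNN step
        scores = h @ W_vocab + b_vocab           # scores over the vocabulary: (N, V)
        word = scores.argmax(axis=1)             # greedy choice
        captions[~done, t] = word[~done]         # only write unfinished captions
        done |= word == end_id
        if done.all():
            break
    return captions
```

The returned ids would still need to be mapped back to words with whatever vocabulary the data provides. Picking the argmax at every step is the simplest option; sampling from the softmax or using beam search are alternatives.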
For semantic segmentation (R-CNN, required to obtain candidate image boxes): http://arxiv.org/pdf/1311.2524v5.pdf
Karpathy's paper: http://cs.stanford.edu/people/karpathy/cvpr2015.pdf