### ImageNet as an attention game
2017 overview
https://medium.com/towards-data-science/visual-attention-model-in-deep-learning-708813c2912c
https://github.com/tianyu-tristan/Visual-Attention-Model
Learning to combine foveal glimpses with a third-order Boltzmann machine
https://papers.nips.cc/paper/4089-learning-to-combine-foveal-glimpses-with-a-third-order-boltzmann-machine
Hugo Larochelle
Geoffrey Hinton
Issues:
RBM approach
Very retina-like field of view
Learning where to Attend with Deep Architectures for Image Tracking
https://arxiv.org/abs/1109.3737
Very early work
More like tracking
Some Bayesian elements
Misha Denil
Loris Bazzani
Hugo Larochelle
Nando de Freitas
Issues:
Recurrent Models of Visual Attention (DeepMind)
https://arxiv.org/abs/1406.6247
Volodymyr Mnih *
Nicolas Heess
Alex Graves
Koray Kavukcuoglu
Blog Post:
http://torch.ch/blog/2015/09/21/rmva.html
This shows that each individual 8x8-pixel glimpse of a 28x28 MNIST digit is tricky to classify on its own
Issues:
Glimpse size can be quite large
28x28 regular MNIST sampled with 1 to 7 steps of 8x8 patches
60x60 translated MNIST sampled with 1 to 3 steps of 8x8 patches, but covering up to 8*2*2=32x32
p6 claims that a single 8x8 patch is insufficient to classify MNIST digits
p9 shows that digits are clearly recognisable from one 3-step glimpse patch
On Learning Where To Look
https://arxiv.org/abs/1405.5488
Marc'Aurelio Ranzato (Google)
Issues:
1. train N0 and N1
2. train N2, fixing N0 and N1
3. train N3, fixing N0, N1 and N2
Help from : Hinton (ex-student?)
Multiple Object Recognition with Visual Attention (ICLR 2015, DeepMind)
https://arxiv.org/pdf/1412.7755.pdf
Jimmy Lei Ba
Volodymyr Mnih *
Koray Kavukcuoglu
Blog Post:
https://netsprawl.wordpress.com/2016/07/26/recurrent-attention/
https://github.com/jrbtaylor/visual-attention
"Ultimately, I decided to abandon this track of research.
There may be some applications with extremely large images,
like microscopy, where a hard attention mechanism is necessary for now (until GPU memory can hold the images),
but otherwise the policy learning is so much slower than the convnet that the trade-off never works out." (quoted from the jrbtaylor write-up above)
Issues:
SVHN dataset (multiple digits recognised)
Attention model took ~3 days on 'a' GPU
Help from : Geoffrey Hinton, Nando de Freitas and Chris Summerfield
Spatial Transformer Networks (DeepMind)
Paper:
https://arxiv.org/abs/1506.02025
https://arxiv.org/pdf/1506.02025.pdf
Max Jaderberg
Karen Simonyan
Andrew Zisserman
Koray Kavukcuoglu
Blog posts:
https://kevinzakka.github.io/2017/01/10/stn-part1/
https://kevinzakka.github.io/2017/01/18/stn-part2/
Implementations:
https://github.com/qassemoquab/stnbhwd (torch)
https://github.com/kevinzakka/spatial_transformer_network (TF)
Attend, Infer, Repeat: Fast Scene Understanding with Generative Models
https://arxiv.org/pdf/1603.08575.pdf
S. M. Ali Eslami
Nicolas Heess
Theophane Weber
Yuval Tassa
David Szepesvari
Koray Kavukcuoglu
Geoffrey E. Hinton
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
https://arxiv.org/pdf/1502.03044.pdf
Kelvin Xu
Jimmy Lei Ba (DeepMind?)
Ryan Kiros
Kyunghyun Cho
Aaron Courville *
Ruslan Salakhutdinov *
Richard S. Zemel
Yoshua Bengio *
https://github.com/atulkum/paper_implementation in TensorFlow
Recurrent Models of Visual Attention https://arxiv.org/abs/1406.6247
Multiple Object Recognition with Visual Attention https://arxiv.org/abs/1412.7755
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention https://arxiv.org/abs/1502.03044
Recurrent Attention in Computer Vision
https://netsprawl.wordpress.com/2016/07/26/recurrent-attention/
https://github.com/jrbtaylor/visual-attention (unfinished)
## Implementation:
Let's create an X/Y coordinate grid (2-d)
The 'keys' for the image will be the first 10 of the 24 feature channels, concatenated with the coordinate grid
The 'values' for the image will be the remaining 14 of the 24 feature channels, concatenated with the coordinate grid
So the attention query vector (output) will need to be n_q=12 wide, and will retrieve an n_v=16 wide value
To get the attention weights, take the dot product of the keys and the query (sketched below, after the Gumbel-Softmax links)
* Option 1 : Softmax these.
* Option 2 : apply a temperature parameter to this vector via Gumbel-SoftMax (Google, ICLR 2017)
* Option 3 : Harden the Gumbel-SoftMax outputs for testing (and training?)
https://arxiv.org/abs/1611.01144
Categorical Reparameterization with Gumbel-Softmax
This distribution has the essential property that it can be smoothly annealed into a categorical distribution.
Use this as an 'action' when temperature -> 0
See also :
https://casmls.github.io/general/2017/02/01/GumbelSoftmax.html
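A minimal PyTorch sketch of the key/value construction and the three weighting options above. The (B, 24, H, W) feature-map shape, the [-1, 1] coordinate range, and the function names are my assumptions rather than anything fixed in these notes:

```python
import torch
import torch.nn.functional as F

def make_keys_values(feats):
    # feats: (B, 24, H, W) image features; append an x/y coordinate grid,
    # then split channels 0:10 into keys and 10:24 into values
    B, C, H, W = feats.shape
    ys = torch.linspace(-1, 1, H, device=feats.device).view(1, 1, H, 1).expand(B, 1, H, W)
    xs = torch.linspace(-1, 1, W, device=feats.device).view(1, 1, 1, W).expand(B, 1, H, W)
    coords = torch.cat([xs, ys], dim=1)                # (B, 2, H, W)
    keys = torch.cat([feats[:, :10], coords], dim=1)   # (B, 12, H, W)
    vals = torch.cat([feats[:, 10:], coords], dim=1)   # (B, 16, H, W)
    keys = keys.flatten(2).transpose(1, 2)             # (B, H*W, 12)
    vals = vals.flatten(2).transpose(1, 2)             # (B, H*W, 16)
    return keys, vals

def attend(keys, values, query, mode='softmax', tau=1.0):
    # keys (B, E, 12), values (B, E, 16), query (B, 12) -> value (B, 16)
    logits = torch.einsum('bek,bk->be', keys, query)   # key/query dot products
    if mode == 'softmax':       # Option 1: plain softmax
        w = F.softmax(logits, dim=1)
    elif mode == 'gumbel':      # Option 2: temperature-annealed Gumbel-Softmax
        w = F.gumbel_softmax(logits, tau=tau, dim=1)
    else:                       # Option 3: hardened (straight-through) outputs
        w = F.gumbel_softmax(logits, tau=tau, hard=True, dim=1)
    return torch.einsum('be,bev->bv', w, values)       # weighted value
```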
Since we don't know what 'language' will be constructed for the internal dialogue,
we can't do teacher-forcing. Will have to go the slower route...
Similarly, using a dilated CNN doesn't make sense, since much of its advantage comes from teacher-forcing
So, the input to the LSTM at the bottom will be n_v wide
Initial input should be (say) all-zero
The output stage of the LSTM at the top will be n_q wide
Maximum number of objects to be processed is ~6,
so make the LSTM stage 8 steps long
# 1-layer version
#   1 layer of LSTM: n_v input size, hidden_size = (question_size + answer_size), n_q output size
#   - hidden state initialised with (question + trainable_vector(answer_size))
#   - output 'answer' corresponds to the answer_size portion of the hidden units
# 2-layer version
#   2 layers of LSTM: n_v input size, hidden_size_1 = question_size, hidden_size_2 = answer_size, n_q output size
#   - bottom one initialised with the question - its final output is ignored
#   - top one initialised with zero - and outputs the answer
#   - output 'answer' should be the hidden units of LSTM layer 2
Basic Query IO :
question = 11d
answer = 10d
Two major layers (see the sketch below) :
First RNN (the internal-dialogue stage) :
inputs = 'vs' (16d) from image, hidden0 = question[0:11] + 0s[0:5] = 16d, output = hidden*W = qs (12d) -> softmax -> weighted vs for next step
Second RNN (interpreting the dialogue - two variants) :
inputs = outputs from previous layer, hidden0 = 16d = question[0:11] + 0s[0:5]
inputs = hiddens from previous layer, hidden0 = 0s[0:16] = 16d, final answer = softmax( hidden_final*W ) = 10d
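A minimal PyTorch sketch of the two-RNN design above, using GRU cells and the second 'reader' variant (hiddens as inputs, zero-initialised hidden state); class and variable names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_V, N_Q, Q_DIM, A_DIM, STEPS = 16, 12, 11, 10, 8

class TwoStageRNN(nn.Module):
    def __init__(self):
        super().__init__()
        # First RNN: the 'internal dialogue' over attended values
        self.stream = nn.GRUCell(N_V, N_V)      # input vs (16d), hidden 16d
        self.to_query = nn.Linear(N_V, N_Q)     # hidden -> query qs (12d)
        # Second RNN: interprets the dialogue and emits the answer
        self.reader = nn.GRUCell(N_V, N_V)      # reads the stream hiddens
        self.to_answer = nn.Linear(N_V, A_DIM)

    def forward(self, keys, values, question):
        # keys (B, E, 12), values (B, E, 16), question (B, 11)
        B = question.size(0)
        # hidden0 = question[0:11] + 0s[0:5] = 16d
        h1 = torch.cat([question, question.new_zeros(B, N_V - Q_DIM)], dim=1)
        v_in = values.new_zeros(B, N_V)         # initial input is all-zero
        h1_seq = []
        for _ in range(STEPS):                  # ~6 objects, so 8 steps
            h1 = self.stream(v_in, h1)
            q = self.to_query(h1)               # query for this step (B, 12)
            w = F.softmax(torch.einsum('bek,bk->be', keys, q), dim=1)
            v_in = torch.einsum('be,bev->bv', w, values)   # weighted vs
            h1_seq.append(h1)
        h2 = values.new_zeros(B, N_V)           # variant 2: hidden0 = zeros
        for h1_t in h1_seq:
            h2 = self.reader(h1_t, h2)
        return F.softmax(self.to_answer(h2), dim=1)        # answer (B, 10)
```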
Ideas:
DONE : Add a 'zeroes'/'zeroes' key/value 'entity' to allow for non-attentive states
DONE : Add second RNN layer in initial stream formation
DONE : Add a 'trainable'/'zeroes' key/value 'entity' to allow for non-attentive states
Teach each question type in turn, building up the curriculum
EITHER : Force stream to end in 'STOP' state (look for size of value<eps)
OR : Allow stream to drift on-and-on, whereas the 'evaluator' is incentivised to declare early
Investigate processing
DONE In first batch, find 1 example of each question type
DONE Have a 'debug mode' that saves the internal state(s) during the processing of a batch
DONE Check on the attention mechanism
Look at the stats of individual samples within the batch
Positional information processing:
DONE : Add 1-d convolution layers on top of the image to allow for more 'integration' of the coordinates
LATER : Instead of later in the hierarchy, add the positional information as an extra input layer (so R,G,B,x,y)
Try Recurrent Additive Network units instead of GRU?
https://github.com/kentonl/ran (paper authors' TF version)
https://github.com/bheinzerling/ran/blob/master/ran.py ('not up to date' PyTorch version)
Or QRNN (drop-in compatible, apparently)
https://github.com/salesforce/pytorch-qrnn
Add another set of questions that are 'too tricky' for an RN to compute (turns out to be very difficult...)
NOPE : Are 3 listed colours in clockwise order?
NOPE : Do 3 listed colours form a large triangle?
NOPE : What is the shape of the outlier of these three colours?
DONE : How many other shapes are collinear with these 2?
DONE : How many other shapes are equidistant from these 2?
DONE : How many are below the line joining these 2?
Check the analysis of specific cases on the trained RFS network
Once-only Retrieval : Keep a running total of the softmax inputs (starting at zero), and subtract it from the current query
Create a hard-attention option to use instead of softmax (see the sketch below)
Like 'dropout', use it in a fixed proportion during training, and solely during testing
Problem with using a 'plain' temperature parameter on top of softmax is that the learned weights will simply rescale to compensate
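A sketch of that dropout-style mixing, assuming straight-through Gumbel-Softmax for the hard samples during training; the proportion p_hard is a placeholder:

```python
import torch
import torch.nn.functional as F

def mixed_hard_attention(logits, p_hard=0.25, training=True):
    # logits: (B, n_entities). During training, use hard attention on a fixed
    # proportion p_hard of examples (like dropout); at test time use hard only.
    if not training:
        return F.one_hot(logits.argmax(dim=1), logits.size(1)).float()
    soft = F.softmax(logits, dim=1)
    hard = F.gumbel_softmax(logits, hard=True, dim=1)   # straight-through
    pick = (torch.rand(logits.size(0), 1, device=logits.device) < p_hard).float()
    return pick * hard + (1 - pick) * soft
```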
Highway network on output GRUs (i.e. transform the hidden state a bit more before using it)
DONE No benefit found when added onto the stream GRU hidden states to produce queries
?? Beneficial to add to the end of the answer GRU hidden states
TEST : Add ReLU to answer output before softmax
Instead of the 'answer' being a softmax of the hidden state :
softmax when at the 'stop' state
softmax over a Highway network of the hidden state ( p3 of https://arxiv.org/pdf/1508.06615.pdf ; sketched below )
z = t ⊙ g(W_H·y + b_H) + (1 − t) ⊙ y ( where t = σ(W_T·y + b_T), and b_T initialised ~(-2) )
softmax over an element-wise max of (possibly transformed) hidden states across time
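A sketch of the Highway option above, assuming g = ReLU (the paper leaves g as a generic nonlinearity); layer names are my own:

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    # z = t * g(W_H.y + b_H) + (1 - t) * y, where t = sigmoid(W_T.y + b_T)
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)   # W_H, b_H
        self.gate = nn.Linear(dim, dim)        # W_T, b_T
        self.gate.bias.data.fill_(-2.0)        # b_T ~ -2 biases the gate towards
                                               # carrying y through early on
    def forward(self, y):
        t = torch.sigmoid(self.gate(y))
        return t * torch.relu(self.transform(y)) + (1 - t) * y
```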
Check out "Self-critical Sequence Training for Image Captioning"
https://github.com/ruotianluo/self-critical.pytorch
RewardFn
https://github.com/ruotianluo/self-critical.pytorch/blob/master/misc/rewards.py#L31
model.sample
https://github.com/ruotianluo/self-critical.pytorch/blob/master/models/FCModel.py#L153
torch.multinomial : Just the function we need... it = sampled index at time t, xt = x at time t (see the sketch after the links below)
https://github.com/ruotianluo/self-critical.pytorch/blob/master/models/FCModel.py#L179
similarly :
https://github.com/ruotianluo/self-critical.pytorch/blob/master/models/AttModel.py#L87
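A toy sketch of the torch.multinomial sampling pattern used in the linked model.sample code; shapes here are illustrative, and it/xt mirror the naming in the note above:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 6)                    # (batch, n_entities) scores
probs = F.softmax(logits, dim=1)
it = torch.multinomial(probs, num_samples=1)  # it: sampled index at time t
log_p = torch.log(probs.gather(1, it))        # log-prob, for a REINFORCE loss
xt = F.one_hot(it.squeeze(1), num_classes=6).float()  # xt: hard weights at time t
```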
Other PyTorch example code
Implicit teacher forcing (i.e. not relevant here)
https://github.com/pytorch/tutorials/blob/master/intermediate_source/char_rnn_generation_tutorial.py#L277
**Should read : https://reinforce.io/blog/end-to-end-computation-graphs-for-reinforcement-learning/
**Experiment
Numeric input 0,1,2,3,4, one-hot encoded, maps to the same value as output
Input -> Linear -> SoftMax -> X -> Output
Vary X so that soft vs hard attention can be examined, etc (sketched below)
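A minimal sketch of this toy experiment, with X left as the identity (pure soft attention); swap in a hard sampler at the marked line. Hyperparameters are arbitrary:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

lin = nn.Linear(5, 5)
opt = torch.optim.SGD(lin.parameters(), lr=0.5)
x = torch.eye(5)                     # inputs 0..4, one-hot encoded
target = torch.arange(5)             # output should equal the input
for step in range(500):
    p = F.softmax(lin(x), dim=1)
    out = p                          # X = identity (soft attention); swap in
                                     # e.g. F.gumbel_softmax(lin(x), hard=True, dim=1)
    loss = F.nll_loss(torch.log(out + 1e-9), target)
    opt.zero_grad(); loss.backward(); opt.step()
print(out.argmax(dim=1))             # should recover 0,1,2,3,4
```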
Vision / saccades :
http://bair.berkeley.edu/blog/2017/11/09/learn-to-attend-fovea/
# 760 GTX : 2258 GFLOPS # https://en.wikipedia.org/wiki/GeForce_700_series
# Titan X (Maxwell) : 6144 GFLOPS # https://en.wikipedia.org/wiki/GeForce_900_series
https://nips.cc/Conferences/2017/Schedule?type=Workshop
Hierarchical RL Workshop
https://sites.google.com/view/hrlnips2017/call-for-papers
Learning Disentangled Representations
https://sites.google.com/view/disentanglenips2017
Visually-Grounded Interaction and Language (ViGIL) # Confirmed that there's no need to 'blind' the submission on/by 17-Nov
https://nips2017vigil.github.io/
2017-11-26 : Accepted!
TODO:
Build poster-sized version
Make this repo 'runnable' to enable replication of results
Explore additional domains to run tests on
Make sort-of-clevr generator more 'robust' (margins between results - to match original RN diagrams)
Look at code in RFESH and check whether there were last minute changes
Validate all runs