### ImageNet as an attention game
2017 overview
https://medium.com/towards-data-science/visual-attention-model-in-deep-learning-708813c2912c
https://github.com/tianyu-tristan/Visual-Attention-Model
Learning to combine foveal glimpses with a third-order Boltzmann machine
https://papers.nips.cc/paper/4089-learning-to-combine-foveal-glimpses-with-a-third-order-boltzmann-machine
Hugo Larochelle
Geoffrey Hinton
Issues:
RBM approach
Very retina-like field of view
Learning where to Attend with Deep Architectures for Image Tracking
https://arxiv.org/abs/1109.3737
Very early work
More like tracking
Some Bayesian elements
Misha Denil
Loris Bazzani
Hugo Larochelle
Nando de Freitas
Issues:
Recurrent Models of Visual Attention (DeepMind)
https://arxiv.org/abs/1406.6247
Volodymyr Mnih *
Nicolas Heess
Alex Graves
Koray Kavukcuoglu
Blog Post:
http://torch.ch/blog/2015/09/21/rmva.html
This shows that each individual 8x8-pixel glimpse of a 28x28 MNIST digit is tricky to classify on its own
Issues:
Glimpse size can be quite large
28x28 regular MNIST sampled with 1 to 7 steps of 8x8 patches
60x60 translated MNIST sampled with 1 to 3 steps of 8x8 patches, but covering up to 8*2*2=32x32
p6 claims that a single 8x8 patch is insufficient to classify MNIST digits
p9 shows that digits are clearly recognisable from one 3-step glimpse patch
On Learning Where To Look
https://arxiv.org/abs/1405.5488
Marc'Aurelio Ranzato (Google)
Issues:
1. train N0 and N1
2. train N2, fixing N0 and N1
3. train N3, fixing N0, N1 and N2
Help from : Hinton (ex-student?)
Multiple Object Recognition with Visual Attention (ICLR 2015, DeepMind)
https://arxiv.org/pdf/1412.7755.pdf
Jimmy Lei Ba
Volodymyr Mnih *
Koray Kavukcuoglu
Blog Post:
https://netsprawl.wordpress.com/2016/07/26/recurrent-attention/
https://github.com/jrbtaylor/visual-attention
"Ultimately, I decided to abandon this track of research.
There may be some applications with extremely large images,
like microscopy, where a hard attention mechanism is necessary for now (until GPU memory can hold the images),
but otherwise the policy learning is so much slower than the convnet that the trade-off never works out." (quoted from the jrbtaylor write-up above)
Issues:
SVHN dataset (multiple digits recognised)
Attention model took ~3 days on 'a' GPU
Help from : Geoffrey Hinton, Nando de Freitas and Chris Summerfield
Spatial Transformer Networks (DeepMind)
Paper:
https://arxiv.org/abs/1506.02025
https://arxiv.org/pdf/1506.02025.pdf
Max Jaderberg
Karen Simonyan
Andrew Zisserman
Koray Kavukcuoglu
Blog posts:
https://kevinzakka.github.io/2017/01/10/stn-part1/
https://kevinzakka.github.io/2017/01/18/stn-part2/
Implementations:
https://github.com/qassemoquab/stnbhwd (torch)
https://github.com/kevinzakka/spatial_transformer_network (TF)
Attend, Infer, Repeat: Fast Scene Understanding with Generative Models
https://arxiv.org/pdf/1603.08575.pdf
S. M. Ali Eslami
Nicolas Heess
Theophane Weber
Yuval Tassa
David Szepesvari
Koray Kavukcuoglu
Geoffrey E. Hinton
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
https://arxiv.org/pdf/1502.03044.pdf
Kelvin Xu
Jimmy Lei Ba (DeepMind?)
Ryan Kiros
Kyunghyun Cho
Aaron Courville *
Ruslan Salakhutdinov *
Richard S. Zemel
Yoshua Bengio *
https://github.com/atulkum/paper_implementation in TensorFlow
Recurrent Models of Visual Attention https://arxiv.org/abs/1406.6247
Multiple Object Recognition with Visual Attention https://arxiv.org/abs/1412.7755
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention https://arxiv.org/abs/1502.03044
Recurrent Attention in Computer Vision
https://netsprawl.wordpress.com/2016/07/26/recurrent-attention/
https://github.com/jrbtaylor/visual-attention (unfinished)
## Implementation:
Let's create an X/Y coordinate grid (2-d)
The 'keys' for the image will be the first 10 of the 24 feature channels, concatenated with the coordinate grid
The 'values' for the image will be the remaining 14 of the 24 feature channels, concatenated with the coordinate grid
So the attention query vector (output) will need to be n_q=12 wide, and will retrieve an n_v=16 wide value
To get the attention weights, take the dot product of the keys and the query (sketched below, after the Gumbel-Softmax links)
* Option 1 : Softmax these.
* Option 2 : apply a temperature parameter to this vector via Gumbel-SoftMax (Google, ICLR 2017)
* Option 3 : Harden the Gumbel-SoftMax outputs for testing (and training?)
https://arxiv.org/abs/1611.01144
Categorical Reparameterization with Gumbel-Softmax
This distribution has the essential property that it can be smoothly annealed into a categorical distribution.
Use this as an 'action' when temperature -> 0
See also :
https://casmls.github.io/general/2017/02/01/GumbelSoftmax.html
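A minimal PyTorch sketch of the key/value construction and the three weighting options above. The (B, 24, H, W) feature-map shape, the [-1, 1] coordinate range, and the function names are my assumptions rather than anything fixed in these notes:

```python
import torch
import torch.nn.functional as F

def make_keys_values(feats):
    # feats: (B, 24, H, W) image features; append an x/y coordinate grid,
    # then split channels 0:10 into keys and 10:24 into values
    B, C, H, W = feats.shape
    ys = torch.linspace(-1, 1, H, device=feats.device).view(1, 1, H, 1).expand(B, 1, H, W)
    xs = torch.linspace(-1, 1, W, device=feats.device).view(1, 1, 1, W).expand(B, 1, H, W)
    coords = torch.cat([xs, ys], dim=1)                # (B, 2, H, W)
    keys = torch.cat([feats[:, :10], coords], dim=1)   # (B, 12, H, W)
    vals = torch.cat([feats[:, 10:], coords], dim=1)   # (B, 16, H, W)
    keys = keys.flatten(2).transpose(1, 2)             # (B, H*W, 12)
    vals = vals.flatten(2).transpose(1, 2)             # (B, H*W, 16)
    return keys, vals

def attend(keys, values, query, mode='softmax', tau=1.0):
    # keys (B, E, 12), values (B, E, 16), query (B, 12) -> value (B, 16)
    logits = torch.einsum('bek,bk->be', keys, query)   # key/query dot products
    if mode == 'softmax':       # Option 1: plain softmax
        w = F.softmax(logits, dim=1)
    elif mode == 'gumbel':      # Option 2: temperature-annealed Gumbel-Softmax
        w = F.gumbel_softmax(logits, tau=tau, dim=1)
    else:                       # Option 3: hardened (straight-through) outputs
        w = F.gumbel_softmax(logits, tau=tau, hard=True, dim=1)
    return torch.einsum('be,bev->bv', w, values)       # weighted value
```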
Since we don't know what 'language' will be constructed for the internal dialogue,
we can't do teacher-forcing. Will have to go the slower route...
Similarly, using a dilated CNN doesn't make sense, since much of its advantage comes from teacher-forcing
So, the input to the LSTM at the bottom will be n_v wide
Initial input should be (say) all-zero
The output stage of the LSTM at the top will be n_q wide
Maximum number of objects to be processed is ~6,
so make the LSTM stage 8 steps long
# 1-layer version
#   1 layer of LSTM: n_v input size, hidden_size = (question_size + answer_size), n_q output size
#   - hidden state initialised with (question + trainable_vector(answer_size))
#   - output 'answer' corresponds to the answer_size portion of the hidden units
# 2-layer version
#   2 layers of LSTM: n_v input size, hidden_size_1 = question_size, hidden_size_2 = answer_size, n_q output size
#   - bottom one initialised with the question - its final output is ignored
#   - top one initialised with zero - and outputs the answer
#   - output 'answer' should be the hidden units of LSTM layer 2
Basic Query IO :
question = 11d
answer = 10d
Two major layers (see the sketch below) :
First RNN (the internal-dialogue stage) :
inputs = 'vs' (16d) from image, hidden0 = question[0:11] + 0s[0:5] = 16d, output = hidden*W = qs (12d) -> softmax -> weighted vs for next step
Second RNN (interpreting the dialogue - two variants) :
inputs = outputs from previous layer, hidden0 = 16d = question[0:11] + 0s[0:5]
inputs = hiddens from previous layer, hidden0 = 0s[0:16] = 16d, final answer = softmax( hidden_final*W ) = 10d
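A minimal PyTorch sketch of the two-RNN design above, using GRU cells and the second 'reader' variant (hiddens as inputs, zero-initialised hidden state); class and variable names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_V, N_Q, Q_DIM, A_DIM, STEPS = 16, 12, 11, 10, 8

class TwoStageRNN(nn.Module):
    def __init__(self):
        super().__init__()
        # First RNN: the 'internal dialogue' over attended values
        self.stream = nn.GRUCell(N_V, N_V)      # input vs (16d), hidden 16d
        self.to_query = nn.Linear(N_V, N_Q)     # hidden -> query qs (12d)
        # Second RNN: interprets the dialogue and emits the answer
        self.reader = nn.GRUCell(N_V, N_V)      # reads the stream hiddens
        self.to_answer = nn.Linear(N_V, A_DIM)

    def forward(self, keys, values, question):
        # keys (B, E, 12), values (B, E, 16), question (B, 11)
        B = question.size(0)
        # hidden0 = question[0:11] + 0s[0:5] = 16d
        h1 = torch.cat([question, question.new_zeros(B, N_V - Q_DIM)], dim=1)
        v_in = values.new_zeros(B, N_V)         # initial input is all-zero
        h1_seq = []
        for _ in range(STEPS):                  # ~6 objects, so 8 steps
            h1 = self.stream(v_in, h1)
            q = self.to_query(h1)               # query for this step (B, 12)
            w = F.softmax(torch.einsum('bek,bk->be', keys, q), dim=1)
            v_in = torch.einsum('be,bev->bv', w, values)   # weighted vs
            h1_seq.append(h1)
        h2 = values.new_zeros(B, N_V)           # variant 2: hidden0 = zeros
        for h1_t in h1_seq:
            h2 = self.reader(h1_t, h2)
        return F.softmax(self.to_answer(h2), dim=1)        # answer (B, 10)
```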
Ideas:
DONE : Add a 'zeroes'/'zeroes' key/value 'entity' to allow for non-attentive states
DONE : Add second RNN layer in initial stream formation
DONE : Add a 'trainable'/'zeroes' key/value 'entity' to allow for non-attentive states
Teach each question type in turn, building up the curriculum
EITHER : Force stream to end in 'STOP' state (look for size of value<eps)
OR : Allow stream to drift on-and-on, whereas the 'evaluator' is incentivised to declare early
Investigate processing
DONE In first batch, find 1 example of each question type
DONE Have a 'debug mode' that saves the internal state(s) during the processing of a batch
DONE Check on the attention mechanism
Look at the stats of individual samples within the batch
Positional information processing:
DONE : Add 1-d convolution layers on top of the image to allow for more 'integration' of the coordinates
LATER : Instead of later in the hierarchy, add the positional information as an extra input layer (so R,G,B,x,y)
Try Recurrent Additive Network units instead of GRU?
https://github.com/kentonl/ran (paper authors' TF version)
https://github.com/bheinzerling/ran/blob/master/ran.py ('not up to date' PyTorch version)
Or QRNN (drop-in compatible, apparently)
https://github.com/salesforce/pytorch-qrnn
Add another set of questions that are 'too tricky' for an RN to compute (turns out to be very difficult...)
NOPE : Are 3 listed colours in clockwise order?
NOPE : Do 3 listed colours form a large triangle?
NOPE : What is the shape of the outlier of these three colours?
DONE : How many other shapes are collinear with these 2?
DONE : How many other shapes are equidistant from these 2?
DONE : How many are below the line joining these 2?
Check the analysis of specific cases on the trained RFS network
Once-only Retrieval : Keep a running total of the softmax inputs (starting at zero), and subtract it from the current query
Create a hard-attention option to use instead of softmax (see the sketch below)
Like 'dropout', use it in a fixed proportion during training, and solely during testing
Problem with using a 'plain' temperature parameter on top of softmax is that the learned weights will simply rescale to compensate
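A sketch of that dropout-style mixing, assuming straight-through Gumbel-Softmax for the hard samples during training; the proportion p_hard is a placeholder:

```python
import torch
import torch.nn.functional as F

def mixed_hard_attention(logits, p_hard=0.25, training=True):
    # logits: (B, n_entities). During training, use hard attention on a fixed
    # proportion p_hard of examples (like dropout); at test time use hard only.
    if not training:
        return F.one_hot(logits.argmax(dim=1), logits.size(1)).float()
    soft = F.softmax(logits, dim=1)
    hard = F.gumbel_softmax(logits, hard=True, dim=1)   # straight-through
    pick = (torch.rand(logits.size(0), 1, device=logits.device) < p_hard).float()
    return pick * hard + (1 - pick) * soft
```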
Highway network on output GRUs (i.e. transform the hidden state a bit more before using it)
DONE No benefit found when added onto the stream GRU hidden states to produce queries
?? Beneficial to add to the end of the answer GRU hidden states
TEST : Add ReLU to answer output before softmax
Instead of the 'answer' being a softmax of the hidden state :
softmax when at the 'stop' state
softmax over a Highway network of the hidden state ( p3 of https://arxiv.org/pdf/1508.06615.pdf ; sketched below )
z = t ⊙ g(W_H·y + b_H) + (1 − t) ⊙ y ( where t = σ(W_T·y + b_T), and b_T initialised ~(-2) )
softmax over an element-wise max of (possibly transformed) hidden states across time
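A sketch of the Highway option above, assuming g = ReLU (the paper leaves g as a generic nonlinearity); layer names are my own:

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    # z = t * g(W_H.y + b_H) + (1 - t) * y, where t = sigmoid(W_T.y + b_T)
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)   # W_H, b_H
        self.gate = nn.Linear(dim, dim)        # W_T, b_T
        self.gate.bias.data.fill_(-2.0)        # b_T ~ -2 biases the gate towards
                                               # carrying y through early on
    def forward(self, y):
        t = torch.sigmoid(self.gate(y))
        return t * torch.relu(self.transform(y)) + (1 - t) * y
```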
Check out "Self-critical Sequence Training for Image Captioning"
https://github.com/ruotianluo/self-critical.pytorch
RewardFn
https://github.com/ruotianluo/self-critical.pytorch/blob/master/misc/rewards.py#L31
model.sample
https://github.com/ruotianluo/self-critical.pytorch/blob/master/models/FCModel.py#L153
torch.multinomial : Just the function we need... it = sampled index at time t, xt = x at time t (see the sketch after the links below)
https://github.com/ruotianluo/self-critical.pytorch/blob/master/models/FCModel.py#L179
similarly :
https://github.com/ruotianluo/self-critical.pytorch/blob/master/models/AttModel.py#L87
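A toy sketch of the torch.multinomial sampling pattern used in the linked model.sample code; shapes here are illustrative, and it/xt mirror the naming in the note above:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 6)                    # (batch, n_entities) scores
probs = F.softmax(logits, dim=1)
it = torch.multinomial(probs, num_samples=1)  # it: sampled index at time t
log_p = torch.log(probs.gather(1, it))        # log-prob, for a REINFORCE loss
xt = F.one_hot(it.squeeze(1), num_classes=6).float()  # xt: hard weights at time t
```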
Other PyTorch example code
Implicit teacher forcing (i.e. not relevant here)
https://github.com/pytorch/tutorials/blob/master/intermediate_source/char_rnn_generation_tutorial.py#L277
**Should read : https://reinforce.io/blog/end-to-end-computation-graphs-for-reinforcement-learning/
**Experiment
Numeric input 0,1,2,3,4, one-hot encoded, maps to the same value as output
Input -> Linear -> SoftMax -> X -> Output
Vary X so that soft vs hard attention can be examined, etc (sketched below)
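A minimal sketch of this toy experiment, with X left as the identity (pure soft attention); swap in a hard sampler at the marked line. Hyperparameters are arbitrary:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

lin = nn.Linear(5, 5)
opt = torch.optim.SGD(lin.parameters(), lr=0.5)
x = torch.eye(5)                     # inputs 0..4, one-hot encoded
target = torch.arange(5)             # output should equal the input
for step in range(500):
    p = F.softmax(lin(x), dim=1)
    out = p                          # X = identity (soft attention); swap in
                                     # e.g. F.gumbel_softmax(lin(x), hard=True, dim=1)
    loss = F.nll_loss(torch.log(out + 1e-9), target)
    opt.zero_grad(); loss.backward(); opt.step()
print(out.argmax(dim=1))             # should recover 0,1,2,3,4
```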
Vision / saccades :
http://bair.berkeley.edu/blog/2017/11/09/learn-to-attend-fovea/
# 760 GTX : 2258 GFLOPS # https://en.wikipedia.org/wiki/GeForce_700_series
# Titan X (Maxwell) : 6144 GFLOPS # https://en.wikipedia.org/wiki/GeForce_900_series
https://nips.cc/Conferences/2017/Schedule?type=Workshop
Hierarchical RL Workshop
https://sites.google.com/view/hrlnips2017/call-for-papers
Learning Disentangled Representations
https://sites.google.com/view/disentanglenips2017
Visually-Grounded Interaction and Language (ViGIL) # Confirmed that there's no need to 'blind' the submission on/by 17-Nov
https://nips2017vigil.github.io/
2017-11-26 : Accepted!
TODO:
Build poster-sized version
Make this repo 'runnable' to enable replication of results
Explore additional domains to run tests on
Make sort-of-clevr generator more 'robust' (margins between results - to match original RN diagrams)
Look at code in RFESH and check whether there were last minute changes
Validate all runs