Test whether it is possible to make multiple VAE’s each learn to recognize a different object.
For example, let us consider the MNIST dataset restricted to only the digits 0 and 1. We would like to have 2 VAE’s, with one of them learning to represent the digit 0 and the other learning to represent the digit 1.
`tf.function-decorated function tried to create variables on non-first call.`
Do a pixel-wise entropy function (i.e., for each pixel, treat the VAE confidences as a probability distribution and compute its entropy).
Then take the sum S of all pixel entropies, and add Alpha * log(S) to the loss function, where Alpha is a hyperparameter.
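A minimal sketch of this entropy term, assuming the per-pixel confidences are already softmaxed so they sum to 1 across VAE’s (`entropy_loss` and its arguments are illustrative names, not the actual training code):

```python
import numpy as np

def entropy_loss(confidences, alpha=0.005, eps=1e-8):
    """Pixel-wise entropy penalty over the VAE confidences.

    confidences: array of shape (num_vaes, H, W); for each pixel the
    values across the first axis form a probability distribution.
    S is the sum over pixels of the per-pixel entropies, and the term
    added to the loss is alpha * log(S).
    """
    p = np.clip(confidences, eps, 1.0)
    per_pixel_entropy = -(p * np.log(p)).sum(axis=0)  # shape (H, W)
    S = per_pixel_entropy.sum()
    return alpha * np.log(S + eps)
```

Confident (near one-hot) confidences drive S toward 0 and the term toward large negative values, so minimizing it rewards low per-pixel entropy.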
Is this ok? A: If you decrease the entropy loss coefficient, the confidences start becoming more descriptive.
This seems to be unavoidable, as encoding the digit itself takes only 1 bit.
Q: Learn two digits at the same time? A: This does not work, because weak models just split the work of drawing a 0.
This should have the effect that it does not even have the chance to learn what a 0 is. A: This is not that useful, because residuals impose a sequence-dependence on the order of learning.
! python3 train.py --name colab --beta 1 --gamma 0.005 --epochs 40 80 --latent_dim 8 --nlayers 3
VAE_1 seems to ‘barely’ learn anything about zeroes, in that it draws them in a much uglier way than the 1’s.
Furthermore, the confidences of VAE_1 for the zeroes are very very low (almost black).
This might mean two things:
- The entropy loss is a little bit too high, and so VAE_1 is forced to learn about zeroes only to ensure that there is not too much entropy loss incurred.
- VAE_1 has too much available entropy, and decides to spend some of it on the wrong digit.
When gamma is too small, VAE_0 has very high confidences where there is a 0, as well as where there is nothing.
Accordingly, VAE_1 either predicts 1’s where they actually exist, or it puts a very low-confidence, very generic 0 everywhere else.
! python3 train.py --name colab --beta 1 --gamma 0.0002 --epochs 40 80 --latent_dim 8 --nlayers 3
I.e., not apply any training at all.
Also plot how it fares when dealing with pictures of 0 and 1, to see what happens.
Maybe VAE_0 is not properly frozen.
These are all done with ReLU before BN, with no FC.
Beta | Gamma | Good/(Good+Bad) | Obs. |
---|---|---|---|
1.0 | 0.0005 | 1/2 | In bad, VAE_0 dominates. |
2.0 | 0.001 | 2/5 | In bad, VAE_0 dominates. |
2.0 | 0.002 | 1/4 | |
2.0 | 0.005 | 3/6 | In bad, VAE_1 collapses. |
1.0 | 0.0001 | 0/2 | All white. |
1.0 | 0.001 | 0/2 | |
1.0 | 0.005 | 0/2 | |
2.0 | 0.0001 | 0/2 | All white. |
2.0 | 0.0005 | 0/2 | VAE_0 too confident. |
ReLU before BN, with FC (but no activation):
Beta | Gamma | Good/(Good+Bad) | Obs. |
---|---|---|---|
2 | 0.001 | 3/4 | When bad, VAE_0 dominates. |
2 | 0.002 | 0/4 | |
2 | 0.005 | 3/4 | |
ReLU / SeLU | Act. before/after BN | FC at end | Works? |
---|---|---|---|
ReLU | Before | No | 5 Yes, 0 No |
ReLU | Before | Yes | 3 Yes, 1 No |
SeLU | No BN | Yes | Yes |
SeLU | No BN | No | Yes |
ReLU | After | No | No |
ReLU | After | Yes | No |
ReLU | No BN | ? | No |
As a result, there is a single architecture which seems most likely to work: ReLU, act. before BN, no FC at the end.
Occasionally, VAE_1 will not learn anything. As soon as it starts training, its KL-loss becomes 0 and stays 0. This may be because the KL loss for VAE_0 will be fixed and cannot change, and hence maybe not much is left over for VAE_1.
One issue: depending on beta, maybe VAE_0 “gobbles up” all of the available information. This way, when VAE_1 starts learning, it cannot learn anything, because doing so would incur a pretty hefty KL-loss penalty.
See this paper https://arxiv.org/pdf/1808.04947.pdf for possible solutions.
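Independently of that paper, one standard mitigation for a KL that collapses to 0 and stays there is a “free bits” floor on the per-dimension KL (Kingma et al., 2016). A minimal sketch (names are illustrative):

```python
def free_bits_kl(kl_per_dim, lam=0.5):
    # Clamp each latent dimension's KL at a floor of lam nats, so the
    # optimizer gains nothing from pushing a collapsed VAE's KL below
    # the floor -- the gradient toward total collapse disappears.
    return sum(max(kl, lam) for kl in kl_per_dim)
```

With such a floor, VAE_0 cannot starve VAE_1 of the entire KL budget, since dimensions below the floor are not rewarded further.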
It seems that when training multiple VAE’s, eventually we run into the problem of vanishing gradients. Possible solutions: different activations?
Since it happens randomly even with only 2 VAE’s, that should hopefully be solved before we start doing anything else.
This way, all VAE’s will produce confidences within the same ballpark, so there is no more overpowering by the early VAE’s that get a chance to push their confidences really high.
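One way to keep the confidences in the same ballpark is to softmax the raw confidence logits across the VAE axis, so they always form a per-pixel distribution no matter how high any single VAE pushes its outputs (a sketch with illustrative names, not the actual model code):

```python
import numpy as np

def normalize_confidences(logits):
    # logits: (num_vaes, H, W) raw confidence scores.
    # Subtract the per-pixel max for numerical stability, then
    # softmax across the VAE axis so confidences sum to 1 per pixel.
    z = logits - logits.max(axis=0, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)
```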
(defun run-experiment-with-params (root-dir block-name)
  ;; Expose the chosen run directory to the org-babel blocks below,
  ;; then execute the named sampling block in place.
  (setq root_dir root-dir)
  (save-excursion
    (goto-char (org-babel-find-named-block block-name))
    (org-babel-execute-src-block-maybe)))
(setq digits "33")
(setq epoch "latest")
# org-babel block: `table` and `base_dir` come in via :var;
# the top-level `return` hands the table back as the block's value.
for row in table[1:]:
    if row is None:
        continue
    beta, gamma, run = row[0], row[1], row[2]
    # Build an elisp link that re-runs the sampling block for this run.
    root_dir = f'../_save/{base_dir}/beta={beta}_gamma={gamma}/run-{run}'
    row[3] = f'[[elisp:(run-experiment-with-params "{root_dir}" "sample-experiment")][click]]'
return table
Beta | Gamma | Run | Link |
---|---|---|---|
0.5 | 0.007 | 1 | click |
0.5 | 0.01 | 1 | click |
0.5 | 0.02 | 1 | click |
0.7 | 0.007 | 1 | click |
0.7 | 0.01 | 1 | click |
0.7 | 0.02 | 1 | click |
echo "digits = ${digits}"
echo "epoch = ${epoch}"
echo "${root_dir}"
python3 sample.py --name leonhard --digits "${digits}" --root-dir "${root_dir}" --num-examples 4 --epoch "${epoch}"
Beta | Gamma |
---|---|
0.7 | 0.005 |
Beta | Gamma |
---|---|
0.9 | 0.005 |
In this case, this happens for VAE-3.
Isn’t what we are doing just a “forced” version of disentanglement? Since a fully disentangled model will have one component of the latent variable which controls the object type, our model just seems to produce results where the object-type latent variable is forced to be disentangled, via the separation into different VAE’s. In other words, since our purpose is to have one VAE learn one object, this is the same as splitting one fully-disentangled VAE model into N different VAE’s, where each one has the object-type latent variable fixed to one of the objects.
It seems that most approaches which perform well on MNIST do not actually generalize well to other datasets. With that in mind, it might be better to first transition to a more realistic dataset, and only then try to achieve “true” supervision.
Maybe this actually does not really matter.
In the 4-VAE setup, the last VAE does not seem to learn anything: VAE_2 learns the digit 3 as well as the digit 2. All of them seem to learn quite well with digits set to 00, though. Try this with digits set to 33.
Beta | Gamma | Run | Link |
---|---|---|---|
0.5 | 0.005 | 1 | click |
0.5 | 0.005 | 2 | click |
0.7 | 0.005 | 1 | click |
0.7 | 0.005 | 2 | click |
0.7 | 0.007 | 1 | click |
0.7 | 0.007 | 2 | click |
0.7 | 0.01 | 1 | click |
0.5 | 0.007 | 1 | click |
0.5 | 0.01 | 1 | click |
VAE’s seem to perform well, but only on certain runs (not all). It is interesting to note that they perform well with digits set to 00.
Beta | Gamma | Run | Link |
---|---|---|---|
0.7 | 0.01 | 2 | click |
0.5 | 0.007 | 2 | click |
0.5 | 0.01 | 2 | click |
During training, if sampled at earlier epochs, the models do not seem to behave as expected. After training 0 and 1 together, we would expect VAE_1 to no longer react to the digit 0. However, it seems that, for small values of beta (i.e., up to around 1.1), it still encodes information about the digit 0. Something strange happens at epoch 1000, though: somehow all of them (except VAE_0) learn to no longer encode the digit 0.
This phenomenon happens with the digit 0, though. For 1, they seem to correctly learn to not output anything.
For 2, a similar problem as with 0 occurs (but only with some models).
Epoch | Digits trained |
---|---|
160 | 0 |
270 | 1 |
380 | 0 + 1 |
520 | 2 |
660 | 0 + 1 + 2 |
830 | 3 |
1000 | 0 + 1 + 2 + 3 |
Beta | Gamma | Run | Link | Comments |
---|---|---|---|---|
1.2 | 0.007 | 1 | click | 2 is meh-ok. |
1.2 | 0.007 | 2 | click | 2 is ok. |
1.2 | 0.009 | 1 | click | 2 is meh. |
1.2 | 0.009 | 3 | click | 2 is ok. |
1.2 | 0.01 | 1 | click | 2 is ok. |
1.2 | 0.01 | 2 | click | 2 is ok. |
1.2 | 0.01 | 3 | click | 2 is meh. |
Maybe use the ARI metric from the IODINE paper.
Seems to be used everywhere.
Should be the case in the CLEVR dataset.
Their hypothesis is that it is easier for a model to process a scene if there are repeating patterns.
For example, if the same type of object appears multiple times, then it should be easier to model all of them at once.
Maybe we could try the same thing, by forcing the softmaxed masks of the VAE’s to take certain values.
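A sketch of what this “forcing” could look like: a cross-entropy penalty pulling the softmaxed masks toward given target assignments (the function name and weight are hypothetical, not existing code):

```python
import numpy as np

def mask_forcing_loss(masks, targets, weight=1.0, eps=1e-8):
    # masks:   (num_vaes, H, W) softmaxed masks (sum to 1 over VAEs).
    # targets: (num_vaes, H, W) desired one-hot assignment per pixel.
    # Mean per-pixel cross-entropy between target and predicted masks.
    ce = -(targets * np.log(masks + eps)).sum(axis=0)
    return weight * ce.mean()
```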
One possible reason why models maybe tend to take over is that the scenes are so simple, that a single model is capable of representing everything, and so there is no “incentive” to share the load.
Why might this be? Does the model have problems in other scenarios?
Epoch | Digits trained |
---|---|
160 | 0 |
265 | 1 |
370 | 0 + 1 |
530 | 2 |
690 | 0 + 1 + 2 |
935 | 3 |
1180 | 0 + 1 + 2 + 3 |
Beta | Gamma | Run | Link | Obs. |
---|---|---|---|---|
0.5 | 0.009 | 2 | click | split |
0.5 | 0.011 | 1 | click | split |
0.5 | 0.013 | 3 | click | split |
0.5 | 0.015 | 2 | click | split |
0.7 | 0.009 | 1 | click | takeover |
0.7 | 0.009 | 2 | click | split |
0.7 | 0.009 | 3 | click | split |
0.9 | 0.011 | 1 | click | equal split |
0.9 | 0.011 | 2 | click | equal split |
0.9 | 0.011 | 3 | click | takeover |
0.9 | 0.013 | 1 | click | equal split |
0.9 | 0.013 | 2 | click | takeover |
0.9 | 0.013 | 3 | click | bad split |
The loss does indeed go down over time, but it seems that it is optimal for model 2 to learn the digit 3, even if model 3 has a head start.
Even over 0’s it seems to wake up a little bit, albeit with not very high confidence.
When beta is 0.5, it seems that almost always the digit 3 is split. However, for example at beta = 0.9, sometimes they split and sometimes model 2 takes over. When they split, the loss is about 23. When they don’t, it is about 25, which seems right.
One potential issue is that when model 3 trains only by itself, its confidence at the end is not always very high. On the other hand, when model 2 trains by itself on digit 2, its confidence is always very very high.
Maybe when model 3 finishes training by itself, it did not yet have a chance to get really confident about the digit 3, whereas model 2 may already be confident from before and thus have more gradients flowing to it. As a result, it trains faster than model 3, and thus learns the 3’s as well.
One issue with this hypothesis: why is it always model 2 that takes over, and not one of the others? Is it because the digit 2 is the closest one to a 3, or because it is the one trained right beforehand?
In the worst case, it seems that they split the digit, instead of only one of them learning it.
The difference from 2 is that when model 2 finishes training, it is already very confident in its own digits. As such, when they all train together, it is only natural for it to take over. On the other hand, it seems that after model 3 finishes training, it is still “meh” in regards to confidence. As such, since it seems to split the confidence with model 2 right from the start, both of them train together (in the good case). In the bad case, model 2 just takes over completely.
Maybe the issue is that digit 2 looks more similar to digit 3, and so model 2 already has some know-how about 3’s. One interesting idea may be to randomize the digits that we use when training. For example, model 0 may learn digit 9, model 1 may learn 5, and so on. In this way we check whether it is the digits’ similarity, or whether this always happens.
in order to encourage it to learn one single thing, and learn it well. However, since VAE_0 is not learning anything anymore, maybe we should also decrease the KL-loss weight.
a similar KL loss, by using the Beta-VAE paper trick.
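A sketch of that trick (the capacity-annealing schedule from the beta-VAE follow-up by Burgess et al.; the constants here are illustrative, not tuned values):

```python
def kl_capacity_term(kl, step, weight=1.0, c_max=25.0, anneal_steps=100_000):
    # Linearly raise the target capacity C from 0 to c_max over
    # anneal_steps, and penalize the distance of the KL from C, so
    # every VAE is steered toward a similar KL budget.
    c = min(c_max, c_max * step / anneal_steps)
    return weight * abs(kl - c)
```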
be very confident in your prediction. Another idea in a similar fashion: if you output an image but you have low confidence, don’t even bother.
It does not seem to perform that well; maybe there is a bug in the implementation.
It gives much better results than normal deconv.
My intuition is that the VAE’s should all be trained in parallel. In the 3D env, maybe a cylinder is so similar to a box, that the same VAE will model both.