Test whether it is possible to make multiple VAE’s each learn to recognize a different object.
For example, let us consider the MNIST dataset restricted to only the digits 0 and 1. We would like to have 2 VAE’s, with one of them learning to represent the digit 0 and the other learning to represent the digit 1.
`tf.function-decorated function tried to create variables on non-first call.`
Do a pixel-wise entropy function (i.e., for each pixel, treat the VAE confidences as a probability distribution and compute its entropy).
Then take the sum S of all pixel entropies, and add Alpha * log(S) to the loss function, where Alpha is a hyperparameter.
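A minimal sketch of this entropy term, assuming the per-pixel confidences are already softmaxed so they sum to 1 across VAE’s (`entropy_loss` and its arguments are illustrative names, not the actual training code):

```python
import numpy as np

def entropy_loss(confidences, alpha=0.005, eps=1e-8):
    """Pixel-wise entropy penalty over the VAE confidences.

    confidences: array of shape (num_vaes, H, W); for each pixel the
    values across the first axis form a probability distribution.
    S is the sum over pixels of the per-pixel entropies, and the term
    added to the loss is alpha * log(S).
    """
    p = np.clip(confidences, eps, 1.0)
    per_pixel_entropy = -(p * np.log(p)).sum(axis=0)  # shape (H, W)
    S = per_pixel_entropy.sum()
    return alpha * np.log(S + eps)
```

Confident (near one-hot) confidences drive S toward 0 and the term toward large negative values, so minimizing it rewards low per-pixel entropy.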
Is this ok? A: If you decrease the entropy loss coefficient, the confidences start becoming more descriptive.
This seems to be unavoidable, as encoding the digit itself takes only 1 bit.
Q: Learn two digits at the same time? A: This does not work, because weak models just split the work of drawing a 0.
This should have the effect that it does not even have the chance to learn what a 0 is. A: This is not that useful, because residuals impose a sequence-dependence on the order of learning.
! python3 train.py --name colab --beta 1 --gamma 0.005 --epochs 40 80 --latent_dim 8 --nlayers 3
VAE_1 seems to ‘barely’ learn anything about zeroes, in that it draws them in a much uglier way than the 1’s.
Furthermore, the confidences of VAE_1 for the zeroes are very very low (almost black).
This might mean two things:
- The entropy loss is a little bit too high, and so VAE_1 is forced to learn about zeroes only to ensure that there is not too much entropy loss incurred.
- VAE_1 has too much available entropy, and decides to spend some of it on the wrong digit.
When gamma is too small, VAE_0 has very high confidences where there is a 0, as well as where there is nothing.
Accordingly, VAE_1 either predicts 1’s where they actually exist, or it puts a very low-confidence, very generic 0 everywhere else.
! python3 train.py --name colab --beta 1 --gamma 0.0002 --epochs 40 80 --latent_dim 8 --nlayers 3
I.e., not apply any training at all.
Also plot how it fares when dealing with pictures of 0 and 1, to see what happens.
Maybe VAE_0 is not properly frozen.
These are all done with ReLU before BN, with no FC.
Beta | Gamma | Good/(Good+Bad) | Obs. |
---|---|---|---|
1.0 | 0.0005 | 1/2 | In bad, VAE_0 dominates. |
2.0 | 0.001 | 2/5 | In bad, VAE_0 dominates. |
2.0 | 0.002 | 1/4 | |
2.0 | 0.005 | 3/6 | In bad, VAE_1 collapses. |
1.0 | 0.0001 | 0/2 | All white. |
1.0 | 0.001 | 0/2 | |
1.0 | 0.005 | 0/2 | |
2.0 | 0.0001 | 0/2 | All white. |
2.0 | 0.0005 | 0/2 | VAE_0 too confident. |
ReLU before BN, with FC (but no activation):
Beta | Gamma | Good/(Good+Bad) | Obs. |
---|---|---|---|
2 | 0.001 | 3/4 | When bad, VAE_0 dominates. |
2 | 0.002 | 0/4 | |
2 | 0.005 | 3/4 | |
ReLU / SeLU | Act. before/after BN | FC at end | Works? |
---|---|---|---|
ReLU | Before | No | 5 Yes, 0 No |
ReLU | Before | Yes | 3 Yes, 1 No |
SeLU | No BN | Yes | Yes |
SeLU | No BN | No | Yes |
ReLU | After | No | No |
ReLU | After | Yes | No |
ReLU | No BN | ? | No |
As a result, there is a single architecture which seems most likely to work: ReLU, act. before BN, no FC at the end.
Occasionally, VAE_1 will not learn anything. As soon as it starts training, its KL-loss becomes 0 and stays 0. This may be because the KL loss for VAE_0 will be fixed and cannot change, and hence maybe not much is left over for VAE_1.
One issue: depending on beta, maybe VAE_0 “gobbles up” all of the available information. This way, when VAE_1 starts learning, it cannot learn anything, because doing so would incur a pretty hefty KL-loss penalty.
See this paper https://arxiv.org/pdf/1808.04947.pdf for possible solutions.
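Independently of that paper, one standard mitigation for a KL that collapses to 0 and stays there is a “free bits” floor on the per-dimension KL (Kingma et al., 2016). A minimal sketch (names are illustrative):

```python
def free_bits_kl(kl_per_dim, lam=0.5):
    # Clamp each latent dimension's KL at a floor of lam nats, so the
    # optimizer gains nothing from pushing a collapsed VAE's KL below
    # the floor -- the gradient toward total collapse disappears.
    return sum(max(kl, lam) for kl in kl_per_dim)
```

With such a floor, VAE_0 cannot starve VAE_1 of the entire KL budget, since dimensions below the floor are not rewarded further.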
It seems that when training multiple VAE’s, eventually we run into the problem of vanishing gradients. Possible solutions: different activations?
Since it happens randomly even with only 2 VAE’s, that should hopefully be solved before we start doing anything else.
This way, all VAE’s will produce confidences within the same ballpark, so there is no more overpowering by the early VAE’s that get a chance to push their confidences really high.
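One way to keep the confidences in the same ballpark is to softmax the raw confidence logits across the VAE axis, so they always form a per-pixel distribution no matter how high any single VAE pushes its outputs (a sketch with illustrative names, not the actual model code):

```python
import numpy as np

def normalize_confidences(logits):
    # logits: (num_vaes, H, W) raw confidence scores.
    # Subtract the per-pixel max for numerical stability, then
    # softmax across the VAE axis so confidences sum to 1 per pixel.
    z = logits - logits.max(axis=0, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)
```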
(defun run-experiment-with-params (root-dir block-name)
  ;; Expose the chosen run directory to the org-babel blocks below,
  ;; then execute the named sampling block in place.
  (setq root_dir root-dir)
  (save-excursion
    (goto-char (org-babel-find-named-block block-name))
    (org-babel-execute-src-block-maybe)))
(setq digits "33")
(setq epoch "latest")
# org-babel block: `table` and `base_dir` come in via :var;
# the top-level `return` hands the table back as the block's value.
for row in table[1:]:
    if row is None:
        continue
    beta, gamma, run = row[0], row[1], row[2]
    # Build an elisp link that re-runs the sampling block for this run.
    root_dir = f'../_save/{base_dir}/beta={beta}_gamma={gamma}/run-{run}'
    row[3] = f'[[elisp:(run-experiment-with-params "{root_dir}" "sample-experiment")][click]]'
return table
Beta | Gamma | Run | Link |
---|---|---|---|
0.5 | 0.007 | 1 | click |
0.5 | 0.01 | 1 | click |
0.5 | 0.02 | 1 | click |
0.7 | 0.007 | 1 | click |
0.7 | 0.01 | 1 | click |
0.7 | 0.02 | 1 | click |
echo "digits = ${digits}"
echo "epoch = ${epoch}"
echo "${root_dir}"
python3 sample.py --name leonhard --digits "${digits}" --root-dir "${root_dir}" --num-examples 4 --epoch "${epoch}"
Beta | Gamma |
---|---|
0.7 | 0.005 |
Beta | Gamma |
---|---|
0.9 | 0.005 |
In this case, this happens for VAE-3.
Isn’t what we are doing just a “forced” version of disentanglement? Since a fully disentangled model will have one component of the latent variable which controls the object type, our model just seems to produce results where the object-type latent variable is forced to be disentangled, via the separation into different VAE’s. In other words, since our purpose is to have one VAE learn one object, this is the same as splitting one fully-disentangled VAE model into N different VAE’s, where each one has the object-type latent variable fixed to one of the objects.
It seems that most approaches which perform well on MNIST do not actually generalize well to other datasets. With that in mind, it might be better to first transition to a more realistic dataset, and only then try to achieve “true” supervision.
Maybe this actually does not really matter.
In the 4-VAE setup, the last VAE does not seem to learn anything: VAE_2 learns the digit 3 as well as the digit 2. All of them seem to learn quite well with digits set to 00, though. Try this with digits set to 33.
Beta | Gamma | Run | Link |
---|---|---|---|
0.5 | 0.005 | 1 | click |
0.5 | 0.005 | 2 | click |
0.7 | 0.005 | 1 | click |
0.7 | 0.005 | 2 | click |
0.7 | 0.007 | 1 | click |
0.7 | 0.007 | 2 | click |
0.7 | 0.01 | 1 | click |
0.5 | 0.007 | 1 | click |
0.5 | 0.01 | 1 | click |
VAE’s seem to perform well, but only on certain runs (not all). It is interesting to note that they perform well with digits set to 00.
Beta | Gamma | Run | Link |
---|---|---|---|
0.7 | 0.01 | 2 | click |
0.5 | 0.007 | 2 | click |
0.5 | 0.01 | 2 | click |
During training, if sampled at earlier epochs, the models do not seem to behave as expected. After training 0 and 1 together, we would expect VAE_1 to no longer react to the digit 0. However, it seems that, for small values of beta (i.e., up to around 1.1), it still encodes information about the digit 0. Something strange happens at epoch 1000, though: somehow all of them (except VAE_0) learn to no longer encode the digit 0.
This phenomenon happens with the digit 0, though. For 1, they seem to correctly learn to not output anything.
For 2, a similar problem as with 0 occurs (but only with some models).
Epoch | Digits trained |
---|---|
160 | 0 |
270 | 1 |
380 | 0 + 1 |
520 | 2 |
660 | 0 + 1 + 2 |
830 | 3 |
1000 | 0 + 1 + 2 + 3 |
Beta | Gamma | Run | Link | Comments |
---|---|---|---|---|
1.2 | 0.007 | 1 | click | 2 is meh-ok. |
1.2 | 0.007 | 2 | click | 2 is ok. |
1.2 | 0.009 | 1 | click | 2 is meh. |
1.2 | 0.009 | 3 | click | 2 is ok. |
1.2 | 0.01 | 1 | click | 2 is ok. |
1.2 | 0.01 | 2 | click | 2 is ok. |
1.2 | 0.01 | 3 | click | 2 is meh. |
Maybe use the ARI metric from the IODINE paper.
Seems to be used everywhere.
Should be the case in the CLEVR dataset.
Their hypothesis is that it is easier for a model to process a scene if there are repeating patterns.
For example, if the same type of object appears multiple times, then it should be easier to model all of them at once.
Maybe we could try the same thing, by forcing the softmaxed masks of the VAE’s to take certain values.
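A sketch of what this “forcing” could look like: a cross-entropy penalty pulling the softmaxed masks toward given target assignments (the function name and weight are hypothetical, not existing code):

```python
import numpy as np

def mask_forcing_loss(masks, targets, weight=1.0, eps=1e-8):
    # masks:   (num_vaes, H, W) softmaxed masks (sum to 1 over VAEs).
    # targets: (num_vaes, H, W) desired one-hot assignment per pixel.
    # Mean per-pixel cross-entropy between target and predicted masks.
    ce = -(targets * np.log(masks + eps)).sum(axis=0)
    return weight * ce.mean()
```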
One possible reason why models maybe tend to take over is that the scenes are so simple, that a single model is capable of representing everything, and so there is no “incentive” to share the load.
Why might this be? Does the model have problems in other scenarios?
Epoch | Digits trained |
---|---|
160 | 0 |
265 | 1 |
370 | 0 + 1 |
530 | 2 |
690 | 0 + 1 + 2 |
935 | 3 |
1180 | 0 + 1 + 2 + 3 |
Beta | Gamma | Run | Link | Obs. |
---|---|---|---|---|
0.5 | 0.009 | 2 | click | split |
0.5 | 0.011 | 1 | click | split |
0.5 | 0.013 | 3 | click | split |
0.5 | 0.015 | 2 | click | split |
0.7 | 0.009 | 1 | click | takeover |
0.7 | 0.009 | 2 | click | split |
0.7 | 0.009 | 3 | click | split |
0.9 | 0.011 | 1 | click | equal split |
0.9 | 0.011 | 2 | click | equal split |
0.9 | 0.011 | 3 | click | takeover |
0.9 | 0.013 | 1 | click | equal split |
0.9 | 0.013 | 2 | click | takeover |
0.9 | 0.013 | 3 | click | bad split |
The loss does indeed go down over time, but it seems that it is optimal for model 2 to learn the digit 3, even if model 3 has a head start.
Even over 0’s it seems to wake up a little bit, albeit with not very high confidence.
When beta is 0.5, it seems that almost always the digit 3 is split. However, for example at beta = 0.9, sometimes they split and sometimes model 2 takes over. When they split, the loss is about 23. When they don’t, it is about 25, which seems right.
One potential issue is that when model 3 trains only by itself, its confidence at the end is not always very high. On the other hand, when model 2 trains by itself on digit 2, its confidence is always very very high.
Maybe when model 3 finishes training by itself, it did not yet have a chance to get really confident about the digit 3, whereas model 2 may already be confident from before and thus have more gradients flowing to it. As a result, it trains faster than model 3, and thus learns the 3’s as well.
One issue with this hypothesis: why is it always model 2 that takes over, and not one of the others? Is it because the digit 2 is the closest one to a 3, or because it is the one trained right beforehand?
In the worst case, it seems that they split the digit, instead of only one of them learning it.
The difference from 2 is that when model 2 finishes training, it is already very confident in its own digits. As such, when they all train together, it is only natural for it to take over. On the other hand, it seems that after model 3 finishes training, it is still “meh” in regards to confidence. As such, since it seems to split the confidence with model 2 right from the start, both of them train together (in the good case). In the bad case, model 2 just takes over completely.
Maybe the issue is that digit 2 looks more similar to digit 3, and so model 2 already has some know-how about 3’s. One interesting idea may be to randomize the digits that we use when training. For example, model 0 may learn digit 9, model 1 may learn 5, and so on. In this way we check whether it is the digits’ similarity, or whether this always happens.
in order to encourage it to learn one single thing, and learn it well. However, since VAE_0 is not learning anything anymore, maybe we should also decrease the KL-loss weight.
a similar KL loss, by using the Beta-VAE paper trick.
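A sketch of that trick (the capacity-annealing schedule from the beta-VAE follow-up by Burgess et al.; the constants here are illustrative, not tuned values):

```python
def kl_capacity_term(kl, step, weight=1.0, c_max=25.0, anneal_steps=100_000):
    # Linearly raise the target capacity C from 0 to c_max over
    # anneal_steps, and penalize the distance of the KL from C, so
    # every VAE is steered toward a similar KL budget.
    c = min(c_max, c_max * step / anneal_steps)
    return weight * abs(kl - c)
```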
be very confident in your prediction. Another idea in a similar fashion: if you output an image but you have low confidence, don’t even bother.
It does not seem to perform that well; maybe there is a bug in the implementation.
It gives much better results than normal deconv.
My intuition is that the VAE’s should all be trained in parallel. In the 3D env, maybe a cylinder is so similar to a box, that the same VAE will model both.