It is hard to write a short summary of such a great work as the Wasserstein GAN paper! Two good references are http://www.alexirpan.com/2017/02/22/wasserstein-gan.html and https://vincentherrmann.github.io/blog/wasserstein/. The paper has won widespread praise for its novelty and beautiful results. For me there are four take-home messages:

  1. It shows that when a high-dimensional space is mapped onto a lower-dimensional one, traditional metrics such as the KL-divergence and the JS-divergence are simply not applicable (see the paper's (albeit contrived) example, restated below the list, for why). I am particularly interested in this, perhaps because I used to love the KL-divergence very much (even after reading http://www.inference.vc/how-to-train-your-generative-models-why-generative-adversarial-networks-work-so-well-2/). Now we know for sure that these divergences do not fit the problem, at least from the perspective of optimization.
  2. It shows that the Earth-Mover distance (a.k.a. the Wasserstein distance) is a much better match. Recall that a probability distribution is defined by how much mass it puts on each point. The distance between two distributions P_original and P_new is the minimal effort needed to rearrange the mass assigned by P_original over the points of the space so that it matches the mass assigned by P_new. I had a bit of difficulty understanding the distance at first, but (like everybody) I was very pleased with its intuition. I won't re-explain it here; a good explanation can be found at https://www.cph-ai-lab.com/wasserstein-gan-wgan, and a tiny numerical example is sketched below the list.
  3. The Earth-Mover distance is very nice, but intractable to compute. Their ingenious approach is to rely on the Kantorovich-Rubinstein duality to reformulate its computation (see https://vincentherrmann.github.io/blog/wasserstein/ for an explanation of the duality; the dual form is written out below the list). Instead of searching over all ways of transporting mass between the two distributions, we search over real-valued functions f on the space and compare the expectation of f under one distribution with its expectation under the other. The functions are constrained: they must be 1-Lipschitz, i.e., their slope never exceeds 1 in absolute value (|f(x) - f(y)| <= ||x - y||). Computing the distance then amounts to finding a function f that maximizes this new, different objective. The duality does not change the difficulty of the problem, of course, but it makes approximation much easier. Assume a parametrized family of functions f_w, where w are the weights: maximizing the new objective over w gives a lower bound that can be close to the true distance, and that maximization is a pretty routine thing to do with neural networks (i.e., backpropagation). Interestingly, the functions do not have to be 1-Lipschitz; K-Lipschitz for some constant K also works, since it only scales the estimated distance by K (the paper explains this very well), although K should be neither too large nor too small. To enforce the constraint they clip the weights w to the small range [-0.01, 0.01] (an ad hoc and not well-understood solution, however).
  4. The training algorithm does not change much from the original GAN's training algorithm (it basically gets rid of the "unnecessary" logarithms and sigmoids), but we gain a lot from that change: stable training. Moreover, the new loss correlates well with image quality, which is pretty amazing. Too good to be true, right? And that is not even all of their contributions: they also prove (Theorems 1 and 2) that everything happens for a reason! A minimal sketch of the training loop follows the list.
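
To make item 1 concrete, here is the paper's contrived example (Example 1 there), restated roughly from memory: Z is uniform on [0,1], P_0 is the distribution of the point (0, Z) and P_theta the distribution of (theta, Z), so the two distributions live on parallel vertical segments in the plane.

```latex
% Z ~ U[0,1]; P_0 is the law of (0, Z); P_\theta is the law of (\theta, Z).
% The two supports are disjoint whenever \theta \neq 0.
\begin{align*}
W(P_0, P_\theta)  &= |\theta|, \\
JS(P_0, P_\theta) &= \begin{cases} \log 2 & \theta \neq 0 \\ 0 & \theta = 0 \end{cases}, \\
KL(P_\theta \,\|\, P_0) &= \begin{cases} +\infty & \theta \neq 0 \\ 0 & \theta = 0 \end{cases}
\end{align*}
% Only W is continuous in \theta, so only W provides a usable training signal.
```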
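
For item 2, a tiny numerical illustration of the "moving mass" picture (this is not code from the paper; it just uses SciPy's 1-D Wasserstein distance on toy distributions I made up):

```python
# Earth-Mover distance on tiny 1-D discrete distributions, via SciPy.
from scipy.stats import wasserstein_distance

# P_original puts all its mass at 0; P_new puts all its mass at 3.
# Moving one unit of mass a distance of 3 costs exactly 3.
print(wasserstein_distance([0.0], [3.0]))  # -> 3.0

# Two three-point distributions: positions and how much mass sits on each.
p_positions, p_weights = [0.0, 1.0, 2.0], [0.5, 0.25, 0.25]
q_positions, q_weights = [1.0, 2.0, 3.0], [0.25, 0.25, 0.5]
# Optimal plan: the 0.5 mass at 0 travels a distance of 3, so the cost is 1.5.
print(wasserstein_distance(p_positions, q_positions, p_weights, q_weights))  # -> 1.5
```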
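
For item 3, the dual form from the paper and the parametrized objective that the critic actually maximizes:

```latex
% Kantorovich-Rubinstein duality: the sup runs over all 1-Lipschitz functions f.
W(P_r, P_\theta) = \sup_{\|f\|_L \le 1} \;
  \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_\theta}[f(x)]

% Restricting f to a parametrized family {f_w} (the critic) gives the lower bound
% that is maximized by backpropagation; g_\theta is the generator.
\max_{w \in \mathcal{W}} \;
  \mathbb{E}_{x \sim P_r}[f_w(x)] - \mathbb{E}_{z \sim p(z)}[f_w(g_\theta(z))]
```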
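
And for item 4, a minimal PyTorch sketch of the training loop (Algorithm 1 in the paper). The MLP architectures, dimensions, and data sampler below are placeholders I made up; the WGAN-specific parts are the log/sigmoid-free losses, RMSProp, n_critic critic steps per generator step, and clipping the critic weights to [-0.01, 0.01].

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 64, 784              # placeholder sizes
clip_value, n_critic, lr = 0.01, 5, 5e-5    # defaults used in the paper

# Placeholder networks; the paper uses DCGAN-style architectures for images.
critic = nn.Sequential(nn.Linear(data_dim, 256), nn.ReLU(), nn.Linear(256, 1))
generator = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, data_dim))

opt_c = torch.optim.RMSprop(critic.parameters(), lr=lr)
opt_g = torch.optim.RMSprop(generator.parameters(), lr=lr)

def sample_real_batch(batch_size=64):
    # Placeholder: replace with a real data loader.
    return torch.randn(batch_size, data_dim)

for step in range(10_000):
    # --- Train the critic n_critic times ---
    for _ in range(n_critic):
        real = sample_real_batch()
        z = torch.randn(real.size(0), latent_dim)
        fake = generator(z).detach()
        # Critic maximizes E[f_w(real)] - E[f_w(fake)]; we minimize the negative.
        loss_c = -(critic(real).mean() - critic(fake).mean())
        opt_c.zero_grad(); loss_c.backward(); opt_c.step()
        # Weight clipping to (crudely) keep the critic K-Lipschitz.
        for p in critic.parameters():
            p.data.clamp_(-clip_value, clip_value)

    # --- Train the generator once ---
    z = torch.randn(64, latent_dim)
    loss_g = -critic(generator(z)).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```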

What a paper! There is no doubt that this is one of the best papers I have read recently!