
GAN Architectures

  • DCGAN
  • wDCGAN-Weight Clipping
  • wDCGAN-Gradient Penalty

Deep Convolutional Generative Adversarial Networks (DCGANs)

The idea behind GANs is to train two networks jointly:

  1. A generator G that maps a latent variable Z, drawn from a simple fixed distribution, to the desired "real" data distribution, and
  2. a discriminator D that classifies data points as "real" or "fake" (i.e. produced by G).

The approach is adversarial since the two networks have antagonistic objectives.
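
Concretely, this is the standard minimax game from the original GAN formulation (Goodfellow et al., 2014): D tries to maximize the objective below while G tries to minimize it.

```math
\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[\log D(x)\right] \; + \; \mathbb{E}_{z \sim p_Z}\left[\log\left(1 - D(G(z))\right)\right]
```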

GANs have been known to be unstable to train, often resulting in generators that produce nonsensical outputs. There has been very limited published research in trying to understand and visualize what GANs learn, and the intermediate representations of multi-layer GANs.

The following excerpt from the paper makes this quite evident:

"We also encountered difficulties attempting to scale GANs using CNN architectures commonly used in the supervised literature. However, after extensive model exploration we identified a family of architectures that resulted in stable training across a range of datasets and allowed for training higher resolution and deeper generative models."

The DCGAN paper (Radford et al., 2015) proposes and evaluates a set of constraints on the architectural topology of convolutional GANs that make them stable to train in most settings.

Proposed Architectural Settings

  1. Replace any pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator); the latter are often mislabelled as "deconvolutions".
  2. Use batchnorm in both the generator and the discriminator (stabilizes gradient flow across layers).
  3. Remove fully connected hidden layers for deeper architectures.
  4. Use ReLU activation in the generator for all layers except the output, which uses Tanh (a bounded output activation helps the model learn to saturate and cover the sample space more quickly).
  5. Use LeakyReLU (alpha = 0.2) activation in the discriminator for all layers (to avoid vanishing gradients).
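
As a concrete illustration (a minimal PyTorch sketch, not the exact code in this repository; the latent size nz = 100 and the feature-map widths are assumptions), a 64 x 64 generator/discriminator pair following these guidelines could look like:

```python
import torch.nn as nn

nz = 100   # latent dimension (assumption)
ngf = 64   # generator feature maps (assumption)
ndf = 64   # discriminator feature maps (assumption)
nc = 3     # image channels (3 for CelebA, 1 for MNIST)

# Generator: fractional-strided (transposed) convolutions, batchnorm, ReLU, Tanh output.
# Note: no batchnorm on the output layer.
generator = nn.Sequential(
    nn.ConvTranspose2d(nz, ngf * 8, 4, 1, 0, bias=False),       # 1x1  -> 4x4
    nn.BatchNorm2d(ngf * 8), nn.ReLU(True),
    nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),  # 4x4  -> 8x8
    nn.BatchNorm2d(ngf * 4), nn.ReLU(True),
    nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),  # 8x8  -> 16x16
    nn.BatchNorm2d(ngf * 2), nn.ReLU(True),
    nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),      # 16x16 -> 32x32
    nn.BatchNorm2d(ngf), nn.ReLU(True),
    nn.ConvTranspose2d(ngf, nc, 4, 2, 1, bias=False),           # 32x32 -> 64x64
    nn.Tanh(),
)

# Discriminator: strided convolutions, batchnorm, LeakyReLU(0.2), no fully connected hidden layers.
# Note: no batchnorm on the input layer.
discriminator = nn.Sequential(
    nn.Conv2d(nc, ndf, 4, 2, 1, bias=False),                    # 64x64 -> 32x32
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False),                # 32x32 -> 16x16
    nn.BatchNorm2d(ndf * 2), nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),            # 16x16 -> 8x8
    nn.BatchNorm2d(ndf * 4), nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False),            # 8x8  -> 4x4
    nn.BatchNorm2d(ndf * 8), nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf * 8, 1, 4, 1, 0, bias=False),                  # 4x4  -> 1x1 real/fake score
    nn.Sigmoid(),
)
```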

Training Details

  1. Images are rescaled to 64 x 64 before being fed to the network.
  2. The Adam optimizer was used with a mini-batch size of 128.
  3. All weights were initialized from a zero-centered Normal distribution with standard deviation 0.02 (it still works fairly well without this tweak).
  4. Learning rate = 0.0002, overriding the default 0.001.
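
A short PyTorch sketch of these settings, reusing the generator/discriminator modules from the sketch above (beta1 = 0.5 is the momentum value used in the DCGAN paper; the BatchNorm scale initialized around 1.0 is a common companion tweak, not stated in the list):

```python
import torch.nn as nn
import torch.optim as optim

def weights_init(m):
    """Zero-centered Normal(0, 0.02) initialization, as in the DCGAN paper."""
    classname = m.__class__.__name__
    if classname.find('Conv') != -1:
        nn.init.normal_(m.weight.data, 0.0, 0.02)
    elif classname.find('BatchNorm') != -1:
        nn.init.normal_(m.weight.data, 1.0, 0.02)   # scale centered at 1 (common practice)
        nn.init.constant_(m.bias.data, 0.0)

generator.apply(weights_init)
discriminator.apply(weights_init)

# Adam with lr = 0.0002 (instead of the default 0.001); the mini-batch size of 128 is set in the dataloader.
opt_g = optim.Adam(generator.parameters(), lr=0.0002, betas=(0.5, 0.999))
opt_d = optim.Adam(discriminator.parameters(), lr=0.0002, betas=(0.5, 0.999))
```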


Training Loss v/s Epochs:

[training loss plot]

I will run a few more epochs to check for further improvements in generator performance.

Results:

  1. With a uniform distribution Z kept constant across epochs (the same digits appear in each block position throughout training).
  2. With a uniform distribution Z resampled every epoch (different digits appear in the block in different epochs).
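
The "constant Z" grids are typically produced by sampling one noise batch up front and reusing it at every epoch. A small sketch (nz, num_epochs and the uniform-in-[-1, 1] scaling are illustrative; generator is the module sketched earlier):

```python
import torch

nz = 100                                        # latent dimension (assumption)
num_epochs = 25                                 # illustrative
fixed_z = torch.rand(64, nz, 1, 1) * 2 - 1      # one fixed batch of uniform noise in [-1, 1]

for epoch in range(num_epochs):
    # ... discriminator / generator updates go here ...
    with torch.no_grad():
        grid_same = generator(fixed_z)                            # same Z every epoch -> directly comparable grids
        grid_new = generator(torch.rand(64, nz, 1, 1) * 2 - 1)    # fresh Z -> different digits every epoch
```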

Wasserstein GAN (WGAN)

1. Weight Clipping (Originally proposed)

The traditional GAN loss is based on the Jensen-Shannon divergence, which takes little account of the metric structure of the underlying space. An alternative choice is the "earth mover's distance" (Wasserstein-1 distance), which intuitively is the minimum mass displacement needed to transform one distribution into the other.
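
Written out, the earth mover's distance between the real distribution P_r and the generated distribution P_g is (standard definition from the WGAN paper, with Π(P_r, P_g) the set of joint distributions whose marginals are P_r and P_g):

```math
W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \; \mathbb{E}_{(x, y) \sim \gamma}\big[\, \lVert x - y \rVert \,\big]
```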

WGANs cure the main training problems of GANs. In particular, training WGANs does not require maintaining a careful balance in training of the discriminator and the generator, and does not require a careful design of the network architecture either. One of the most compelling practical benefits of WGANs is the ability to continuously estimate the EM (Wasserstein) distance by training the discriminator to optimality.

Two benefits observed when using the Wasserstein distance for training:

  • Greater stability of the learning process; no "mode collapse" was observed.
  • Greater interpretability of the loss, which is a better indicator of the quality of the samples.

The following excerpt from the paper points out one of weight clipping's major drawbacks:

"If the clipping parameter is large, then it can take a long time for any weights to reach their limit, thereby making it harder to train the critic till optimality. If the clipping is small, this can easily lead to vanishing gradients when the number of layers is big, or batch normalization is not used"

Training Details

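In code, weight clipping amounts to clamping every critic parameter into [-c, c] after each critic update. A minimal sketch of one training iteration, assuming a PyTorch setup (critic, generator, data_iter and nz are placeholders; RMSprop with lr = 5e-5, c = 0.01 and n_critic = 5 are the defaults from the WGAN paper):

```python
import torch
import torch.optim as optim

# Assumptions: `critic` and `generator` are nn.Module instances (e.g. the DCGAN-style
# networks sketched above, with the critic's final Sigmoid removed); nz is the latent size.
clip_value = 0.01   # clipping parameter c (WGAN paper default)
n_critic = 5        # critic updates per generator update (WGAN paper default)

opt_c = optim.RMSprop(critic.parameters(), lr=5e-5)
opt_g = optim.RMSprop(generator.parameters(), lr=5e-5)

for _ in range(n_critic):
    real, _ = next(data_iter)                   # hypothetical iterator over real image batches
    z = torch.randn(real.size(0), nz, 1, 1)     # latent noise (distribution choice is illustrative)
    # Critic maximizes E[f(real)] - E[f(fake)], i.e. minimizes the negative.
    loss_c = -(critic(real).mean() - critic(generator(z).detach()).mean())
    opt_c.zero_grad()
    loss_c.backward()
    opt_c.step()
    # Weight clipping: crudely enforce the Lipschitz constraint on the critic.
    for p in critic.parameters():
        p.data.clamp_(-clip_value, clip_value)

# Generator update: maximize E[f(fake)].
z = torch.randn(real.size(0), nz, 1, 1)
loss_g = -critic(generator(z)).mean()
opt_g.zero_grad()
loss_g.backward()
opt_g.step()
```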

MNIST

Training Loss v/s Epochs:

[training loss plot]

Results:

  1. With a uniform distribution Z kept constant across epochs (the same digits appear in each block position throughout training).
  2. With a uniform distribution Z resampled every epoch (different digits appear in the block in different epochs).

CelebA

Training Loss v/s Epochs:

[training loss plot]

Epochs 20-30 seem like the saturation point for this experiment. (Stay Tuned :P)

Results:

  1. With a uniform distribution Z kept constant across epochs (the same faces appear in each block position throughout training).
  2. With a uniform distribution Z resampled every epoch (different faces appear in the block in different epochs).

References: