
GestureGAN

A project where I train a generative adversarial network (GAN), more specifically a deep convolutional GAN (DCGAN), to generate images of certain hand gestures.

Data Collection

A total of 1500 image samples were collected, according to the following format:

  • 300 photos were taken at each of 5 locations.
  • In each location, 50 samples were allocated to each gesture, with the remaining 100 being control samples (no gesture).
  • Of the control samples, 25 had a face, 25 had a hand, 25 had neither, and 25 had both.
  • For each gesture, 25 samples were recorded per hand in various positions, with approximately half (12-13) of each 25 including a face and the other half including no face.
  • The gestures recorded are: the middle finger, the OK sign, the thumbs up, and the peace sign (two fingers up).

A few sample images are shown below:

[First Sample Image] [Second Sample Image] [Third Sample Image]

Photos were taken using an iPhone 13 Pro in the HEIC format, at a resolution of 2316x3088 pixels. When transferred to a Windows PC, their resolution changed: some images became 756x1008 and some 579x772.

Issues around the data collection

The largest issue is that a single person (myself) took all of the photos; the model has only learned to recognize gestures from people with my hand size, structure, and skin color. For example, if someone with darker skin and a larger hand than mine performed gestures, the model may struggle to recognize them. I predict a more complex architecture than the one ultimately chosen would be needed to effectively capture these differences.

Another issue is the size of the dataset: only 1500 samples were collected, and even after preprocessing (described in the next section), only 12000 samples are available. This is far fewer than the datasets used to train state-of-the-art models (such as ImageNet), which contain millions of samples.

Finally, there is the size of the images: the photos are far larger than those in popular image datasets (MNIST, for instance, uses 28x28 images [1]).

Preprocessing

Preprocessing the data involves a few steps: we first convert the HEIC images to PNG, then, to ensure all images are the same size, bring them to 64x64, which also matches the image size in the paper we reference.
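
A minimal sketch of this pipeline, assuming the pillow-heif package and hypothetical directory names; the padding step mentioned above is approximated here by a plain resize:

```python
from pathlib import Path

from PIL import Image
from pillow_heif import register_heif_opener  # assumed dependency: pillow-heif

# Let Pillow open .heic files directly.
register_heif_opener()

RAW_DIR = Path("data/raw")        # hypothetical location of the HEIC photos
OUT_DIR = Path("data/processed")  # hypothetical output directory
OUT_DIR.mkdir(parents=True, exist_ok=True)

for heic_path in RAW_DIR.glob("*.heic"):
    image = Image.open(heic_path).convert("RGB")
    # Bring every image to a uniform 64x64, the size used in the referenced paper.
    image = image.resize((64, 64))
    image.save(OUT_DIR / f"{heic_path.stem}.png")
```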

Generator and Discriminator Architecture

We will be using the generator and discriminator architectures from the paper "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks" [2].

The architecture is visually shown below:

[DCGAN Architecture]

We use a latent vector size of 100, and a feature map size of 64 for both the generator and the discriminator.
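
A sketch of what those networks look like in PyTorch for 64x64 images, modeled on the standard DCGAN design (latent size nz = 100, feature map size 64); treat it as an illustration rather than the repo's exact code:

```python
import torch.nn as nn

nz, ngf, ndf, nc = 100, 64, 64, 3  # latent size, feature maps (G/D), color channels

generator = nn.Sequential(
    # Project the latent vector to a 4x4 feature map, then upsample to 64x64.
    nn.ConvTranspose2d(nz, ngf * 8, 4, 1, 0, bias=False), nn.BatchNorm2d(ngf * 8), nn.ReLU(True),
    nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False), nn.BatchNorm2d(ngf * 4), nn.ReLU(True),
    nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False), nn.BatchNorm2d(ngf * 2), nn.ReLU(True),
    nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False), nn.BatchNorm2d(ngf), nn.ReLU(True),
    nn.ConvTranspose2d(ngf, nc, 4, 2, 1, bias=False), nn.Tanh(),  # tanh output layer
)

discriminator = nn.Sequential(
    # Downsample 64x64 -> 4x4, then squash to a single real/fake probability.
    nn.Conv2d(nc, ndf, 4, 2, 1, bias=False), nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False), nn.BatchNorm2d(ndf * 2), nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False), nn.BatchNorm2d(ndf * 4), nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False), nn.BatchNorm2d(ndf * 8), nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf * 8, 1, 4, 1, 0, bias=False), nn.Sigmoid(),
)
```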

Optimizer and Loss

For both the generator and discriminator, we use binary cross entropy (BCE) loss. Both also use the Adam optimizer.
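
In PyTorch, this setup is a few lines, reusing the networks from the previous sketch (the learning rate and betas here are the DCGAN paper's defaults; the final values actually used are listed under Final Results):

```python
import torch.nn as nn
import torch.optim as optim

# BCE loss shared by both networks: D maximizes it on real vs. fake batches,
# G minimizes it on (flipped-label) fake batches.
criterion = nn.BCELoss()

optimizer_d = optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))
optimizer_g = optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
```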

Best Practices Followed [3]

  • Using a DCGAN for image data.

  • Training on real and fake image batches separately.

  • Generating latent vectors from a Gaussian rather than a uniform distribution.

  • Using the tanh activation function on the generator's last layer.

  • Flipping image labels when training the generator (i.e. real = fake, fake = real).

  • Using LeakyReLU instead of ReLU to avoid sparse gradients.

  • One-sided label smoothing for positive labels, reducing the discriminator's vulnerability to adversarial examples; in our case, we use 0.9 as the positive label. A sketch of a few of these practices follows this list.
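
A minimal sketch of how a few of these practices look in PyTorch (variable names are illustrative; the actual training code may differ):

```python
import torch

nz = 100                           # latent vector size
real_label, fake_label = 0.9, 0.0  # one-sided label smoothing: 0.9 instead of 1.0

# Latent vectors drawn from a Gaussian, not a uniform, distribution.
noise = torch.randn(32, nz, 1, 1)

# When training the generator, its fakes are labeled "real" (flipped labels),
# so minimizing BCE pushes the discriminator's output on fakes toward 0.9.
flipped_labels = torch.full((32,), real_label)
```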

We intentionally use unlabeled data to observe what kind of gestures the generator will create if left to its own devices. Later, though, we will experiment with using labeled data (i.e. making an auxiliary classifier GAN).

Training

At each training iteration, we update the discriminator once and the generator once (equivalent to setting k = 1, as described in [4]).
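
A sketch of one such iteration, assuming the networks, loss, optimizers, and labels from the earlier sketches, plus a `dataloader` of real image batches and a `num_epochs` value (all names are illustrative):

```python
import torch

for epoch in range(num_epochs):
    for real_images in dataloader:
        b = real_images.size(0)

        # --- Discriminator step: real and fake batches trained separately. ---
        discriminator.zero_grad()
        loss_d_real = criterion(discriminator(real_images).view(-1),
                                torch.full((b,), real_label))
        loss_d_real.backward()

        noise = torch.randn(b, nz, 1, 1)
        fake_images = generator(noise)
        loss_d_fake = criterion(discriminator(fake_images.detach()).view(-1),
                                torch.full((b,), fake_label))
        loss_d_fake.backward()
        optimizer_d.step()

        # --- Generator step: flipped labels (fakes labeled as real). ---
        generator.zero_grad()
        loss_g = criterion(discriminator(fake_images).view(-1),
                           torch.full((b,), real_label))
        loss_g.backward()
        optimizer_g.step()
```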

Evaluation

Once the generator is trained, we generate 64 latent vectors and feed them into the generator to observe what kinds of images it produces.
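
A sketch of this sampling step, reusing the `generator` and `nz` from the earlier sketches (the grid-saving call via torchvision is an assumption about tooling):

```python
import torch
from torchvision.utils import save_image

# Sample 64 Gaussian latent vectors and inspect the generator's output.
with torch.no_grad():
    noise = torch.randn(64, nz, 1, 1)
    samples = generator(noise)  # shape (64, 3, 64, 64), values in [-1, 1] from tanh

# Save an 8x8 grid of the samples, rescaled back to displayable range.
save_image(samples, "samples.png", nrow=8, normalize=True)
```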

See the Trials.md file for a list of all trials.

Final Results

The final model was trained with a learning rate of 0.001 for the discriminator and 0.002 for the generator, a batch size of 250, 1500 epochs, a beta1 value of 0.4, and a beta2 value of 0.7.
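
Plugged into the Adam setup sketched earlier, these final values would look like:

```python
batch_size, num_epochs = 250, 1500

# Final run: different learning rates for D and G, shared Adam betas.
optimizer_d = optim.Adam(discriminator.parameters(), lr=0.001, betas=(0.4, 0.7))
optimizer_g = optim.Adam(generator.parameters(), lr=0.002, betas=(0.4, 0.7))
```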

Here are some sample images generated by the generator:

[First Sample] [Second Sample] [Third Sample] [Fourth Sample] [Fifth Sample]

We will now need to evaluate how well the model replicated the data distribution. This will be done using the Fréchet inception distance (FID), as it is the current standard for evaluating generative image models [5].
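
One way to compute this is with the torchmetrics implementation of FID; this is an assumption about tooling, not necessarily how the repo computes it:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# FID compares Inception-v3 feature statistics of real vs. generated images.
fid = FrechetInceptionDistance(feature=2048, normalize=True)

# real_batch / fake_batch: float tensors of shape (N, 3, H, W) scaled to [0, 1]
# (hypothetical tensors holding the dataset images and generator samples).
fid.update(real_batch, real=True)
fid.update(fake_batch, real=False)
print(fid.compute())
```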

Using a random seed of 1689558219.2381685, 2500 images were generated using Gaussian latent vectors. This yielded an FID of 208.65930591943527. This is absolutely horrible; here are some possible data-related reasons why:

  • The dataset is far too small; 1500 images, even after augmentation, is not enough to learn 5 classes, especially since 4 of the classes (the gestures) are relatively similar compared to the fifth (the no-gesture class).

  • The variance in the images is high. This can lead to overfitting, as high variance makes it difficult for the model to learn the underlying structure.

  • The model architecture is too complex; an overly complex architecture can lead to the model "memorizing" the data, as the abundance of parameters picks up noise and adds variance to the model's predictions.

  • The image resolution is too low (64x64 is quite small considering the usual size is 224x224).

As for the model itself:

  • GANs in particular are difficult to train; the adversarial setup means there are degenerate modes the generator can converge to, effectively fooling the discriminator with garbage.

  • There needs to be a balance between how much the generator trains versus the discriminator on each batch at every epoch. Sometimes the generator needs more training since the discriminator learns faster, and vice versa.

Overall, a great learning experience.

Extensions

  • Make a conditional GAN that generates a specified gesture on demand (e.g. a thumbs up).

  • Test the architecture on other datasets (such as MNIST, Fashion-MNIST, or ImageNet). The data here was collected by myself, which I have learned the hard way is far from ideal.

  • Alter the training pipeline to favor either the generator or the discriminator (i.e. train one or the other for more than one iteration per step to control the speed of learning).

  • Not quite an extension, but I plan to try making my own diffusion model architecture for a given dataset.

References

[1] https://en.wikipedia.org/wiki/MNIST_database

[2] https://arxiv.org/abs/1511.06434

[3] https://github.com/soumith/ganhacks

[4] https://arxiv.org/abs/1406.2661

[5] https://en.wikipedia.org/wiki/Fr%C3%A9chet_inception_distance
