Faculty of Computer Science, Higher School of Economics
Scientific advisors: Aibek Alanov, Maksim Nakhodnov
Modern generative adversarial networks (GANs) enable the synthesis of high-quality images and provide tools for fine-grained image manipulation. However, out-of-domain generation requires either additional fine-tuning of the generator or inflexible latent optimization, which has to be repeated for each new image. A novel encoder, which learns offsets in the StyleSpace (S space) and allows a GAN generator to adapt to a new domain in a single forward pass, is proposed and implemented. An ablation study of the loss components and of the use of projection embeddings is conducted. A regularization method for the multi-domain setting is proposed. A thorough comparison with prior domain adaptation methods is made.
In this work we focus on an encoder-based approach, inspired by BlendGAN, for image-based domain adaptation. Similarly, our task is to train an encoder that takes a domain image and returns its representation, which can be incorporated into a generator to synthesize images adapted to the new domain. We call this approach Domain Encoder GAN (DEGAN). A crucial aspect is that neither the generator nor the discriminator is trained within this pipeline, which makes the training process easier and more flexible. In our method the encoder predicts offsets in the StyleSpace, taking into account the results of the StyleDomain paper, which show that parameterization in the StyleSpace is compact and efficient compared to other parameterizations. Generally, our optimization process looks like
where
Our training pipeline is as follows:
- Sample a latent $z \sim \mathcal{N}(0, I)$ and propagate it through the mapping network and affine layers to get a style vector $s \in \mathcal{S}$, which plays the role of the representation of a source image.
- Given a domain image $I_d$, predict a domain offset in the StyleSpace using the domain encoder: $\phi(I_d) = \Delta s$.
- Generate a source image $I_s = G(s)$ and an adapted image $I_g = G(s + \Delta s)$ using the frozen generator.
- Get embeddings $E_{CLIP}(I_s), E_{CLIP}(I_d), E_{CLIP}(I_g)$ using a pretrained ViT-B/16 CLIP.
- Based on the objects above, calculate the loss function and make an optimizer step.
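The order of operations in one training iteration can be sketched as a framework-agnostic skeleton. This is only an illustration of the data flow, not the code in `train.py`: the function name `training_step`, the `StepOutput` container, and all callables passed in are hypothetical stand-ins, and the loss is left as a parameter because its exact form is not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]  # toy stand-in for tensors (style vectors, images, embeddings)


@dataclass
class StepOutput:
    source_image: Vector
    adapted_image: Vector
    loss: float


def training_step(
    sample_style: Callable[[], Vector],          # z -> mapping network -> affine layers -> s
    domain_encoder: Callable[[Vector], Vector],  # phi(I_d) = delta_s (the only trained part)
    generator: Callable[[Vector], Vector],       # frozen G
    clip_embed: Callable[[Vector], Vector],      # frozen E_CLIP
    loss_fn: Callable[[Vector, Vector, Vector], float],  # hypothetical signature
    optimizer_step: Callable[[float], None],     # updates encoder parameters only
    domain_image: Vector,
) -> StepOutput:
    # 1. Sample a style vector s that represents the source image.
    s = sample_style()
    # 2. Predict the domain offset in StyleSpace from the domain image.
    delta_s = domain_encoder(domain_image)
    # 3. Generate the source image G(s) and the adapted image G(s + delta_s).
    source_image = generator(s)
    adapted_image = generator([a + b for a, b in zip(s, delta_s)])
    # 4. Embed the source, domain, and adapted images with CLIP.
    e_s = clip_embed(source_image)
    e_d = clip_embed(domain_image)
    e_g = clip_embed(adapted_image)
    # 5. Compute the loss and take an optimizer step.
    loss = loss_fn(e_s, e_d, e_g)
    optimizer_step(loss)
    return StepOutput(source_image, adapted_image, loss)
```

In an autograd framework, `optimizer_step` would correspond to backpropagating the loss and updating only the encoder parameters, since the generator and CLIP stay frozen.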
Following recent advances in computer vision, we choose ViT-B/16 as the backbone of the domain encoder. Because we want to predict offsets in a StyleSpace of dimensionality 9088, simply training a single head on top of the 768-dimensional [CLS] embedding from the last ViT encoder layer would create a bottleneck. Furthermore, different layers of ViT operate on different image resolutions. This coincides with the structure of the StyleSpace, which consists of vectors that are used at 9 groups of resolutions, from 4x4 to 1024x1024. These facts motivated us to add trainable heads to the last 9 layers of the backbone, so that the predictions for a particular resolution are consistent with the resolution of the image being processed. The StyleGAN2 generator uses two convolution layers and one ToRGB layer at each resolution, except for 4x4. Therefore, each head takes the [CLS] embedding from its layer and propagates it through several independent 3-layer MLPs with LayerNorm and GELU activation, each of which predicts an offset for a particular generator layer. Our domain encoder has about 37 million parameters. A detailed version of the domain encoder architecture is presented in the figure below.
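The StyleSpace dimensionality and the output sizes of the per-resolution heads can be checked with a short calculation. The sketch below assumes the standard StyleGAN2 channel widths for a 1024x1024 generator (512 channels up to 64x64, then halving) and uses the fact that each modulated layer's style vector has as many entries as that layer has input channels. The names `CHANNELS` and `stylespace_layout` are illustrative and not taken from the repository.

```python
# Feature-map channels per resolution.
# Assumption: standard StyleGAN2 widths for a 1024x1024 generator.
CHANNELS = {4: 512, 8: 512, 16: 512, 32: 512, 64: 512,
            128: 256, 256: 128, 512: 64, 1024: 32}


def stylespace_layout(channels):
    """Return {resolution: {layer_name: style_dim}} for every modulated layer.

    A style vector has one entry per input channel of its layer:
    conv0 (upsampling) reads the previous resolution's feature map,
    conv1 and ToRGB read the current resolution's feature map.
    The 4x4 block has no conv0 because it starts from a constant input.
    """
    layout = {}
    prev_res = None
    for res in sorted(channels):
        group = {}
        if prev_res is not None:
            group["conv0"] = channels[prev_res]
        group["conv1"] = channels[res]
        group["to_rgb"] = channels[res]
        layout[res] = group
        prev_res = res
    return layout


layout = stylespace_layout(CHANNELS)
total_dim = sum(sum(group.values()) for group in layout.values())
num_layers = sum(len(group) for group in layout.values())

for res, group in layout.items():
    # One head per resolution group, one MLP output per generator layer.
    print(f"{res}x{res}: {group}")
print("resolution groups:", len(layout))
print("generator layers:", num_layers)
print("StyleSpace dimensionality:", total_dim)
```

Under these assumptions the layout has 9 resolution groups and 26 modulated layers whose style dimensions sum to 9088, so a head would need two MLPs for the 4x4 group and three for every other group.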
Training config can be found in configs.
python train.py