
Multilingual-StyleCLIP

  • Global direction Notebook: Open In Colab
  • Latent optimization Notebook: Open In Colab
  • Latent mapper Notebook: Open In Colab

Overview

Since the release of CLIP by OpenAI, many applications of this multi-modal model have appeared, including StyleCLIP. StyleCLIP combines a high-resolution image generator (StyleGAN) with a text-image model (CLIP). By measuring the cosine similarity between the CLIP embedding of a text prompt and the CLIP embedding of the StyleGAN-generated image, StyleCLIP makes it possible to manipulate an image conveniently with a text prompt.
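For intuition, here is a minimal sketch (not code from this repository) of that similarity measurement using OpenAI's clip package; in StyleCLIP the image would be the StyleGAN output.

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    clip_model, preprocess = clip.load("ViT-B/32", device=device)

    def clip_similarity(image: Image.Image, prompt: str) -> float:
        """Cosine similarity between an image and a text prompt in CLIP space."""
        image_input = preprocess(image).unsqueeze(0).to(device)
        text_tokens = clip.tokenize([prompt]).to(device)
        with torch.no_grad():
            image_emb = clip_model.encode_image(image_input)
            text_emb = clip_model.encode_text(text_tokens)
        image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
        return (image_emb @ text_emb.T).item()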

We further extend the benefits of StyleCLIP by plugging Multilingual-CLIP into this model. Multilingual-CLIP consists of two encoders: an image encoder and a fine-tuned text encoder capable of encoding text in a wide range of languages. Thus, our version of StyleCLIP manipulates an image not only with an English text prompt, but also with a text prompt in other languages, for example Korean.
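As a rough illustration of this swap, the sketch below encodes prompts in different languages with a Multilingual-CLIP text encoder; the multilingual_clip API calls and the M-CLIP/M-BERT-Base-ViT-B checkpoint name are assumptions for illustration, not taken from this repository's code.

    import transformers
    from multilingual_clip import pt_multilingual_clip

    # Assumed checkpoint name; the M-BERT Base ViT-B text encoder maps text into the
    # same embedding space as CLIP's ViT-B image encoder.
    model_name = "M-CLIP/M-BERT-Base-ViT-B"
    text_model = pt_multilingual_clip.MultilingualCLIP.from_pretrained(model_name)
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

    # The same concept in English and Korean; either embedding can drive the edit.
    prompts = ["a person with purple hair", "보라색 머리카락을 가진 사람"]
    text_embs = text_model.forward(prompts, tokenizer)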

The accuracy of the image-encoding step has also improved. The official image encoder in StyleCLIP is Encoder4Editing (e4e), which is used during both training and inference. However, we found empirically that e4e reconstructions can differ noticeably from the original input image. To overcome this issue, we encoded the mapper training datasets and the inference images with the ReStyle encoder. The ReStyle encoder, introduced in the paper "ReStyle: A Residual-Based StyleGAN Encoder via Iterative Refinement" (ICCV 2021), iteratively self-corrects the inverted latent code, resulting in increased accuracy.
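Conceptually, the ReStyle inversion loop looks roughly like the sketch below (illustrative names, not this repository's code): the encoder repeatedly predicts a residual update to the latent from the input image and the current reconstruction.

    import torch

    def restyle_invert(encoder, generator, image, avg_latent, n_iters=5):
        """Iterative residual refinement in the spirit of ReStyle (illustrative)."""
        latent = avg_latent.clone()                        # start from the average latent
        reconstruction = generator(latent)                 # initial reconstruction
        for _ in range(n_iters):
            x = torch.cat([image, reconstruction], dim=1)  # condition on the current estimate
            latent = latent + encoder(x)                   # predict a residual latent update
            reconstruction = generator(latent)             # re-synthesize and repeat
        return latent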

This repository contains:

  • PyTorch training code for the multilingual latent optimizer, latent mapper, and global direction
  • PyTorch inference code for the multilingual latent optimizer, latent mapper, and global direction
  • Pretrained latent mapper and global direction weights
  • CelebA-HQ dataset latents (encoded via ReStyle)
  • ReStyle encoder applied over pSp, pretrained on the FFHQ dataset
  • The Hugging Face transformer M-BERT Base ViT-B (Multilingual-CLIP text encoder)
  • CLIP
  • StyleGAN2

Setup

The experiments were run under the following conditions:

  • Python 3.7.12
  • Torch 1.10.0+cu11
  • Google Colab

Latent optimization

The code relies on rosinality's PyTorch implementation of StyleGAN2. Facial recognition weights and the pretrained ReStyle encoder can be downloaded here.

  • --description is the driving text (it can be in any language).
  • To control the manipulation effect, adjust the L2 lambda and ID lambda parameters (see the sketch below).
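Roughly, the optimization objective balances the CLIP loss against these two regularizers. The sketch below is illustrative only; function and argument names are not the repository's exact API, and the lambda values shown are placeholders.

    def optimization_loss(w, w_source, generator, source_image, text_emb,
                          clip_loss, identity_loss, l2_lambda=0.008, id_lambda=0.005):
        """Shape of a StyleCLIP-style latent-optimization objective (illustrative).

        clip_loss pulls the generated image toward the text prompt, l2_lambda keeps the
        edited latent close to the source latent, and id_lambda preserves facial identity.
        """
        image = generator(w)
        loss = clip_loss(image, text_emb)
        loss = loss + l2_lambda * ((w - w_source) ** 2).sum()
        loss = loss + id_lambda * identity_loss(image, source_image)
        return loss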

Usage

Given a textual description, one can either edit a given image or generate a random image that best fits the description. Both operations can be done through the main.py script or the optimization_playground.ipynb notebook (also available in Colab).
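A hypothetical invocation is shown below; only --description is documented above, and any other flags main.py accepts may differ.

    !python main.py --description "a person with purple hair"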

Latent mapper

The code relies on rosinality's PyTorch implementation of StyleGAN2 (see the Colab notebook linked above).

Training

  • This repository trains the mapper with a dataset that was inverted by the e4e encoder rather than the ReStyle encoder.
  • Inference on ReStyle-inverted images still works fine.
  • The e4e-inverted dataset is available in the original StyleCLIP repository.
  • To resume training, provide --checkpoint_path.
  • --description is the driving text (it can be in any language).
  • To control the manipulation effect, adjust the L2 lambda and ID lambda parameters.
  • Proper training takes roughly 10 hours.
# The Korean prompt below means "a person with purple hair".
!python models/mapper/scripts/train.py --exp_dir exp_dir --no_fine_mapper --description "보라색 머리카락을 가진 사람" \
--latents_train_path data/celebA/train_faces.pt --latents_test_path data/celebA/test_faces.pt

Inference

Global Direction

The code relies on the official TensorFlow implementation of StyleGAN2. Facial recognition weights and the pretrained ReStyle encoder can be downloaded here.

Usage

Open the notebook in Colab and run all the cells.

In the last cell you can play with the image. beta corresponds to the disentanglement threshold, and alpha to the manipulation strength. After you set the desired parameters, run the last cell again to generate the image.
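In rough terms, the two parameters act as in the sketch below (illustrative names, not the notebook's exact code): beta masks out StyleSpace channels whose relevance to the text direction falls below the threshold, and alpha scales the step along the remaining direction.

    import numpy as np

    def apply_global_direction(s_source, direction, relevance, alpha, beta):
        """Apply a text-derived global direction in StyleSpace (illustrative)."""
        masked = np.where(np.abs(relevance) >= beta, direction, 0.0)  # disentanglement threshold
        return s_source + alpha * masked                              # manipulation strength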

Editing Examples

Encoder results comparison

Images below are from CelebA-HQ and were inverted into latent space via the ReStyle encoder.

Latent optimization

  • Compare results in other languages: English, Korean, Chinese, Russian
  • Original image
  • Text prompt "a person with purple hair"

Latent mapper

  • Compare results in other languages: English, Korean, Russian, Japanese
  • Original image
  • Text prompt "a child"

Global Direction

  • Text prompts "a smiling face" and "a man's face" in Korean
  • Compare results in other languages: English, Korean, Chinese, Spanish
  • Original image
  • Text prompt "a smiling face"
  • Text prompt "a male face"

Acknowledgement
