Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework

Will appear at CVPR 2024!

TL;DR

Make-Your-Anchor is a personalized 2D avatar generation framework based on a diffusion model. It generates realistic human videos conditioned on SMPL-X sequences.

Abstract

Despite the remarkable progress of talking-head-based avatar-creating solutions, directly generating anchor-style videos with full-body motions remains challenging. In this study, we propose Make-Your-Anchor, a novel system that requires only a one-minute video clip of an individual for training and then enables the automatic generation of anchor-style videos with precise torso and hand movements. Specifically, we finetune a proposed structure-guided diffusion model on the input video to render 3D mesh conditions into human appearances. We adopt a two-stage training strategy for the diffusion model, effectively binding movements with specific appearances. To produce arbitrarily long videos, we extend the 2D U-Net in the frame-wise diffusion model to a 3D style without additional training cost, and we propose a simple yet effective batch-overlapped temporal denoising module to bypass the constraints on video length during inference. Finally, a novel identity-specific face enhancement module is introduced to improve the visual quality of facial regions in the output videos. Comparative experiments demonstrate the effectiveness and superiority of the system in terms of visual quality, temporal coherence, and identity preservation, outperforming SOTA diffusion/non-diffusion methods.
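To make the idea of batch-overlapped temporal denoising more concrete, here is a minimal sketch of how a long frame sequence might be split into overlapping batches and merged again. It is not the released implementation: the function names, the window and overlap sizes, and the plain averaging over overlapped frames are all assumptions made for illustration, and per-frame floats stand in for the latents a real denoiser would produce at each step.

```python
from typing import List, Sequence, Tuple


def overlapped_windows(num_frames: int, window: int, overlap: int) -> List[Tuple[int, int]]:
    """Split [0, num_frames) into windows of `window` frames sharing `overlap` frames."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if num_frames <= window:
        return [(0, num_frames)]
    stride = window - overlap
    starts = list(range(0, num_frames - window + 1, stride))
    # Add a final window flush with the end so no trailing frame is dropped.
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)
    return [(s, s + window) for s in starts]


def blend_windows(
    num_frames: int,
    windows: Sequence[Tuple[int, int]],
    window_outputs: Sequence[Sequence[float]],
) -> List[float]:
    """Average per-frame values wherever windows overlap (assumed blending rule)."""
    total = [0.0] * num_frames
    count = [0] * num_frames
    for (start, end), values in zip(windows, window_outputs):
        for offset, frame in enumerate(range(start, end)):
            total[frame] += values[offset]
            count[frame] += 1
    return [t / c for t, c in zip(total, count)]


if __name__ == "__main__":
    wins = overlapped_windows(num_frames=10, window=4, overlap=2)
    # Hypothetical stand-in for a denoiser: each window outputs its own index.
    outputs = [[float(i)] * (end - start) for i, (start, end) in enumerate(wins)]
    print(wins)
    print(blend_windows(10, wins, outputs))
```

Because every frame is covered by at least one window and neighbouring windows share frames, the sequence length is not bounded by the batch size, and the shared frames give adjacent batches a common reference for temporal consistency.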

Pipeline

Notes

The self-collected data described in the paper will not be released for privacy reasons; instead, we release the model trained on an open dataset.
Because this is a person-specific approach, we plan to release the pre-trained weights from the pre-training stage together with the fine-tuning code. Code and guidance for preprocessing the training data will be added later.
Due to the limitations of the current training dataset, our method performs better when the driving motion is similar in style to that of the target person (as the cross-person results show). We plan to increase the amount of pre-training and fine-tuning data to overcome this limitation.

Changelog

[2024.04.22]: Release the inference code and pretrained weights.

TODO

Inference code and checkpoints
Preprocess code and guidance
Fine-tuning code and pre-trained weights

Getting Started

Environment

Our code is based on PyTorch and Diffusers. Recommended requirements can be installed via

pip install -r requirements.txt

To process videos, FFmpeg must be installed.

For face alignment, please download the relevant files from this link and unzip them into the folder .\inference\insightface_func\models\.
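The prerequisites above can be checked before running anything. The following is a small sketch, not part of the repository: the function name `missing_prerequisites` is hypothetical, and it assumes it is given the repository root, so the face-alignment folder is looked up at inference/insightface_func/models beneath it.

```python
import shutil
from pathlib import Path
from typing import List


def missing_prerequisites(repo_root: str = ".") -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    # FFmpeg is needed for video processing and must be reachable on PATH.
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    # The face-alignment files are expected to be unzipped into this folder.
    models_dir = Path(repo_root) / "inference" / "insightface_func" / "models"
    if not models_dir.is_dir() or not any(models_dir.iterdir()):
        problems.append(f"no face-alignment files in {models_dir}")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites("."):
        print("MISSING:", problem)
```

The check only confirms that the folder exists and is non-empty; it does not verify which files are inside it.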

Download Inference Checkpoints

Please download the checkpoints from Google Drive and place them in the folder ./inference/checkpoints. Currently, only the checkpoints trained on the open dataset are provided.

Inference

We provide the inference code together with our released checkpoints. After downloading (or fine-tuning) the checkpoints and placing them in ./inference/checkpoints, run inference with:

bash inference.sh

Specifically, set the following five parameters in inference.sh to match your configuration:

## Please fill the parameters here
# path to the body model folder
body_weight_dir=./checkpoints/seth/body
# path to the head model folder
head_weight_dir=./checkpoints/seth/head
# path to the input poses
body_input_dir=./samples/poses/seth1
# path to the reference body appearance
body_prompt_img_pth=./samples/appearance/body.png
# path to the reference head appearance
head_prompt_img_pth=./samples/appearance/head.png

Generation takes about 5 minutes. The results are saved in ./inference/samples/output.
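Since a run takes several minutes, it can help to confirm that the five configured paths exist before launching it. The sketch below is an illustration rather than part of the repository: `missing_inputs` is a hypothetical helper, the values are copied from the example configuration above, and it assumes the relative paths are resolved against the ./inference folder (an assumption based on the checkpoints being placed in ./inference/checkpoints).

```python
from pathlib import Path
from typing import Dict, List

# Values copied from the example configuration; adjust to match your inference.sh.
PARAMS: Dict[str, str] = {
    "body_weight_dir": "./checkpoints/seth/body",
    "head_weight_dir": "./checkpoints/seth/head",
    "body_input_dir": "./samples/poses/seth1",
    "body_prompt_img_pth": "./samples/appearance/body.png",
    "head_prompt_img_pth": "./samples/appearance/head.png",
}


def missing_inputs(params: Dict[str, str], base_dir: str = "./inference") -> List[str]:
    """List every parameter whose path is absent under base_dir (assumed root)."""
    missing = []
    for name, rel in params.items():
        path = Path(base_dir) / rel
        # Parameters ending in "_dir" must be folders; the others must be files.
        ok = path.is_dir() if name.endswith("_dir") else path.is_file()
        if not ok:
            missing.append(f"{name}: {path}")
    return missing


if __name__ == "__main__":
    for entry in missing_inputs(PARAMS):
        print("MISSING:", entry)
```

An empty result means all five paths were found; it says nothing about whether the checkpoints themselves are valid.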

Video Results

Comparisons

main1.mp4

main2.mp4

Audio-Driven Results

audio-driven.mp4

Ablations

ablation.mp4

Cross-person Results

cross-person.mp4

Full-body Results

fullbody.mp4

Citation

@article{huang2024makeyouranchor,
  title={Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework},
  author={Huang, Ziyao and Tang, Fan and Zhang, Yong and Cun, Xiaodong and Cao, Juan and Li, Jintao and Lee, Tong-Yee},
  journal={arXiv preprint arXiv:2403.16510},
  year={2024}
}

Acknowledgements

Here are some great resources we benefited from:

TalkSHOW for preprocessing and audio-driven inference
SimSwap for the face preprocessing code
ControlVideo for the implementation of full-frame attention
