MAE-DFER: Efficient Masked Autoencoder for Self-supervised Dynamic Facial Expression Recognition (ACM MM 2023)

[arXiv], [ACM Digital Library]
Licai Sun, Zheng Lian, Bin Liu, and Jianhua Tao
University of Chinese Academy of Sciences & Institute of Automation, Chinese Academy of Sciences & Tsinghua University

📰 News

[2024.01.02] We provide the fine-tuned models across five folds on MAFW. Please check them below.
[2023.12.31] We upload the pre-training code.
[2023.12.13] We provide the fine-tuned model on FERV39k. Please check it below.
[2023.12.11] We provide the fine-tuned models across five folds on DFEW. Please check them below.
[2023.10.31] We upload the poster of MAE-DFER for ACM MM 2023.

✨ Overview

Dynamic Facial Expression Recognition (DFER) faces a supervised dilemma. On the one hand, current efforts in DFER focus on developing various deep supervised models, yet they achieve only incremental progress, mainly because of the longstanding lack of large-scale, high-quality datasets. On the other hand, due to the ambiguity and subjectivity of facial expression perception, acquiring large-scale, high-quality DFER samples is very time-consuming and labor-intensive. Considering the massive number of unlabeled facial videos on the Internet, this work explores a new path (i.e., self-supervised learning) that can fully exploit large-scale unlabeled data to advance DFER.


Overview of our MAE-DFER.

Inspired by the recent success of VideoMAE, MAE-DFER makes an early attempt to devise a novel masked-autoencoder-based self-supervised framework for DFER. It improves VideoMAE by developing an efficient LGI-Former as the encoder and by introducing joint masked appearance and motion modeling. With these two core designs, MAE-DFER largely reduces the computational cost during fine-tuning (to about 38% of the FLOPs) while achieving comparable or even better performance.
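The motion part of the joint target is built from temporal frame differences (the pre-training script name, with_diff_target, reflects this). The sketch below is an illustrative simplification of how such a target can be formed, not the repository's actual implementation; the function name and tensor shapes are assumptions.

```python
import torch

def joint_targets(clip: torch.Tensor):
    """Illustrative joint appearance and motion targets for one clip.

    clip: float tensor of shape (T, C, H, W), values in [0, 1].
    Returns the appearance target (the raw frames) and a motion target
    built from temporal frame differences (T - 1 steps).
    """
    appearance = clip                  # appearance target: the frames themselves
    motion = clip[1:] - clip[:-1]      # motion target: frame differences
    return appearance, motion

# e.g. app, mot = joint_targets(torch.rand(16, 3, 160, 160))
```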


The architecture of LGI-Former.

Extensive experiments on six DFER datasets show that MAE-DFER consistently outperforms the previous best supervised methods by significant margins (+5∼8% UAR on three in-the-wild datasets and +7∼12% WAR on three lab-controlled datasets), demonstrating that it can learn powerful dynamic facial representations for DFER via large-scale self-supervised pre-training. We believe MAE-DFER paves a new way for the advancement of DFER and can inspire more relevant research in this field and even in other related tasks (e.g., dynamic micro-expression recognition and facial action unit detection).

🚀 Main Results

✨ DFEW

Result_on_DFEW

✨ FERV39k

Result_on_FERV39k

✨ MAFW

Result_on_MAFW

👀 Visualization

✨ Reconstruction

Sample_with_showing_frame_difference

More samples without showing frame difference: More_samples_without_showing_frame_difference

✨ t-SNE on DFEW

t-SNE_on_DFEW

🔨 Installation

Main prerequisites:

  • Python 3.8
  • PyTorch 1.7.1 (CUDA 10.2)
  • timm==0.4.12
  • einops==0.6.1
  • decord==0.6.0
  • scikit-learn==1.1.3
  • scipy==1.10.1
  • pandas==1.5.3
  • numpy==1.23.4
  • opencv-python==4.7.0.72
  • tensorboardX==2.6.1

If some packages are missing, please refer to environment.yml for more details.
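As an optional sanity check after installation, you can verify that the core pinned packages import correctly; the expected versions below mirror the list above.

```python
# Optional post-install sanity check for the core dependencies.
import torch
import timm
import decord  # used for efficient video loading

print(torch.__version__)           # expected: 1.7.1
print(torch.cuda.is_available())   # True if the CUDA 10.2 build is set up
print(timm.__version__)            # expected: 0.4.12
```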

➡️ Data Preparation

Please follow the files (e.g., dfew.py) in preprocess for data preparation.

Specifically, you need to generate annotations for the dataloader (one "<path_to_video> <video_class>" pair per line in annotations). The annotations usually include train.csv, val.csv, and test.csv. The format of a *.csv file is:

dataset_root/video_1  label_1
dataset_root/video_2  label_2
dataset_root/video_3  label_3
...
dataset_root/video_N  label_N

An example train.csv for DFEW fold 1 (fd1) looks like this:

/mnt/data1/brain/AC/Dataset/DFEW/Clip/jpg_256/02522 5
/mnt/data1/brain/AC/Dataset/DFEW/Clip/jpg_256/02536 5
/mnt/data1/brain/AC/Dataset/DFEW/Clip/jpg_256/02578 6

Note that the labels for the pre-training dataset (i.e., VoxCeleb2) are dummy labels; you can simply use 0 (see voxceleb2.py).
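A minimal sketch of writing such an annotation file follows; write_annotation and its arguments are illustrative helpers, not part of the repository's preprocessing scripts.

```python
def write_annotation(samples, out_csv):
    """Write one '<path_to_video> <video_class>' line per sample.

    samples: iterable of (video_path, label) pairs. For the unlabeled
    pre-training data (VoxCeleb2), use 0 as the dummy label.
    """
    with open(out_csv, "w") as f:
        for path, label in samples:
            f.write(f"{path} {label}\n")

# e.g. write_annotation([("dataset_root/video_1", 5)], "train.csv")
```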

🔄 Pre-training MAE-DFER

  • VoxCeleb2

    sh scripts/voxceleb2/pretrain_local_global_attn_depth16_region_size2510_with_diff_target_102.sh
    

    You can download our pre-trained model on VoxCeleb2 from here and put it into this folder.
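    Before launching fine-tuning, it can help to confirm that the downloaded checkpoint loads as expected. A minimal sketch is below; the file name and the 'model' key are assumptions, so adjust them to the file you actually downloaded.

    ```python
    import torch

    # Load the pre-trained VoxCeleb2 checkpoint on CPU and list a few tensors.
    # "voxceleb2_pretrained.pth" is a hypothetical file name.
    ckpt = torch.load("voxceleb2_pretrained.pth", map_location="cpu")
    state_dict = ckpt.get("model", ckpt)  # MAE-style checkpoints often nest weights under 'model'
    print(f"{len(state_dict)} parameter tensors")
    for name, tensor in list(state_dict.items())[:5]:
        print(name, tuple(tensor.shape))
    ```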

⤴️ Fine-tuning with pre-trained models

  • DFEW

    sh scripts/dfew/finetune_local_global_attn_depth16_region_size2510_with_diff_target_164.sh
    

    The fine-tuned checkpoints and logs across five folds on DFEW are provided as follows (see the sketch after this list for how UAR and WAR are computed):

    | Fold  | UAR   | WAR   | Fine-tuned Model |
    | ----- | ----- | ----- | ---------------- |
    | 1     | 62.59 | 74.88 | log / checkpoint |
    | 2     | 61.96 | 72.49 | log / checkpoint |
    | 3     | 64.00 | 74.91 | log / checkpoint |
    | 4     | 63.07 | 74.05 | log / checkpoint |
    | 5     | 65.42 | 75.81 | log / checkpoint |
    | Total | 63.41 | 74.43 | -                |
  • FERV39k

    sh scripts/ferv39k/finetune_local_global_attn_depth16_region_size2510_with_diff_target_164.sh
    

    The fine-tuned checkpoints and logs on FERV39k are provided as follows:

    | Version    | UAR   | WAR   | Fine-tuned Model |
    | ---------- | ----- | ----- | ---------------- |
    | Reproduced | 43.29 | 52.50 | log / checkpoint |
    | Reported   | 43.12 | 52.07 | log / -          |

    Note that we lost the original checkpoint for this dataset; however, the reproduced result is slightly better than that reported in the paper.

  • MAFW

    sh scripts/mafw/finetune_local_global_attn_depth16_region_size2510_with_diff_target_164.sh
    

    The fine-tuned checkpoints and logs across five folds on MAFW are provided as follows:

    | Fold               | UAR   | WAR   | Fine-tuned Model |
    | ------------------ | ----- | ----- | ---------------- |
    | 1                  | 36.11 | 46.71 | log / checkpoint |
    | 2                  | 42.37 | 54.82 | log / checkpoint |
    | 3                  | 46.25 | 58.87 | log / checkpoint |
    | 4                  | 45.42 | 59.50 | log / checkpoint |
    | 5                  | 41.66 | 55.27 | log / checkpoint |
    | Total (Reproduced) | 42.36 | 55.03 | -                |
    | Total (Reported)   | 41.62 | 54.31 | -                |

    Note that we lost the original checkpoints for this dataset; however, the reproduced result is slightly better than that reported in the paper.
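The tables above report UAR (unweighted average recall, i.e., the mean of per-class recalls) and WAR (weighted average recall, i.e., overall accuracy). A minimal sketch of computing both with scikit-learn, which is already a prerequisite:

```python
from sklearn.metrics import accuracy_score, recall_score

def uar_war(y_true, y_pred):
    """UAR: mean of per-class recalls; WAR: overall accuracy."""
    uar = recall_score(y_true, y_pred, average="macro")
    war = accuracy_score(y_true, y_pred)
    return uar, war

# Toy example with hypothetical labels/predictions:
print(uar_war([0, 0, 1, 2], [0, 1, 1, 2]))  # (0.8333..., 0.75)
```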

☎️ Contact

If you have any questions, please feel free to reach out to me at sunlicai2019@ia.ac.cn.

👍 Acknowledgements

This project is built upon VideoMAE. Thanks for their great codebase.

✏️ Citation

If you find this project helpful, please feel free to leave a star ⭐️ and cite our paper:

@inproceedings{sun2023mae,
    author = {Sun, Licai and Lian, Zheng and Liu, Bin and Tao, Jianhua},
    title = {MAE-DFER: Efficient Masked Autoencoder for Self-Supervised Dynamic Facial Expression Recognition},
    year = {2023},
    booktitle = {Proceedings of the 31st ACM International Conference on Multimedia},
    pages = {6110--6121}
}

@article{sun2023mae_arxiv,
  title={MAE-DFER: Efficient Masked Autoencoder for Self-supervised Dynamic Facial Expression Recognition},
  author={Sun, Licai and Lian, Zheng and Liu, Bin and Tao, Jianhua},
  journal={arXiv preprint arXiv:2307.02227},
  year={2023}
}
