MAE-DFER-CA: Combination with Efficient Masked Autoencoder for Self-supervised Dynamic Facial Expression Recognition and CA_Module

Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology

📰 News

✨ Overview

Dynamic Facial Expression Recognition (DFER) faces a supervised learning dilemma. On the one hand, current efforts in DFER focus on developing various deep supervised models, but they achieve only incremental progress, mainly because of the long-standing lack of large-scale, high-quality datasets. On the other hand, due to the ambiguity and subjectivity of facial expression perception, acquiring large-scale, high-quality DFER samples is time-consuming and labor-intensive. Considering the massive amount of unlabeled facial video on the Internet, this work explores a new direction (i.e., self-supervised learning) that can fully exploit large-scale unlabeled data to advance DFER. We also add a CA_Module that learns muscle motion between video frames; by learning these motion pattern features, we can further improve performance.


Overview of our MAE-DFER+CA_Module.

Inspired by the recent success of VideoMAE, MAE-DFER makes an early attempt to devise a novel masked autoencoder based self-supervised framework for DFER. It improves VideoMAE by developing an efficient LGI-Former as the encoder and introducing joint masked appearance and motion modeling. With these two core designs, MAE-DFER largely reduces the computational cost during fine-tuning (to about 38% of the FLOPs) while achieving comparable or even better performance.


The architecture of LGI-Former.
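To make the joint masked appearance and motion modeling concrete, here is a minimal sketch of the two reconstruction targets, assuming a (B, C, T, H, W) clip tensor. The actual implementation operates on patchified masked tokens, and the loss weight alpha is our illustrative name, not a repo parameter:

    import torch
    import torch.nn.functional as F

    def joint_targets(video):
        """video: (B, C, T, H, W) normalized clip.
        Appearance target: the raw pixels.
        Motion target: frame-to-frame differences, which encode facial movement.
        (Patchifying and masking are omitted for brevity.)
        """
        appearance = video
        motion = video[:, :, 1:] - video[:, :, :-1]
        return appearance, motion

    def joint_loss(pred_appearance, pred_motion, video, alpha=0.5):
        # Reconstruct both targets; alpha balances appearance vs. motion.
        appearance, motion = joint_targets(video)
        return (F.mse_loss(pred_appearance, appearance)
                + alpha * F.mse_loss(pred_motion, motion))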

Inspired by the success of MMNET (a micro-expression recognition network), we change the input of the CA_Module so that it learns the difference between the first frame and every subsequent frame. This ensures that the module captures motion pattern features over the whole video.


The architecture of CA_Module.
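A minimal sketch of this input change, with an assumed tensor layout and function name (see the CA_Module code in this repo for the real interface):

    import torch

    def ca_module_input(frames):
        """frames: (B, T, C, H, W) video clip.
        Return the difference between every frame and the first frame, so the
        CA_Module sees the muscle motion accumulated over the whole video.
        """
        first = frames[:, :1]   # (B, 1, C, H, W), broadcasts over T
        return frames - first   # the first frame's difference is all zeros

For the first frame the difference is all zeros, so the module's response is driven entirely by motion relative to the starting frame.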

Extensive experiments on the FERV39k dataset show that our MAE-DFER-CA consistently outperforms MAE-DFER.

🚀 Main Results

✨ FERV39k

Results on FERV39k.

🔨 Installation

Main prerequisites:

  • Python 3.8
  • PyTorch 1.7.1 (CUDA 10.2)
  • timm==0.4.12
  • einops==0.6.1
  • decord==0.6.0
  • scikit-learn==1.1.3
  • scipy==1.10.1
  • pandas==1.5.3
  • numpy==1.23.4
  • opencv-python==4.7.0.72
  • tensorboardX==2.6.1

If any packages are missing, please refer to environment.yml for more details.

➡️ Data Preparation

Please follow the files (e.g., ferv39k.py) in preprocess for data preparation.

Specifically, you need to generate annotations for the dataloader ("<path_to_video> <video_class>" on each line). The annotations usually include train.csv, val.csv and test.csv. The format of a *.csv file is like:

dataset_root/video_1  label_1
dataset_root/video_2  label_2
dataset_root/video_3  label_3
...
dataset_root/video_N  label_N

An example of train.csv is shown as follows:

/home/drink36/Desktop/Dataset/39K/Face/Action/Happy/0267 0
/home/drink36/Desktop/Dataset/39K/Face/Action/Happy/0316 0
/home/drink36/Desktop/Dataset/39K/Face/Action/Happy/0090 0
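The following sketch writes such a file from a directory laid out like the example above. The emotion list and its label order are assumptions (only Happy -> 0 is confirmed by the example), so see preprocess/ferv39k.py for the authoritative script:

    import os

    # Assumed label order; only Happy -> 0 is confirmed by the example above.
    EMOTIONS = ["Happy", "Sad", "Neutral", "Angry", "Surprise", "Disgust", "Fear"]

    def write_annotation(dataset_root, out_csv):
        """Write one "<path_to_video> <video_class>" line per clip folder."""
        with open(out_csv, "w") as f:
            for label, emotion in enumerate(EMOTIONS):
                emotion_dir = os.path.join(dataset_root, emotion)
                if not os.path.isdir(emotion_dir):
                    continue
                for clip in sorted(os.listdir(emotion_dir)):
                    f.write(f"{os.path.join(emotion_dir, clip)} {label}\n")

    write_annotation("/home/drink36/Desktop/Dataset/39K/Face/Action", "train.csv")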

📍Pre-trained Model

Download the model pre-trained on VoxCeleb2 from this link and put it into this folder.
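Before fine-tuning, you can sanity-check the downloaded checkpoint. This is a minimal sketch: the file name is a placeholder, and the "model" key layout is an assumption based on common VideoMAE-style checkpoints:

    import torch

    ckpt = torch.load("voxceleb2_pretrained.pth", map_location="cpu")
    state = ckpt.get("model", ckpt)  # weights are often nested under a "model" key
    print(f"{len(state)} tensors in the checkpoint")
    for name, tensor in list(state.items())[:5]:
        print(name, tuple(tensor.shape))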

⤴️ Fine-tuning with pre-trained models

  • FERV39K

    bash scripts/ferv39k/finetune_local_global_attn_depth16_region_size2510_with_diff_target_164.sh
    

    The fine-tuning checkpoints and logs for the four different methods of combining the CA_Module with MAE-DFER on FERV39K are as follows (a sketch of the basic fusion operations follows the table):

    Method    UAR (%)  WR (%)  Fine-tuned Model
    add       42.30    52.40   log / checkpoint
    add(4*5)  41.00    51.33   log / checkpoint
    add(pos)  42.02    52.08   log / checkpoint
    cat       42.20    52.30   log / checkpoint
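    For reference, here is a minimal sketch of the two basic fusion strategies named above ("add" and "cat"). Shapes and variable names are illustrative assumptions, not the repo's API:

        import torch

        mae_feat = torch.randn(8, 512)  # MAE-DFER video embedding (batch of 8)
        ca_feat = torch.randn(8, 512)   # CA_Module motion embedding (assumed same width)

        fused_add = mae_feat + ca_feat                     # "add": element-wise sum, width unchanged
        fused_cat = torch.cat([mae_feat, ca_feat], dim=1)  # "cat": channel concatenation, width doubled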

☎️ Contact

If you have any questions, please feel free to reach out to me at ooo910809@gmail.com.

👍 Acknowledgements

This project is built upon VideoMAE, MAE-DFER and MMNET. Thanks for their great codebases.
