A Transformer-based Approach to Video Frame-level Prediction in Affective Behavior Analysis In-the-wild

Introduction

5th Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW5)

Our repo for the competition is here. The feature and trained models can be found in this archive

Our paper is accepted at 11th International Conference on Big Data Applications and Services.

Dependency

We borrow EfficientNet of Savchenko, we must use the exact version of timm:

pip install timm==0.4.5

The pre-trained EfficientNet B0 on Facial Behavior Tasks of Savchenko is in this project

Model Architecture

image info

Result

Evaluation metrics of Expression classification on Aff-Wild2 Validation set

Model F1
Effnet+MLP 0.3327
Effnet+Transformer Encoder (N=4, h=4) 0.3615
Effnet+Transformer Encoder (N=4, h=4), Augment (1) 0.4400
Effnet+Transformer Encoder (N=4, h=8) , Augment (2) 0.4424
Effnet+Transformer Encoder (N=6, h=4) , Augment (3) 0.4555
Average Ensemble (1)(2) 0.4663
Average Ensemble (1)(3) 0.4672
Average Ensemble (3)(2) 0.4729
Average Ensemble (1)(2)(3) 0.4775

Evaluation metrics of Valence-Arousal estimation on Aff-Wild2 Validation set

Model F1
Effnet+Transformer Encoder (N=4, h=4) (1) 0.48296
Effnet+Transformer Encoder (N=4, h=8) (2) 0.48819
Effnet+Transformer Encoder (N=6, h=4) (3) 0.47389
Average Ensemble (1)(2) 0.49684
Average Ensemble (1)(3) 0.49679
Average Ensemble (3)(2) 0.49874
Average Ensemble (1)(2)(3) 0.50290

Evaluation metrics of Action Unit Detection on Aff-Wild2 Validation set

Model F1
Effnet+Transformer Encoder (N=4, h=4) (1) 0.51696
Effnet+Transformer Encoder (N=4, h=8) (2) 0.51146
Effnet+Transformer Encoder (N=6, h=4) (3) 0.51192
Average Ensemble (1)(2) 0.51960
Average Ensemble (1)(3) 0.52021
Average Ensemble (3)(2) 0.51709
Average Ensemble (1)(2)(3) 0.52085