Multimodal Learning Inspired by Multitask Learning in Video Emotion Recognition

Introduction

This is my ongoing project, which applies the idea of multitask learning to multimodal learning. In multitask learning, the prediction for one task can improve the accuracy of the prediction for another task. Analogously, for multimodal learning we aim to use the prediction generated from one modality to enhance the prediction obtained from another modality, as well as the prediction of the late-fusion output.
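As a rough illustration of this idea (not the exact architecture, which is shown in the framework figure below), here is a minimal PyTorch-style sketch: three modality encoders produce unimodal predictions that are fed, together with the fused features, into a late-fusion head, and everything is trained jointly with a multitask-style loss. All module names, dimensions, and the single-label loss are illustrative assumptions, not this project's implementation.

```python
# Hedged sketch: late-fusion model with auxiliary unimodal heads, trained
# multitask-style. Encoders, dimensions, and names are illustrative only.
import torch
import torch.nn as nn

class MultitaskFusionSketch(nn.Module):
    def __init__(self, dim_a, dim_v, dim_t, hidden=256, num_classes=6):
        super().__init__()
        # One projection per modality (stand-ins for the real audio/visual/text encoders).
        self.enc_a = nn.Linear(dim_a, hidden)
        self.enc_v = nn.Linear(dim_v, hidden)
        self.enc_t = nn.Linear(dim_t, hidden)
        # Auxiliary unimodal heads: their predictions act like auxiliary tasks.
        self.head_a = nn.Linear(hidden, num_classes)
        self.head_v = nn.Linear(hidden, num_classes)
        self.head_t = nn.Linear(hidden, num_classes)
        # The late-fusion head also sees the unimodal predictions, so one
        # modality's prediction can inform the fused prediction.
        self.head_fusion = nn.Linear(3 * hidden + 3 * num_classes, num_classes)

    def forward(self, xa, xv, xt):
        ha, hv, ht = self.enc_a(xa), self.enc_v(xv), self.enc_t(xt)
        pa, pv, pt = self.head_a(ha), self.head_v(hv), self.head_t(ht)
        fused = torch.cat([ha, hv, ht, pa, pv, pt], dim=-1)
        return self.head_fusion(fused), (pa, pv, pt)

# Multitask-style objective: the fusion loss plus weighted unimodal losses.
# (A multi-label emotion setup would swap in BCEWithLogitsLoss.)
def multitask_loss(logits_f, unimodal_logits, target, aux_weight=0.3):
    ce = nn.CrossEntropyLoss()
    loss = ce(logits_f, target)
    for logits_m in unimodal_logits:
        loss = loss + aux_weight * ce(logits_m, target)
    return loss
```

In the actual framework the linear encoders are replaced by the audio, visual, and text backbones used in this project; the sketch only shows how unimodal predictions can feed the late-fusion prediction.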

(Figure: overall framework of the proposed method)

Dataset

We conducted the experiments on the reorganized IEMOCAP and CMU-MOSEI datasets provided by Dai et al., and on the CH-SIMS dataset. The preprocessed IEMOCAP and CMU-MOSEI can be downloaded here. CH-SIMS and its baselines can be found here. For all datasets, we used MTCNN to extract facial images and resized them to 260x260.
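For reference, below is a minimal face-extraction sketch along these lines, assuming the facenet-pytorch implementation of MTCNN; the directory names and file layout are illustrative, not the project's actual preprocessing script.

```python
# Hedged sketch: crop faces with MTCNN and save 260x260 crops.
# Assumes the facenet-pytorch package; paths are illustrative.
from pathlib import Path
from PIL import Image
import torch
from facenet_pytorch import MTCNN

device = "cuda" if torch.cuda.is_available() else "cpu"
# image_size=260 makes MTCNN return 260x260 face crops directly.
mtcnn = MTCNN(image_size=260, margin=20, post_process=False, device=device)

src_dir, dst_dir = Path("frames"), Path("faces")  # illustrative directories
dst_dir.mkdir(exist_ok=True)
for frame_path in sorted(src_dir.glob("*.jpg")):
    img = Image.open(frame_path).convert("RGB")
    # Detect the largest face and write the resized crop to disk.
    mtcnn(img, save_path=str(dst_dir / frame_path.name))
```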

Environment

  • timm 0.4.5
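The only pinned dependency above is timm. As a minimal sketch, a timm 0.4.5 backbone can be created for the 260x260 face crops as follows; the actual backbone used in this project is not specified here, and efficientnet_b2 is only an illustrative choice because its default input resolution is 260.

```python
# Hedged sketch: create a timm backbone for 260x260 face crops.
# The architecture name is illustrative; only the input size matches the README.
import timm
import torch

# num_classes=0 returns pooled features instead of classification logits.
backbone = timm.create_model("efficientnet_b2", pretrained=True, num_classes=0)

faces = torch.randn(8, 3, 260, 260)   # a dummy batch of face crops
features = backbone(faces)            # shape: (batch_size, feature_dim)
print(features.shape)
```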

Results

Test results on IEMOCAP

Per-emotion accuracy (Acc) and F1 scores (%) for Angry, Excited, Frustrated, Happy, Neutral, and Sad, together with their averages (Avg).

Model Ang_Acc Ang_F1 Exc_Acc Exc_F1 Fru_Acc Fru_F1 Hap_Acc Hap_F1 Neu_Acc Neu_F1 Sad_Acc Sad_F1 Avg_Acc Avg_F1
Audio 82.1 46.2 55.0 30.3 45.4 42.4 12.8 17.2 48.6 41.4 77.9 48.8 53.6 37.7
Visual 80.9 53.8 84.3 56.3 68.3 53.9 90.1 43.7 75.1 55.3 86.6 55.8 80.9 53.1
Text 86.0 57.9 86.8 56.8 70.1 54.7 90.1 40.5 73.5 49.0 87.0 58.0 82.2 52.8
LF-LSTM 71.2 49.4 79.3 57.2 68.2 51.5 67.2 37.6 66.5 47.0 78.2 54.0 71.8 49.5
LF-Trans 81.9 50.7 85.3 57.3 60.5 49.3 85.2 37.6 72.4 49.7 87.4 57.4 78.8 50.3
EmoEmbs 65.9 48.9 73.5 58.3 68.5 52.0 69.6 38.3 73.6 48.7 80.8 53.0 72.0 49.8
MulT 77.9 60.7 76.9 58.0 72.4 57.0 80.0 46.8 74.9 53.7 83.5 65.4 77.6 56.9
BIMHA 77.2 57.6 78.3 56.1 73.9 54.2 83.4 43.2 76.4 50.9 83.8 63.7 78.8 54.3
CMHA 88.6 61.1 87.9 60.5 75.1 56.3 89.0 45.8 76.5 51.2 88.3 61.6 84.3 56.1
MESE 88.2 62.8 88.3 61.2 74.9 58.4 89.5 47.3 77.0 52.0 88.6 62.2 84.4 57.4
FE2E 88.7 63.9 89.1 61.9 71.2 57.8 90.0 44.8 79.1 58.4 89.1 65.7 85.7 57.1
Le et al. 90.1 66.8 88.5 66.8 77.7 57.0 90.5 48.5 78.1 56.6 90.7 69.6 85.9 60.9
Ours 89.5 67.8 90.8 70.7 78.9 59.9 89.9 55.5 79.1 60.6 91.4 72.9 86.6 64.6

Test results on MOSEI

Per-emotion weighted accuracy (WAcc) and F1 scores (%) for Angry, Disgusted, Fearful, Happy, Sad, and Surprised, together with their averages (Avg).

Model Ang_WAcc Ang_F1 Dis_WAcc Dis_F1 Fea_WAcc Fea_F1 Hap_WAcc Hap_F1 Sad_WAcc Sad_F1 Sur_WAcc Sur_F1 Avg_WAcc Avg_F1
Audio 53.9 40.5 61.0 35.7 59.0 19.6 50.0 69.3 61.2 45.8 58.4 21.7 57.2 38.8
Visual 58.9 38.2 63.2 37.6 59.1 21.8 55.7 70.3 56.2 42.8 53.0 17.9 57.7 38.1
Text 65.9 48.4 74.0 56.0 62.8 27.0 62.3 72.0 60.2 45.3 60.9 26.0 64.3 45.8
LF-LSTM 64.5 47.1 70.5 49.8 61.7 22.2 61.3 73.2 63.4 47.2 57.1 20.6 63.1 43.3
LF-Trans 65.3 47.7 74.4 51.9 62.1 24.0 60.6 72.9 60.1 45.5 62.1 24.2 64.1 44.4
EmoEmbs 66.8 49.4 69.6 48.7 63.8 23.4 61.2 71.9 60.5 47.5 63.3 24.0 64.2 44.2
MulT 64.9 47.5 71.6 49.3 62.9 25.3 67.2 75.4 64.0 48.3 61.4 25.6 65.4 45.2
BIMHA 65.3 47.4 70.5 48.9 61.8 24.7 65.8 72.1 62.6 47.9 62.5 24.9 64.8 44.3
CMHA 65.9 49.1 73.6 53.2 63.4 27.3 65.2 72.1 64.2 46.7 64.5 26.6 66.1 45.8
MESE 66.8 49.3 75.6 56.4 65.8 28.9 64.1 72.3 63.0 46.6 65.7 27.2 66.8 46.8
FE2E 66.9 49.5 75.4 57.2 63.8 27.1 61.9 72.3 65.6 49.3 61.5 26.9 65.8 47.0
Le et al. 67.5 50.2 76.3 57.0 69.0 29.0 63.0 72.6 65.5 49.2 65.7 27.6 67.8 47.6
Ours 71.4 52.6 81.4 57.4 80.5 30.2 68.5 75.4 63.9 51.6 80.3 30.3 74.3 49.6
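To make the numbers in the two tables above easier to interpret, here is a hedged sketch of how per-emotion accuracy, weighted accuracy, and F1 are commonly computed and then averaged in multi-label emotion recognition. It assumes weighted accuracy follows the balanced-accuracy definition and F1 is the binary per-emotion variant; these are assumptions about the standard protocol, not this project's exact evaluation script.

```python
# Hedged sketch: per-emotion binary metrics for multi-label emotion recognition.
# Assumes y_true / y_pred are 0/1 numpy arrays of shape (num_samples, num_emotions);
# the balanced-accuracy definition of weighted accuracy is an assumption.
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

def per_emotion_metrics(y_true, y_pred, emotion_names):
    rows = []
    for i, name in enumerate(emotion_names):
        rows.append({
            "emotion": name,
            "acc": accuracy_score(y_true[:, i], y_pred[:, i]),
            "wacc": balanced_accuracy_score(y_true[:, i], y_pred[:, i]),
            "f1": f1_score(y_true[:, i], y_pred[:, i]),
        })
    # Average over emotions, matching the Avg columns in the tables.
    avg = {k: float(np.mean([r[k] for r in rows])) for k in ("acc", "wacc", "f1")}
    return rows, avg
```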

Test results on CH-SIMS

Binary accuracy (Acc2), F1 score, mean absolute error (MAE), and Pearson correlation (Corr). The Annotation column indicates which labels are used during training: the multimodal label only (M), or the multimodal label together with unimodal audio, text, and visual labels (A, T, V).

Model Annotation Acc2 F1 MAE Corr
EF-LSTM [30] M 69.37 81.91 59.34 -4.39
MFN [31] M 77.86 78.22 45.19 55.18
MulT [3] M 77.94 79.10 48.45 55.94
LF-DNN [28] M 79.87 80.20 42.01 61.23
MLF-DNN [27] M, A, T, V 82.28 82.52 40.64 67.47
LMF [29] M 79.34 79.96 43.99 60.00
MLMF [27] M, A, T, V 82.32 82.66 42.03 63.13
TFN [28] M 80.66 81.62 42.52 61.18
MTFN [27] M, A, T, V 82.45 82.56 40.66 66.98
Self-MM [26] M, A, T, V 80.74 80.78 41.90 61.60
Human-MM [26] M, A, T, V 81.32 81.73 40.80 64.70
Ours M, A, T, V 83.37 83.15 37.61 68.04
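The CH-SIMS metrics are regression-style. Below is a hedged sketch of how Acc2, F1, MAE, and Pearson correlation are commonly computed from continuous sentiment predictions; the binarization at zero and the weighted F1 variant are assumptions about the usual protocol, not this project's evaluation code.

```python
# Hedged sketch: CH-SIMS-style metrics from continuous sentiment predictions.
# Binarizing at 0 (negative vs. non-negative) is an assumption about the protocol.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score, f1_score

def chsims_metrics(y_true, y_pred):
    # y_true / y_pred are 1-D numpy arrays of continuous sentiment scores.
    mae = float(np.mean(np.abs(y_true - y_pred)))
    corr = float(pearsonr(y_true, y_pred)[0])
    true_bin, pred_bin = (y_true >= 0).astype(int), (y_pred >= 0).astype(int)
    acc2 = accuracy_score(true_bin, pred_bin)
    f1 = f1_score(true_bin, pred_bin, average="weighted")
    return {"Acc2": acc2, "F1": f1, "MAE": mae, "Corr": corr}
```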