Patania, S., D’Amelio, A., Lanzarotti, R. (2022). Exploring Fusion Strategies in Deep Multimodal Affect Prediction. In Image Analysis and Processing – ICIAP 2022 21st International Conference, Lecce, Italy, May 23–27, 2022, Proceedings, Part II (pp. 730-741). Springer Verlag [10.1007/978-3-031-06430-2_61].

Exploring Fusion Strategies in Deep Multimodal Affect Prediction

Patania, S. (first author)
2022

Abstract

In this work, we explore the effectiveness of multimodal models for estimating the emotional state expressed continuously in the Valence/Arousal space. We consider four modalities typically adopted for emotion recognition, namely audio (voice), video (facial expression), electrocardiogram (ECG), and electrodermal activity (EDA), and investigate different combinations of them. To this aim, a CNN-based feature-extraction module is adopted for each of the considered modalities, and an RNN-based module is used to model the dynamics of the affective behaviour. Fusion is performed in three different ways: at the feature level (after CNN feature extraction), at the model level (combining the RNN layers' outputs), and at the prediction level (late fusion). Results obtained on the publicly available RECOLA dataset demonstrate that the use of multiple modalities improves prediction performance. The best results are achieved by exploiting the contribution of all the considered modalities and employing late fusion, but even combinations of two modalities (especially audio and video) bring significant benefits.
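As a purely illustrative complement to the abstract, the sketch below shows, in PyTorch-style code, where the three fusion points (feature level, model level, prediction level) sit in a per-modality CNN-to-RNN pipeline. All module names, layer sizes, channel counts, and the averaging rule used for late fusion are assumptions made for this sketch; they are not taken from the paper.

```python
# Minimal sketch of feature-, model- and prediction-level fusion for
# multimodal valence/arousal regression. Shapes and hyperparameters are
# illustrative assumptions, not the configuration used in the paper.
import torch
import torch.nn as nn


class ModalityEncoder(nn.Module):
    """Hypothetical 1D-CNN feature extractor for one modality."""

    def __init__(self, in_channels, feat_dim=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(in_channels, feat_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=5, padding=2),
            nn.ReLU(),
        )

    def forward(self, x):                     # x: (batch, channels, time)
        return self.cnn(x).transpose(1, 2)    # (batch, time, feat_dim)


class FusionModel(nn.Module):
    """Fuses N modalities at feature, model or prediction level."""

    def __init__(self, in_channels_list, feat_dim=64, hidden_dim=64,
                 fusion="late"):
        super().__init__()
        self.fusion = fusion
        self.encoders = nn.ModuleList(
            ModalityEncoder(c, feat_dim) for c in in_channels_list)
        n = len(in_channels_list)
        if fusion == "feature":
            # Feature-level: one RNN over the concatenated CNN features.
            self.rnn = nn.GRU(feat_dim * n, hidden_dim, batch_first=True)
            self.head = nn.Linear(hidden_dim, 2)       # valence, arousal
        else:
            # Model- and prediction-level: one RNN per modality.
            self.rnns = nn.ModuleList(
                nn.GRU(feat_dim, hidden_dim, batch_first=True)
                for _ in range(n))
            if fusion == "model":
                self.head = nn.Linear(hidden_dim * n, 2)
            else:                                       # "late"
                self.heads = nn.ModuleList(
                    nn.Linear(hidden_dim, 2) for _ in range(n))

    def forward(self, inputs):      # list of (batch, channels, time) tensors
        feats = [enc(x) for enc, x in zip(self.encoders, inputs)]
        if self.fusion == "feature":
            h, _ = self.rnn(torch.cat(feats, dim=-1))
            return self.head(h)
        states = [rnn(f)[0] for rnn, f in zip(self.rnns, feats)]
        if self.fusion == "model":
            # Model-level: concatenate the per-modality RNN outputs.
            return self.head(torch.cat(states, dim=-1))
        # Prediction-level (late) fusion: average per-modality predictions.
        preds = [head(s) for head, s in zip(self.heads, states)]
        return torch.stack(preds).mean(dim=0)
```

For instance, a hypothetical call such as `FusionModel([40, 3, 1, 1], fusion="late")([audio, video, ecg, eda])` would return per-frame valence/arousal predictions averaged over the four single-modality heads; swapping `fusion` to `"feature"` or `"model"` moves the fusion point earlier in the pipeline.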
Document type: paper
Keywords: Deep learning; Multimodal emotion recognition; Multimodal fusion
Language: English
Conference: Image Analysis and Processing – ICIAP 2022 21st International Conference, May 23–27, 2022
Conference year: 2022
Editors: Sclaroff, S.; Distante, C.; Leo, M.; Farinella, G.M.; Tombari, F.
Proceedings: Image Analysis and Processing – ICIAP 2022 21st International Conference, Lecce, Italy, May 23–27, 2022, Proceedings, Part II
ISBN: 9783031064296
Publication year: 2022
Series/Volume: LNCS 13232
Pages: 730-741
Rights: reserved
Files in this product:
File: Patania-2022-ICIAP 2022-VoR.pdf (Adobe PDF, 1.15 MB)
Attachment type: Publisher's Version (Version of Record, VoR)
License: All rights reserved
Access: archive administrators only (a copy may be requested)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/553715
Citations
  • Scopus: 2
  • Web of Science (ISI): 1