Are we all good actors? A study on the feasibility of generalizing speech emotion recognition models

Grossi A.; Mattia R.; Gasparini F.
2024

Abstract

Speech is one of the most natural ways for humans to express emotions. Emotions also play a pivotal role in human-machine interaction, enabling more natural communication between the user and the system. In recent years, this has led to a growing interest in Speech Emotion Recognition (SER). Most SER classification models are tied to specific domains and do not generalize easily to other situations or use cases. For instance, most of the datasets available in the literature consist of acted utterances collected from English-speaking adults and are therefore not easily generalizable to other languages or age groups. In this context, defining a SER system that can be easily generalized to new subjects or languages has become a topic of considerable importance. The main aim of this article is to analyze the challenges and limitations of using acted datasets to define a general model. As a preliminary analysis, two different pre-processing and feature extraction pipelines were evaluated for SER models that recognize emotions from three well-known acted datasets. The model that achieved the best performance was then applied to new data collected in more realistic environments. The training dataset is Emozionalmente, a large Italian acted dataset collected from non-professional actors through a crowdsourcing platform. This model was tested on two subsets of the Italian speech emotion dataset SER_AMPEL to evaluate its performance in a more realistic context. The first subset comprises audio clips from movies and TV series performed by older adult dubbers, while the second consists of natural conversations among individuals of different ages. The analysis of the performance and results highlights the main difficulties and challenges in generalizing a model trained on acoustic features to new real-world data. In particular, this preliminary analysis has shown the limits of using acted datasets to recognize emotions in real environments.
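
To make the kind of cross-corpus evaluation described in the abstract concrete, the sketch below shows a minimal SER baseline in Python: hand-crafted acoustic features (here, MFCC statistics extracted with librosa) feed a classifier trained on one acted corpus and evaluated on a different, more realistic corpus. This is an illustrative assumption, not the pipeline used in the paper; the data-loading helpers, paths, and hyperparameters are placeholders.

# Minimal cross-corpus SER sketch (illustrative only; the paper's actual
# pre-processing and feature-extraction pipelines are not detailed here).
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report


def acoustic_features(wav_path, sr=16000, n_mfcc=13):
    """Summarize an utterance with mean/std of MFCCs (a common acoustic baseline)."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])


def build_matrix(items):
    """items: list of (wav_path, emotion_label) pairs from one corpus."""
    X = np.vstack([acoustic_features(path) for path, _ in items])
    y = np.array([label for _, label in items])
    return X, y


def cross_corpus_eval(train_items, test_items):
    """Train on an acted corpus (e.g. Emozionalmente) and test on a more
    realistic one (e.g. a SER_AMPEL subset). `train_items`/`test_items`
    are hypothetical (path, label) lists the caller must prepare."""
    X_tr, y_tr = build_matrix(train_items)
    X_te, y_te = build_matrix(test_items)
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    clf.fit(X_tr, y_tr)
    print(classification_report(y_te, clf.predict(X_te)))

In this setup, a drop in test performance relative to within-corpus results is the signal of limited generalization from acted to real-world speech that the paper investigates.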
Type: paper
Keywords: acoustic features; acted dataset; cross age SER; cross-corpus SER; model generalization; speech emotion recognition
Language: English
Event: Third Workshop on Artificial Intelligence for Human-Machine Interaction (AIxHMI 2024), co-located with the 23rd International Conference of the Italian Association for Artificial Intelligence (AI*IA 2024) - November 26, 2024
Event year: 2024
Editors: Saibene, A.; Corchs, S.; Fontana, S.; Solé-Casals, J.
Published in: Proceedings of the Third Workshop on Artificial Intelligence for Human-Machine Interaction (AIxHMI 2024) co-located with the 23rd International Conference of the Italian Association for Artificial Intelligence (AI*IA 2024)
Publication year: 2024
CEUR volume: 3903
Pages: 83-92
URL: https://ceur-ws.org/Vol-3903/
Access: open
Citation: Grossi, A., Milella, A., Mattia, R., Gasparini, F. (2024). Are we all good actors? A study on the feasibility of generalizing speech emotion recognition models. In Proceedings of the Third Workshop on Artificial Intelligence for Human-Machine Interaction (AIxHMI 2024) co-located with the 23rd International Conference of the Italian Association for Artificial Intelligence (AI*IA 2024) (pp. 83-92). CEUR-WS.
Files in this record:
File: Grossi-2024-AIxHMI-CEUR-WS-VoR.pdf (open access)
Description: This volume and its papers are published under the Creative Commons License Attribution 4.0 International (CC BY 4.0).
Attachment type: Publisher's Version (Version of Record, VoR)
License: Creative Commons
Size: 242.26 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/545061
Citations
  • Scopus: 1
  • Web of Science: not available