Grossi, A., Fratti, G., Gasparini, F. (2023). A computational framework for speech emotion recognition in case of multisource data. In Proceedings of the 4th Italian Workshop on Artificial Intelligence for an Ageing Society co-located with 22nd International Conference of the Italian Association for Artificial Intelligence (AIxIA 2023) (pp. 113-126). CEUR-WS.
A computational framework for speech emotion recognition in case of multisource data
Grossi A.;Fratti G.;Gasparini F.
2023
Abstract
Although several studies have been carried out in the field of Speech Emotion Recognition (SER), only a few of them consider people of different ages or languages. In particular, most of the SER datasets reported in the literature are collected from young adults or cover a single language, such as English or Chinese. These datasets tend to lack heterogeneity and to depend on the context in which they were collected. In general, they are composed of acted utterances or are recorded in situations specifically designed to evoke certain emotions. This paper proposes a framework that exploits complementary information from multisource data to train a general SER model. To merge the different sources, we describe preprocessing steps that normalize the data with respect to the source, the type of recorded speech, and the subjects considered. Furthermore, we present a domain adaptation strategy that builds on the general model by adapting it to a specific language and/or a specific age group. In particular, we are interested in developing SER models for Italian older adults. Preliminary results, obtained using several sources for training and a different language as the test set, confirm the validity of the proposal.

File | Size | Format
---|---|---
Grossi-2023-AIxAS-VoR.pdf (open access). Description: Conference contribution - AIxAS 2023 paper 11. Attachment type: Publisher’s Version (Version of Record, VoR). License: Creative Commons | 1.09 MB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.