Static, Dynamic and Acceleration Features for CNN-Based Speech Emotion Recognition

Khalifa I.; Ejbali R.; Napoletano P.; Schettini R.; Zaied M.
2022

Abstract

Speech emotion recognition is a significant source of information, especially when other channels, such as the face or body, are hidden. The shape of the vocal tract, the tone of the voice, the pitch, and other characteristics are influenced by human emotions. In this paper, we propose the use of static, dynamic and acceleration features, which are very effective in encoding those characteristics of speech that are influenced by human emotions. These features are based on the concatenation of three global measures of the Mel-frequency Cepstral Coefficients (MFCCs) (the static part) and of the first (the dynamic part) and second derivatives (the acceleration part) of the MFCCs. The features are processed with a custom 1-D CNN designed by the authors specifically for emotion recognition. Experiments are performed on two publicly available speech datasets containing audio files from different people and languages and covering several emotions. On average, our approach outperforms the state of the art on both datasets.
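The feature construction described in the abstract (global statistics of the MFCCs concatenated with the same statistics of their first and second time derivatives) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the MFCC matrix is a random placeholder (in practice it would come from an MFCC extractor such as librosa), the derivatives are approximated by finite differences, and the three global measures (mean, standard deviation, maximum) are an assumption, since the record does not specify which measures the paper uses.

```python
import numpy as np

# Hypothetical stand-in for an MFCC matrix (n_coeffs x n_frames);
# in practice this would come from an MFCC extractor such as librosa.
rng = np.random.default_rng(0)
mfcc = rng.standard_normal((13, 100))

# Dynamic (delta) and acceleration (delta-delta) parts as first and
# second time derivatives, approximated here by finite differences.
delta = np.gradient(mfcc, axis=1)
delta2 = np.gradient(delta, axis=1)

def global_measures(feat):
    # Three illustrative global statistics per coefficient (the exact
    # measures used in the paper are not given in this record):
    # mean, standard deviation, and maximum over time.
    return np.concatenate([feat.mean(axis=1),
                           feat.std(axis=1),
                           feat.max(axis=1)])

# Concatenate static, dynamic and acceleration descriptors into one
# fixed-length vector suitable as input to a 1-D CNN.
feature_vector = np.concatenate([global_measures(mfcc),
                                 global_measures(delta),
                                 global_measures(delta2)])
print(feature_vector.shape)  # (117,) = 3 parts x 3 measures x 13 coeffs
```

The resulting fixed-length vector is what a 1-D CNN would consume, regardless of the original utterance duration.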
paper
CNNs; Static, dynamic and acceleration features; Spectral features; Speech emotion recognition
English
20th International Conference of the Italian Association for Artificial Intelligence, AIxIA 2021 - 1 December 2021 through 3 December 2021
2021
Bandini, S; Gasparini, F; Mascardi, V; Palmonari, M; Vizzari, G
AIxIA 2021 – Advances in Artificial Intelligence 20th International Conference of the Italian Association for Artificial Intelligence, Virtual Event, December 1–3, 2021, Revised Selected Papers
978-3-031-08420-1
2022
13196 LNAI
348
358
none
Khalifa, I., Ejbali, R., Napoletano, P., Schettini, R., Zaied, M. (2022). Static, Dynamic and Acceleration Features for CNN-Based Speech Emotion Recognition. In AIxIA 2021 – Advances in Artificial Intelligence 20th International Conference of the Italian Association for Artificial Intelligence, Virtual Event, December 1–3, 2021, Revised Selected Papers (pp.348-358). Springer [10.1007/978-3-031-08421-8_24].
Files in this product:
There are no files associated with this product.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/414278