Online platforms have increasingly become hotspots for spreading not only opinions but also hate speech, posing substantial obstacles to the development of constructive and inclusive online communities. In this paper, we propose a novel approach that leverages the integrated gradients of pre-trained language models to automatically predict both hate speech and the potential disagreement that can arise among readers. The integrated gradient attributions shed light on the model's decision-making process by assigning importance scores to individual tokens, enabling the identification of the key factors that contribute to disagreement and hate speech classifications. The simplicity of integrated gradients makes it possible to recognize the fundamental causes of disagreement and hateful content. By adopting an interpretable approach, we bridge the gap between model predictions and human comprehension. Our experimental results highlight the effectiveness of our approach, which outperforms traditional BERT models and state-of-the-art methods in both prediction tasks.
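
As a rough illustration of how integrated gradients assign importance scores to individual tokens, the sketch below applies the technique to a toy logistic classifier with one scalar feature per token. It is not the authors' implementation: the paper works with pre-trained language models, whereas the model, weights, feature values, all-zero baseline, and function names here are hypothetical and chosen only to keep the example self-contained.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x, w, b):
    """Toy classifier (assumption): probability from a linear score over token features."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def model_grad(x, w, b):
    """Analytic gradient of the toy model with respect to each input feature."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]

def integrated_gradients(x, baseline, w, b, steps=200):
    """Approximate integrated gradients with a midpoint Riemann sum
    along the straight-line path from the baseline to the input."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        grad = model_grad(point, w, b)
        total = [t + g for t, g in zip(total, grad)]
    # Scale the averaged gradient by the input-baseline difference.
    return [(xi - bi) * t / steps for xi, bi, t in zip(x, baseline, total)]

# Hypothetical input: one scalar feature per token.
tokens = ["you", "are", "awful"]
x = [0.2, 0.1, 1.5]
baseline = [0.0, 0.0, 0.0]   # all-zero reference input (assumption)
w = [0.3, -0.1, 2.0]         # made-up weights, not learned from data
b = -1.0

attributions = integrated_gradients(x, baseline, w, b)
for token, score in zip(tokens, attributions):
    print(f"{token}: {score:+.4f}")
```

The attributions sum approximately to the difference between the model output at the input and at the baseline (the completeness property of integrated gradients), which gives a quick sanity check on the approximation.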

Astorino, A., Rizzi, G., Fersini, E. (2023). Integrated Gradients as Proxy of Disagreement in Hateful Content. In Proceedings of the 9th Italian Conference on Computational Linguistics, Nov 30 — Dec 02, 2023, Venice, Italy (pp.1-7). CEUR-WS.

Integrated Gradients as Proxy of Disagreement in Hateful Content

Astorino A. (first author); Rizzi G. (second author); Fersini E. (last author)
2023

Type: paper
Keywords: Hateful Content; Integrated Gradients; Learning with Disagreement
Language: English
Conference: 9th Italian Conference on Computational Linguistics (CLiC-it 2023), November 30 - December 2, 2023
Editors: Boschetti, F; Lebani, GE; Magnini, B; Novielli, N
Published in: Proceedings of the 9th Italian Conference on Computational Linguistics, Nov 30 - Dec 02, 2023, Venice, Italy
Year: 2023
Volume: 3596
Pages: 1-7
URL: https://ceur-ws.org/Vol-3596/
Use this identifier to cite or link to this document: https://hdl.handle.net/10281/590163