Astorino, A., Rizzi, G., Fersini, E. (2023). Integrated Gradients as Proxy of Disagreement in Hateful Content. In Proceedings of the 9th Italian Conference on Computational Linguistics, Nov 30 - Dec 2, 2023, Venice, Italy (pp. 1-7). CEUR-WS.
Integrated Gradients as Proxy of Disagreement in Hateful Content
Astorino A.; Rizzi G.; Fersini E.
2023
Abstract
Online platforms have increasingly become hotspots for spreading not only opinions but also hate speech, posing substantial obstacles to developing constructive and inclusive online communities. In this paper, we propose a novel approach that leverages the integrated gradients of pre-trained language models to automatically predict both hate speech and the potential disagreement it can arouse among readers. The integrated-gradient attributions shed light on the model's decision-making process, assigning importance scores to individual tokens and enabling the identification of the factors that drive disagreement and hate speech classifications. The simplicity of integrated gradients allows the fundamental causes of disagreement and hateful content to be recognized. By adopting an interpretable approach, we bridge the gap between model predictions and human comprehension. Our experimental results highlight the effectiveness of our approach, which outperforms traditional BERT models and state-of-the-art methods on both prediction tasks.
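The integrated-gradients attribution the abstract refers to can be sketched numerically. The sketch below is only illustrative: the toy scoring function `f`, the input `x`, and the all-zeros baseline are assumptions standing in for a language model's logit and token embeddings, which is where the paper actually applies the method.

```python
# Minimal numerical sketch of Integrated Gradients:
# IG_i = (x_i - x'_i) * integral over alpha in [0,1] of
#        d f / d x_i evaluated at x' + alpha * (x - x'),
# approximated with a Riemann sum. The function f below is a toy
# stand-in for a classifier logit, not the paper's model.

def grad(f, x, i, eps=1e-6):
    """Central-difference partial derivative of f with respect to x[i]."""
    xp, xm = list(x), list(x)
    xp[i] += eps
    xm[i] -= eps
    return (f(xp) - f(xm)) / (2 * eps)

def integrated_gradients(f, x, baseline, steps=200):
    """Riemann-sum approximation of the IG attribution for each input."""
    n = len(x)
    grad_sums = [0.0] * n
    for k in range(1, steps + 1):
        alpha = k / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i in range(n):
            grad_sums[i] += grad(f, point, i)
    return [(x[i] - baseline[i]) * grad_sums[i] / steps for i in range(n)]

# Toy inputs (illustrative assumptions).
f = lambda v: v[0] * v[0] + 2.0 * v[1]
x, baseline = [1.0, 3.0], [0.0, 0.0]
ig = integrated_gradients(f, x, baseline)
# Completeness property: the attributions sum (approximately)
# to f(x) - f(baseline).
```

The completeness property checked at the end is what makes the per-token scores interpretable: each token's attribution is its share of the gap between the model's output on the input and on the baseline.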


