Bicocca Open Archive

In today's digital landscape, users frequently share vast amounts of information, including confidential data, often without being fully aware of the associated privacy risks. This highlights the need for automated methods that identify sensitive information and alert users to such risks. Existing algorithmic solutions for detecting sensitive content typically require either human intervention (rule-based approaches) or labeled data (supervised learning), both of which can be costly and limiting. In this paper, we propose a framework based on Retrieval-Augmented Generation (RAG) that classifies privacy-sensitive content and provides contextual explanations. We combine the state-of-the-art generative Large Language Model (LLM) GPT-4o with the Information Retrieval (IR) models BM25 and FAISS to enhance both detection accuracy and explainability. Our method uses a curated Knowledge Base of scientific literature on privacy and confidentiality to retrieve contextually relevant information, which then guides the classification process and the generation of explanations. Experimental evaluations on a real-world dataset (the Enron Email Dataset) demonstrate that the RAG-based approaches significantly outperform the zero-shot baseline, with BM25 achieving the highest performance. The tool is designed to serve end users by mitigating risks before data is shared and by enabling proactive monitoring of privacy violations.
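The retrieve-then-classify flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the knowledge-base passages, the prompt wording, the labels, and the helper names (`tokenize`, `BM25`, `build_prompt`) are all assumptions made for illustration, and the call to the LLM is left out, so the sketch stops once the prompt is assembled. It uses a plain Okapi BM25 scorer written from scratch rather than any particular library.

```python
import math
from collections import Counter

PUNCT = ".,;:!?()\"'"


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    stripped = (t.strip(PUNCT) for t in text.lower().split())
    return [t for t in stripped if t]


class BM25:
    """Plain Okapi BM25 over a small in-memory corpus (illustrative only)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = docs
        self.k1, self.b = k1, b
        self.tokens = [tokenize(d) for d in docs]
        self.avgdl = sum(len(t) for t in self.tokens) / len(self.tokens)
        # Document frequency: number of documents containing each term.
        df = Counter(term for t in self.tokens for term in set(t))
        n = len(docs)
        self.idf = {
            term: math.log(1 + (n - f + 0.5) / (f + 0.5)) for term, f in df.items()
        }

    def score(self, query, i):
        """BM25 score of document i for the given query string."""
        tf = Counter(self.tokens[i])
        dl = len(self.tokens[i])
        total = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            norm = tf[term] + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * tf[term] * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k=2):
        """Return the k highest-scoring documents, best first."""
        ranked = sorted(
            range(len(self.docs)), key=lambda i: self.score(query, i), reverse=True
        )
        return [self.docs[i] for i in ranked[:k]]


def build_prompt(text, passages):
    """Assemble a classification prompt; the wording is a hypothetical example."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Context from the privacy knowledge base:\n"
        f"{context}\n\n"
        "Classify the following text as SENSITIVE or NOT_SENSITIVE "
        "and explain the decision using the context.\n"
        f"Text: {text}"
    )


# Hypothetical knowledge-base passages, invented for this sketch.
KB = [
    "Financial account numbers and salary details are personal data that can enable fraud.",
    "Health records reveal medical conditions and are a special category of personal data.",
    "Meeting schedules without personal details are generally low risk.",
]

retriever = BM25(KB)
email = "Please find attached the salary details and account numbers for the new hires."
passages = retriever.top_k(email, k=2)
prompt = build_prompt(email, passages)
print(prompt)
```

In a full pipeline the assembled prompt would be sent to the generative model, and a dense index could replace the BM25 scorer; how the paper chunks its Knowledge Base and phrases its prompts is not stated in this record.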

Locci, S., Audrito, D., Livraga, G., Viviani, M., Di Caro, L. (2025). Leveraging RAG for Privacy Violation Detection and Explainability. In 2025 International Joint Conference on Neural Networks (IJCNN) (pp. 1-7). Institute of Electrical and Electronics Engineers Inc. [10.1109/IJCNN64981.2025.11228403].

Leveraging RAG for Privacy Violation Detection and Explainability

Locci S.; Audrito D.; Livraga G.; Viviani M.; Di Caro L.

2025

Contribution type: slide + paper
Keywords: Information Retrieval (IR); Knowledge Bases (KBs); Large Language Models (LLMs); Privacy; Retrieval-Augmented Generation (RAG)
Content language: English
Conference name: 2025 International Joint Conference on Neural Networks (IJCNN), 30 June 2025 to 5 July 2025
Conference year: 2025
Proceedings title: 2025 International Joint Conference on Neural Networks (IJCNN)
Proceedings ISBN: 9798331510428
Series: PROCEEDINGS OF ... INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS
Publication date: 2025
First page: 1
Last page: 7
DOI: https://dx.doi.org/10.1109/IJCNN64981.2025.11228403
Full text: none
Appears in types: 02 - Conference contribution

Files in this record: there are no files associated with this record.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/583941
