Abdaljalil, S., Pallucchini, F., Seveso, A., Kurban, H., Mercorio, F., & Serpedin, E. (2025). SAFE: A Sparse Autoencoder-Based Framework for Robust Query Enrichment and Hallucination Mitigation in LLMs. In Findings of EMNLP 2025, 2025 Conference on Empirical Methods in Natural Language Processing (pp. 9335-9346). Association for Computational Linguistics (ACL). doi:10.18653/v1/2025.findings-emnlp.496
SAFE: A Sparse Autoencoder-Based Framework for Robust Query Enrichment and Hallucination Mitigation in LLMs
Pallucchini, F.; Seveso, A.; Mercorio, F.
2025
Abstract
Despite their state-of-the-art capabilities, Large Language Models (LLMs) often suffer from hallucinations, which can compromise their reliability in critical applications. In this work, we propose SAFE, a novel framework for detecting and mitigating hallucinations by leveraging Sparse Autoencoders (SAEs). While hallucination detection techniques and SAEs have been explored independently, their synergistic application in a comprehensive system, particularly for hallucination-aware query enrichment, has not been fully investigated. To validate the effectiveness of SAFE, we evaluate it on two models with available SAEs across four diverse cross-domain datasets designed to assess hallucination problems. Empirical results demonstrate that SAFE consistently improves query generation accuracy and mitigates hallucinations across all datasets, achieving accuracy improvements of up to 29.45%.


