SAFE: A Sparse Autoencoder-Based Framework for Robust Query Enrichment and Hallucination Mitigation in LLMs

Pallucchini, F.; Seveso, A.; Mercorio, F.
2025

Abstract

Despite their state-of-the-art capabilities, Large Language Models (LLMs) often suffer from hallucinations, which can compromise their reliability in critical applications. In this work, we propose SAFE, a novel framework for detecting and mitigating hallucinations by leveraging Sparse Autoencoders (SAEs). While hallucination detection techniques and SAEs have been explored independently, their synergistic application in a comprehensive system, particularly for hallucination-aware query enrichment, has not been fully investigated. To validate the effectiveness of SAFE, we evaluate it on two models with available SAEs across four diverse cross-domain datasets designed to assess hallucination problems. Empirical results demonstrate that SAFE consistently improves query generation accuracy and mitigates hallucinations across all datasets, achieving accuracy improvements of up to 29.45%.
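For readers unfamiliar with the Sparse Autoencoders (SAEs) the abstract builds on, the following is a minimal, illustrative sketch of the standard SAE recipe (an overcomplete ReLU encoder with an L1 sparsity penalty on the feature activations). It is not the authors' implementation; the dimensions, random weights, and L1 coefficient are arbitrary assumptions, and the input is a stand-in for LLM residual-stream activations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Overcomplete dictionary: hidden layer wider than the input (illustrative sizes).
d_model, d_hidden = 16, 64

# Randomly initialized, untrained weights (for illustration only).
W_enc = rng.normal(0, 0.1, (d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(0, 0.1, (d_hidden, d_model))
b_dec = np.zeros(d_model)

def encode(x):
    # ReLU keeps only positively activated features, encouraging sparse codes.
    return np.maximum(0.0, x @ W_enc + b_enc)

def decode(f):
    # Linear reconstruction of the input from the sparse feature activations.
    return f @ W_dec + b_dec

def sae_loss(x, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that pushes activations toward zero.
    f = encode(x)
    x_hat = decode(f)
    recon = np.mean((x - x_hat) ** 2)
    sparsity = l1_coeff * np.mean(np.abs(f))
    return recon + sparsity

# Stand-in batch for LLM residual activations (the real inputs in SAE work).
x = rng.normal(size=(8, d_model))
print(sae_loss(x))
```

In SAE-based interpretability work, the learned sparse features are inspected or monitored at inference time; SAFE's contribution, per the abstract, is to couple such features with hallucination detection and query enrichment.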
paper
Computational linguistics; Natural language processing systems; Query processing
English
30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, 4-9 November 2025
2025
Christodoulopoulos, C; Chakraborty, T; Rose, C; Peng, V
EMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2025
9798891763357
2025
9335
9346
Abdaljalil, S., Pallucchini, F., Seveso, A., Kurban, H., Mercorio, F., Serpedin, E. (2025). SAFE: A Sparse Autoencoder-Based Framework for Robust Query Enrichment and Hallucination Mitigation in LLMs. In EMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2025 (pp.9335-9346). Association for Computational Linguistics (ACL) [10.18653/v1/2025.findings-emnlp.496].
Files for this record:
No files are associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/594431
Citations
  • Scopus 1
  • Web of Science: N/A