

SFAL: Semantic-Functional Alignment Scores for Distributional Evaluation of Auto-Interpretability in Sparse Autoencoders

Mercorio, F.; Pallucchini, F.; Potertì, D.; Serino, A.; Seveso, A.
2025

Abstract

Interpreting the internal representations of large language models (LLMs) is crucial for their deployment in real-world applications, impacting areas such as AI safety, debugging, and compliance. Sparse Autoencoders facilitate interpretability by decomposing polysemantic activations into a latent space of monosemantic features. However, evaluating the auto-interpretability of these features is difficult and computationally expensive, which limits scalability in practical settings. In this work, we propose SFAL, an alternative evaluation strategy that reduces reliance on LLM-based scoring by assessing the alignment between the semantic neighbourhoods of features (derived from auto-interpretation embeddings) and their functional neighbourhoods (derived from co-occurrence statistics). Our method enhances efficiency, enabling fast and cost-effective assessments. We validate our approach on large-scale models, demonstrating its potential to provide interpretability while reducing computational overhead, making it suitable for real-world deployment.
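The abstract's core idea — comparing each feature's semantic neighbourhood (nearest features by explanation-embedding similarity) with its functional neighbourhood (nearest features by co-occurrence) — can be sketched as follows. This is a minimal illustration only, not the paper's implementation: the input names `emb` (one embedding vector per feature explanation) and `cooc` (a pairwise co-occurrence matrix), the choice of cosine similarity, and the use of Jaccard overlap as the alignment measure are all assumptions for the sake of the example.

```python
import numpy as np

def knn_sets(sim, k):
    # For each row, return the set of indices of the k most similar
    # other items (the diagonal/self-similarity is excluded).
    sim = sim.astype(float).copy()
    np.fill_diagonal(sim, -np.inf)
    idx = np.argsort(-sim, axis=1)[:, :k]
    return [set(row) for row in idx]

def alignment_score(emb, cooc, k=5):
    # Semantic similarity: cosine between feature-explanation embeddings.
    norm = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sem = norm @ norm.T
    sem_nn = knn_sets(sem, k)          # semantic neighbourhoods
    fun_nn = knn_sets(cooc, k)         # functional neighbourhoods
    # Mean Jaccard overlap between the two neighbourhood sets per feature.
    jac = [len(s & f) / len(s | f) for s, f in zip(sem_nn, fun_nn)]
    return float(np.mean(jac))
```

A score near 1 would indicate that features whose explanations are semantically close also tend to fire together, while a score near 0 would indicate the two views disagree; the paper's actual SFAL scoring may use different similarity and alignment definitions.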
paper
SAE; AI; NLP; LLM
English
The 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025) - from November 4th to November 9th, 2025
2025
Potdar, S.; Rojas-Barahona, L.; Montella, S.
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
9798891763333
2025
576
583
https://aclanthology.org/2025.emnlp-industry.39/
none
Mercorio, F., Pallucchini, F., Potertì, D., Serino, A., Seveso, A. (2025). SFAL: Semantic-Functional Alignment Scores for Distributional Evaluation of Auto-Interpretability in Sparse Autoencoders. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track (pp.576-583). Association for Computational Linguistics.
Files in this record:
There are no files associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/574341