Group-SAE: Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups

Ghilardi, Davide; Belotti, Federico; Palmonari, Matteo
2025

Abstract

Sparse AutoEncoders (SAEs) have recently been employed as a promising unsupervised approach for understanding the representations of layers of Large Language Models (LLMs). However, with the growth in model size and complexity, training SAEs is computationally intensive, as typically one SAE is trained for each model layer. To address this limitation, we propose Group-SAE, a novel strategy for training SAEs. Our method measures the similarity of the residual stream representations between contiguous layers, groups similar layers, and trains a single SAE per group. To balance the trade-off between efficiency and performance, we further introduce AMAD (Average Maximum Angular Distance), an empirical metric that guides the selection of an optimal number of groups based on representational similarity across layers. Experiments on models from the Pythia family show that our approach significantly accelerates training with minimal impact on reconstruction quality, and achieves downstream task performance and interpretability comparable to baseline SAEs trained layer by layer. This method provides an efficient and scalable strategy for training SAEs in modern LLMs.
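As a rough illustration of the grouping idea sketched in the abstract (not the authors' implementation, which the paper specifies in full), the code below takes one possible reading of AMAD: the angular distance between per-layer residual-stream representations, maximized within each contiguous group of layers and then averaged over groups. The helper names (angular_distance, contiguous_groups, amad), the use of a single mean activation vector per layer, and the 0.15 budget are all illustrative assumptions.

import numpy as np

def angular_distance(a, b):
    # Angular distance between two representation vectors, normalized to [0, 1].
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi)

def contiguous_groups(n_layers, n_groups):
    # Split layer indices 0..n_layers-1 into n_groups contiguous chunks.
    return [list(map(int, chunk)) for chunk in np.array_split(np.arange(n_layers), n_groups)]

def amad(layer_reprs, groups):
    # One possible reading of AMAD (assumption): for each group, take the maximum
    # pairwise angular distance between its layers, then average over groups.
    per_group_max = []
    for group in groups:
        dists = [angular_distance(layer_reprs[i], layer_reprs[j])
                 for i in group for j in group if i < j]
        per_group_max.append(max(dists) if dists else 0.0)
    return float(np.mean(per_group_max))

# Toy stand-in for per-layer residual-stream representations: a slow random walk,
# so contiguous layers are more similar to each other than distant ones.
rng = np.random.default_rng(0)
reprs = [rng.normal(size=512)]
for _ in range(11):
    reprs.append(reprs[-1] + 0.3 * rng.normal(size=512))

# Pick the smallest number of groups whose AMAD falls under an illustrative budget.
for k in range(1, len(reprs) + 1):
    groups = contiguous_groups(len(reprs), k)
    if amad(reprs, groups) < 0.15:
        print(f"{k} groups meet the budget: {groups}")
        break

Under this reading, a smaller AMAD budget yields more groups (approaching one SAE per layer), while a larger budget yields fewer groups and proportionally fewer SAEs to train.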
Type: paper
Keywords: Artificial Intelligence, Large Language Models, interpretability, efficient ML
Language: English
Conference: The 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025) - November 4-9, 2025
Conference year: 2025
Editors: Christodoulopoulos, C.; Chakraborty, T.; Rose, C.; Peng, V.
Proceedings: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
ISBN: 9798891763326
Publication year: 2025
Pages: 18668-18688
Access: open
Ghilardi, D., Belotti, F., Molinari, M., Ma, T., Palmonari, M. (2025). Group-SAE: Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (pp.18668-18688) [10.18653/v1/2025.emnlp-main.942].
Files in this record:
File: Ghilardi-2025-EMNLP 2025-VoR.pdf
Access: open access
Attachment type: Publisher's Version (Version of Record, VoR)
License: Creative Commons
Size: 2.7 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/583203