Bicocca Open Archive

Towards Leveraging Large Language Model Summaries for Topic Modeling in Source Code

Carissimi, Michele; Saletta, Martina; Ferretti, Claudio

2025

Abstract

Understanding source code is a topic of great interest in the software engineering community, since it can help programmers in various tasks such as software maintenance and reuse. Recent advances in large language models (LLMs) have demonstrated remarkable program comprehension capabilities, while transformer-based topic modeling techniques offer effective ways to extract semantic information from text. This paper proposes and explores a novel approach that combines these strengths to automatically identify meaningful topics in a corpus of Python programs. Our method consists of applying topic modeling to the descriptions obtained by asking an LLM to summarize the code. To assess the internal consistency of the extracted topics, we compare them against topics inferred from function names alone and against those derived from existing docstrings. Experimental results suggest that leveraging LLM-generated summaries provides an interpretable and semantically rich representation of code structure. These promising results suggest that our approach can be fruitfully applied to various software engineering tasks, such as automatic documentation and tagging, code search, software reorganization, and knowledge discovery in large repositories.
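The abstract compares three textual views of each function: its name, its existing docstring, and an LLM-generated summary. The sketch below is only an illustration of how such views might be assembled before any topic model is applied; it is not the authors' implementation. The names `split_identifier`, `build_views`, and `placeholder_summarizer` are hypothetical, and the summarizer is a stub standing in for a real LLM call. The topic modeling step itself is deliberately left out.

```python
import ast
import re
from typing import Callable


def split_identifier(name: str) -> str:
    """Split a snake_case or camelCase identifier into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return " ".join(part.lower() for part in parts)


def placeholder_summarizer(code: str) -> str:
    """Hypothetical stand-in for an LLM call: it only echoes the signature line."""
    stripped = code.strip()
    first_line = stripped.splitlines()[0].strip() if stripped else ""
    return "Function defined as: " + first_line


def build_views(source: str, summarize: Callable[[str], str]) -> dict[str, list[str]]:
    """Collect one text per function for each of the three views."""
    views: dict[str, list[str]] = {"names": [], "docstrings": [], "summaries": []}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            views["names"].append(split_identifier(node.name))
            # Functions without a docstring contribute an empty string.
            views["docstrings"].append(ast.get_docstring(node) or "")
            code = ast.get_source_segment(source, node) or ""
            views["summaries"].append(summarize(code))
    return views


# Assumed toy input; a real corpus would hold many Python files.
SAMPLE = '''
def read_csv_rows(path):
    """Read a CSV file and return its rows."""
    return []

def parseHTTPHeader(raw):
    return {}
'''

views = build_views(SAMPLE, placeholder_summarizer)
for view_name, texts in views.items():
    print(view_name, texts)
```

Each list in `views` could then be passed as a separate corpus to a topic modeling tool, so that the topics obtained from names, docstrings, and summaries can be compared.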

Contribution type: paper
Keywords: source code analysis; source code concept location; topic modeling; transformers
Content language: English
Conference name: 29th International Conference on Evaluation and Assessment in Software Engineering, June 17-20, 2025
Conference year: 2025
Volume editors: Babar, AM; Tosun, A; Wagner, S; Stray, V
Proceedings title: EASE '25: Proceedings of the 29th International Conference on Evaluation and Assessment in Software Engineering
Proceedings ISBN: 9798400713859
Publication date: 2025
First page: 776
Last page: 781
DOI: https://dx.doi.org/10.1145/3756681.3757026
Full text: open
Citation: Carissimi, M., Saletta, M., Ferretti, C. (2025). Towards Leveraging Large Language Model Summaries for Topic Modeling in Source Code. In EASE '25: Proceedings of the 29th International Conference on Evaluation and Assessment in Software Engineering (pp. 776-781). Association for Computing Machinery, Inc [10.1145/3756681.3757026].
Appears in types: 02 - Conference contribution

Files in this item:

File	Size	Format
Carissimi-2025-EASE-VoR.pdf (open access; attachment type: Publisher’s Version (Version of Record, VoR); license: Creative Commons)	1.5 MB	Adobe PDF

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/590821
