Ascari, R., Giampino, A., Migliorati, S. (2026). Sub-topics detection with extended flexible latent Dirichlet allocation. ADVANCES IN DATA ANALYSIS AND CLASSIFICATION [10.1007/s11634-026-00690-9].
Sub-topics detection with extended flexible latent Dirichlet allocation
Ascari, Roberto; Giampino, Alice; Migliorati, Sonia
2026
Abstract
The rapid expansion of digital data has intensified the need for computational methods capable of analyzing complex latent structures across a variety of domains, including textual data. Latent topic models, particularly latent Dirichlet allocation (LDA), are widely used to uncover latent structures in large text corpora. However, the Dirichlet prior on topic proportions imposes structural limitations that reduce the model’s ability to capture complex dependencies among topics. In this paper, we introduce the extended flexible latent Dirichlet allocation (EFLDA), a probabilistic model that extends LDA by allowing richer patterns of dependence among topics. The enriched parametrization of EFLDA improves the model’s ability to represent complex thematic structures, leading to greater interpretability in real-world settings. Furthermore, we introduce the concept of sub-topics, defined as specific combinations of topics that provide a deeper understanding of corpora. We develop a collapsed Gibbs sampler for efficient inference and conduct an extensive evaluation on both synthetic data and multiple real-world applications, including mental health discourse, news articles, and microbiome data. Empirical results show that EFLDA outperforms classical LDA and recent alternative approaches in terms of topic coherence, sub-topic detection, and interpretability, while remaining robust across heterogeneous data settings characterized by complex and overlapping latent structures.

| File | Size | Format |
|---|---|---|
| Ascari et al-2026-Adv Data Anal Classif-VoR.pdf (open access). Description: EFLDA_ADAC. Attachment type: Publisher’s Version (Version of Record, VoR). License: Creative Commons | 4.31 MB | Adobe PDF |