
A novel methodology to make topic models predict real topics and to compare them in big data corpus

GERLI, SILVIO
2024

Abstract

One common task in Natural Language Processing (NLP) is topic identification, which involves recognizing the topic(s) of a text. Among the automatic solutions (in contrast to engines developed and maintained by linguistic experts), there are two main approaches: Statistical Learning Models (SLM), trained on supervised datasets and capable of identifying real topics, and Topic Models (TM), capable of identifying latent topics in unsupervised corpora of documents. In topic identification research, it is always challenging to find a high-quality training dataset with a known mixture of topics for each text, where the topics come from a taxonomy covering all possible subjects. A dataset of this kind, preferably extensive and easily updatable, would be of enormous value for training supervised models and for validating the results of various types of models. Furthermore, TMs have proven highly effective in numerous tests since the introduction of the Latent Dirichlet Allocation (LDA) model. While many variants and advancements have been developed in recent years, they all face two issues. Firstly, it is difficult to comprehend the "meaning" of the identified latent topics; to address this, several methods for labeling latent topics have been proposed. Secondly, comparing different TMs is tricky because there is no direct relationship between the topics of one model and those of another; consequently, it has so far only been possible to rely on "self-referential" indicators or on manual verification.

This PhD research proposes several novel methodologies addressing these three challenges: two methodologies for creating a large corpus of documents with a well-defined mix of topics, four methods for labeling latent topics using this corpus with a supervised approach, and six metrics for evaluating the performance of topic models in this context. These three advancements enable the main contribution of this research: a rigorous methodological framework for comparing different TMs in a common and objective "arena", allowing quantitative performance comparisons, particularly of their ability to accurately identify the actual mix of real topics in documents.

Several experiments have been conducted to validate the effectiveness of this approach. Firstly, the proposed methodology was extensively compared, on the ability to identify topics in unknown documents, against random models on one side and supervised statistical learning models on the other; the results confirm that it yields reliable outcomes. Secondly, four TMs, the aforementioned LDA, the Correlated Topic Model (CTM), the Hierarchical Dirichlet Process (HDP), and the Pachinko Allocation Model (PAM), were compared by measuring how well they identify real topics under the proposed methodology, assessed with both classical classification indicators (accuracy, precision, and recall) and all of the new metrics proposed in this work. Last but not least, as a byproduct, a new SLM based on a TM has been developed, capable of competing with established ones; it could serve as a viable alternative, given its low computational demands and the additional information it produces, which can be valuable for refining taxonomies. Finally, the hyperparameters of the best TM emerging from the comparison tests were tuned, and with those optimal settings a test on a large dataset of 6 million documents was conducted.
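To make the core idea of this evaluation framework concrete, the sketch below builds a toy corpus whose per-document mix of real topics is known by construction, fits an LDA model, labels the latent topics by a simple majority vote against the known real topics, and scores the predictions with accuracy, precision, and recall. This is a minimal sketch only, assuming gensim and scikit-learn: the toy vocabularies, the word-sampling scheme, and the majority-vote labeling rule are illustrative stand-ins, not the two corpus-construction methodologies, four labeling methods, or six metrics actually developed in the thesis.

```python
# Sketch: build a corpus with a known per-document topic mix, fit LDA,
# label the latent topics with real topics, and score the result.
# All names and rules here are illustrative assumptions.
import random
from collections import Counter

from gensim.corpora import Dictionary
from gensim.models import LdaModel
from sklearn.metrics import accuracy_score, precision_score, recall_score

random.seed(0)

# Two "real" topics with disjoint toy vocabularies.
VOCAB = {
    "sports": ["goal", "match", "team", "league", "striker", "coach"],
    "politics": ["election", "vote", "party", "senate", "law", "reform"],
}

def make_doc(mix, n_words=30):
    """Sample a document from a known mixture over real topics."""
    topics, weights = zip(*mix.items())
    return [random.choice(VOCAB[t]) for t in random.choices(topics, weights, k=n_words)]

# Corpus with a well-defined topic mix per document; the dominant real
# topic serves as the gold label for classification-style scoring.
mixes = [{"sports": 0.9, "politics": 0.1}] * 20 + [{"sports": 0.1, "politics": 0.9}] * 20
texts = [make_doc(m) for m in mixes]
labels = [max(m, key=m.get) for m in mixes]

dictionary = Dictionary(texts)
bows = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(bows, num_topics=2, id2word=dictionary, passes=50, random_state=0)

def dominant_topic(bow):
    """Latent topic with the highest inferred probability for a document."""
    return max(lda.get_document_topics(bow), key=lambda p: p[1])[0]

# Supervised labeling of latent topics: majority vote of the gold labels
# over the documents that each latent topic dominates.
votes = {}
for bow, label in zip(bows, labels):
    votes.setdefault(dominant_topic(bow), Counter())[label] += 1
topic2label = {t: c.most_common(1)[0][0] for t, c in votes.items()}

# Score the labeled topic model as if it were a classifier.
pred = [topic2label[dominant_topic(bow)] for bow in bows]
print("accuracy :", accuracy_score(labels, pred))
print("precision:", precision_score(labels, pred, pos_label="sports"))
print("recall   :", recall_score(labels, pred, pos_label="sports"))
```

Because the gold mix is known for every document, any other TM (CTM, HDP, PAM) could be dropped into the same harness and scored on identical ground, which is the kind of common, objective "arena" the abstract describes.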
BORROTTI, MATTEO
Topic identification; Topic modelling; Mix of topics; Big Data; LDA
SECS-S/01 - STATISTICS
English
21-Feb-2024
35
2022/2023
open
(2024). A novel methodology to make topic models predict real topics and to compare them in big data corpus. (Doctoral thesis, Università degli Studi di Milano-Bicocca, 2024).
Files in this product:

File: phd_unimib_854308.pdf
Description: Thesis by Silvio Gerli - student ID 854308
Attachment type: Doctoral thesis
Access: open access
Size: 2.97 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/461858