The amount of information produced every day is steadily increasing. Extracting knowledge from this information is becoming a key concern for many companies and institutions, which spend a great deal of effort on document management and organization with barely adequate results. In this dissertation, we are mainly concerned with probabilistic graphical models for knowledge extraction and performance estimation. In particular, we present a set of models that could improve information mining by relieving the user of tedious tasks and offering efficient ways to manage, classify, tag and retrieve documents. We adopt a Bayesian hierarchical model, Latent Dirichlet Allocation, to extract meaningful topics from a given collection, and a naïve Bayes classifier to tag new documents in a multi-label scenario. Moreover, we present a novel and sound technique for the evaluation of topic models, based on a probabilistic measure derived from a state-of-the-art performance index, the Fowlkes-Mallows index, for which we derive a representation based on the hypergeometric distribution. Finally, we identify a hybrid approach for searching semantic repositories enriched by textual sources, which eases information gathering by exploiting state-of-the-art information retrieval techniques.
Magatti, D. (2011). Graphical models for text mining: knowledge extraction and performance estimation. (Doctoral thesis, Università degli Studi di Milano-Bicocca, 2011).
Graphical models for text mining: knowledge extraction and performance estimation
MAGATTI, DAVIDE
2011
File | Access | Attachment type | Size | Format
---|---|---|---|---
phd_unimib_041819.pdf | Open access | Doctoral thesis | 1.37 MB | Adobe PDF