Magatti, D. (2011). Graphical models for text mining: knowledge extraction and performance estimation. (Doctoral thesis, Università degli Studi di Milano-Bicocca, 2011).

Graphical models for text mining: knowledge extraction and performance estimation

MAGATTI, DAVIDE
2011

Abstract

The amount of information produced every day is steadily increasing. Extracting knowledge from this information is becoming a key concern for many companies and institutions, which spend a great deal of effort on document management and organization with barely adequate results. This dissertation is mainly concerned with probabilistic graphical models for knowledge extraction and performance estimation. In particular, we present a set of models that can improve information mining by relieving the user of tedious tasks and by offering efficient ways to manage, classify, tag and retrieve documents. We adopt a Bayesian hierarchical model, Latent Dirichlet Allocation, to extract meaningful topics from a given collection, and a naïve Bayes classifier to tag new documents in a multi-label scenario. Moreover, we present a novel and sound technique for evaluating topic models, based on a probabilistic measure derived from a state-of-the-art performance index, the Fowlkes-Mallows index, for which we derive a representation based on the hypergeometric distribution. Finally, we identify a hybrid approach to searching semantic repositories enriched by textual sources, which eases information gathering by exploiting state-of-the-art information retrieval techniques.
STELLA, FABIO ANTONIO
SCHETTINI, RAIMONDO
text mining, bayesian generative models, Latent Dirichlet Allocation, document management, probabilistic metrics, hybrid search, semantic graphs
INF/01 - INFORMATICA
English
8 February 2011
Doctoral School of Sciences
INFORMATICA - 22R
23
2009/2010
Part of the dissertation derives from joint work with Siemens AG - Corporate Technology - Munich.
open
Files in this item:
phd_unimib_041819.pdf
Access: open access
Attachment type: Doctoral thesis
Size: 1.37 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/19576