A software system for topic discovery and document tagging is described. The system discovers the topics hidden in a given document collection, labels them according to a user-supplied taxonomy, and tags new documents. It implements an information processing pipeline consisting of document preprocessing, topic extraction, automatic topic labeling, and multi-label document classification. The preprocessing module imports several kinds of documents and offers three document representations: binary, term frequency, and term frequency-inverse document frequency. The topic extraction module is implemented through a proprietary version of the Latent Dirichlet Allocation model. The optimal number of topics is selected through hierarchical clustering. The topic labeling module optimizes a set of similarity measures defined over the user-supplied taxonomy; it is implemented through an algorithm over a topic tree. The document tagging module solves a multi-label classification problem through multi-net Naïve Bayes, without the need to perform any learning tasks.
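The three document representations named in the abstract can be illustrated with a minimal sketch. It is not the system's preprocessing module: the function names, the toy documents, and the unsmoothed weighting idf = log(N / df) are assumptions made for illustration, and the chapter may use a different term frequency-inverse document frequency variant.

```python
import math
from collections import Counter


def build_vocabulary(docs):
    """Sorted list of the distinct terms in a tokenized collection."""
    return sorted({term for doc in docs for term in doc})


def binary_vector(doc, vocab):
    """1 if the term occurs in the document, 0 otherwise."""
    present = set(doc)
    return [1 if term in present else 0 for term in vocab]


def tf_vector(doc, vocab):
    """Raw count of each vocabulary term in the document."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]


def tfidf_vector(doc, vocab, docs):
    """Term count times log(N / df); an assumed, unsmoothed variant."""
    n_docs = len(docs)
    counts = Counter(doc)
    vector = []
    for term in vocab:
        # Document frequency: number of documents containing the term.
        df = sum(1 for d in docs if term in d)
        idf = math.log(n_docs / df) if df else 0.0
        vector.append(counts[term] * idf)
    return vector


# Hypothetical toy collection, already tokenized.
docs = [
    ["topic", "model", "topic"],
    ["document", "tagging", "model"],
]
vocab = build_vocabulary(docs)
binary = binary_vector(docs[0], vocab)
tf = tf_vector(docs[0], vocab)
tfidf = tfidf_vector(docs[0], vocab, docs)
```

In this sketch a term that occurs in every document, such as "model", receives a weight of zero under the assumed weighting, while the binary and term frequency representations still record it.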

Magatti, D., Stella, F. (2011). Probabilistic Topic Discovery and Automatic Document Tagging. In R. Brena, A. Guzman (Eds.), Quantitative Semantics and Soft Computing Methods for the Web Perspectives and Applications (pp. 25-50). Information Science Pub [10.4018/978-1-60960-881-1].

Probabilistic Topic Discovery and Automatic Document Tagging

STELLA, FABIO ANTONIO
2011

Chapter or essay
Topic models; document tagging; Bayesian learning; text mining
English
Quantitative Semantics and Soft Computing Methods for the Web Perspectives and Applications
Brena, R; Guzman, A
2011
9781609608811
Information Science Pub
25
50


Use this identifier to cite or link to this document: https://hdl.handle.net/10281/25316