This paper considers the problem of externally validating the semantic coherence of topic models. The Fowlkes-Mallows index, a well-known clustering validation metric, is generalized to overlapping partitions and multi-labeled collections, which makes it suitable for validating topic modeling algorithms. In addition, we propose new probabilistic metrics inspired by the concepts of recall and precision. The proposed metrics have clear probabilistic interpretations and can also be applied to validate and compare other soft and overlapping clustering algorithms. The approach is illustrated by validating LDA models on the Reuters-21578 multi-labeled collection and then using Monte Carlo simulations to show convergence to the predicted results. Additional statistical evidence is provided to clarify the relationship among the metrics presented.

Ramirez, E., Brena, R., Magatti, D., Stella, F. (2010). Probabilistic Metrics for Soft-Clustering and Topic Model Validation. In Proceedings - 2010 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2010 (pp. 406-412). IEEE [10.1109/WI-IAT.2010.148].

Probabilistic Metrics for Soft-Clustering and Topic Model Validation

STELLA, FABIO ANTONIO
2010

Type: paper
Keywords: Text Mining, Bayesian learning, Latent Dirichlet Allocation
Language: English
Conference: 2010 IEEE/WIC/ACM International Conferences
Conference year: 2010
Proceedings: Proceedings - 2010 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2010
ISBN: 978-0-7695-4191-4
Publication year: 2010
1
Pages: 406-412
5616623
none

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/18547