Pinoli, P., Chicco, D., Masseroli, M. (2014). Latent Dirichlet allocation based on Gibbs sampling for gene function prediction. In Proceedings of the International Conference on Computational Intelligence in Bioinformatics and Computational Biology – CIBCB 2014 (pp. 1-8). IEEE Computer Society [10.1109/CIBCB.2014.6845514].

Latent Dirichlet allocation based on Gibbs sampling for gene function prediction

Chicco, D.
2014

Abstract

Gene function annotations are key elements in biology and bioinformatics. A typical annotation associates a gene with a controlled vocabulary term, such as a Gene Ontology (GO) feature term, that describes one of the gene's functional features. Unfortunately, the available annotations contain errors, and the biologically validated ones are incomplete by definition, since new knowledge is continuously discovered. Computational algorithms able to provide ranked lists of predicted new gene annotations are therefore a valuable contribution to bioinformatics research. Here, we propose two variants of the well-known Latent Dirichlet Allocation (LDA) algorithm applied to the prediction of gene annotations. LDA is a very efficient machine learning method built on two sets of multinomial probability distributions: distributions over topics, given a document (a gene, in our case), and distributions over words (feature terms, in our case), given a topic. In topic modeling, a topic can be considered a latent meta-category of words, and a document a mixture of topics. Our two LDA variants use the collapsed Gibbs sampling method during the training phase, with two distinct initialization approaches that adapt the LDA mathematical model to the biomolecular annotation scenario. Using six outdated datasets of GO annotations of human and brown rat genes, we compared the annotations predicted by our methods with those given by the previously developed truncated Singular Value Decomposition (tSVD) method; we then validated them against the annotations available in an updated version of the same datasets. The results obtained show the efficiency of our newly proposed algorithms.
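As an illustration of the model described in the abstract, the following is a minimal sketch of standard LDA trained with collapsed Gibbs sampling on a binary gene-by-term annotation matrix. It is not the authors' implementation: the function names (`lda_collapsed_gibbs`, `rank_candidates`), the hyperparameter values, the toy matrix, and the uniform random initialization are all assumptions made for the example, and the two initialization approaches proposed in the paper are not reproduced here because the abstract does not describe them.

```python
import numpy as np

def lda_collapsed_gibbs(A, n_topics=2, alpha=0.1, beta=0.1, n_iter=200, seed=0):
    """Score every (gene, term) pair with LDA fitted by collapsed Gibbs sampling.

    A is a binary gene-by-term matrix; each 1 is treated as one word token
    (a feature term) in one document (a gene).
    """
    rng = np.random.default_rng(seed)
    n_genes, n_terms = A.shape
    genes, terms = np.nonzero(A)          # one entry per known annotation
    n_tokens = len(genes)

    # Assumption: uniform random initial topic for each annotation.
    z = rng.integers(n_topics, size=n_tokens)

    # Count tables: gene-topic, topic-term, and per-topic totals.
    n_gk = np.zeros((n_genes, n_topics))
    n_kt = np.zeros((n_topics, n_terms))
    n_k = np.zeros(n_topics)
    for i in range(n_tokens):
        n_gk[genes[i], z[i]] += 1
        n_kt[z[i], terms[i]] += 1
        n_k[z[i]] += 1

    for _ in range(n_iter):
        for i in range(n_tokens):
            g, t, k = genes[i], terms[i], z[i]
            # Remove the current assignment from the counts.
            n_gk[g, k] -= 1
            n_kt[k, t] -= 1
            n_k[k] -= 1
            # Conditional topic distribution for this annotation.
            p = (n_gk[g] + alpha) * (n_kt[:, t] + beta) / (n_k + n_terms * beta)
            k = rng.choice(n_topics, p=p / p.sum())
            # Record the newly sampled assignment.
            z[i] = k
            n_gk[g, k] += 1
            n_kt[k, t] += 1
            n_k[k] += 1

    # Point estimates of the topic mixture per gene and term distribution per topic.
    theta = (n_gk + alpha) / (n_gk.sum(axis=1, keepdims=True) + n_topics * alpha)
    phi = (n_kt + beta) / (n_k[:, None] + n_terms * beta)
    return theta @ phi                    # gene-by-term score matrix

def rank_candidates(A, scores, top=5):
    """Return the highest-scoring (gene, term, score) triples not yet in A."""
    cand = [(scores[g, t], g, t) for g, t in zip(*np.nonzero(A == 0))]
    cand.sort(reverse=True)
    return [(int(g), int(t), float(s)) for s, g, t in cand[:top]]

# Hypothetical toy data: 4 genes, 6 feature terms.
A = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
])
scores = lda_collapsed_gibbs(A)
predictions = rank_candidates(A, scores, top=3)
print(predictions)
```

Each row of the score matrix is a probability distribution over feature terms for one gene, so sorting the pairs that are absent from the input matrix yields a ranked list of candidate annotations of the kind the abstract refers to.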
Type: paper
Keywords: latent Dirichlet allocation; gene function prediction; gene function annotations; gene-feature term association; gene functional feature; controlled vocabulary term; gene ontology feature term; LDA algorithm; gene annotation prediction; machine learning method; multinomial probability distributions; feature terms; latent word metacategory; LDA variants; collapsed Gibbs sampling method; truncated singular value decomposition; tSVD comparison
Language: English
Conference: 2014 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology, CIBCB 2014, 21 May 2014 through 24 May 2014
Year: 2014
Editors: Corns S; Huang D-S
Book title: Proceedings of the International Conference on Computational Intelligence in Bioinformatics and Computational Biology – CIBCB 2014
ISBN: 9781479945368
Pages: 1-8
Article number: 6845514
Access: reserved
Files in this item:
File: Pinoli-2014-CIBCB-VoR.pdf
Access: Archive managers only
Description: Conference contribution
Attachment type: Publisher’s Version (Version of Record, VoR)
License: All rights reserved
Size: 513.98 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/435460
Citations
  • Scopus: 28
  • Web of Science (ISI): 6