Two of the main bioinformatics fields that have been influenced by the introduction of the Next-Generation Sequencing (NGS) techniques are transcriptomics and metagenomics. The adoption of these new methods to sequence DNA/RNA molecules has drastically changed the kind and also the amount of produced data. The effect is that all the developed algorithms and tools working on traditional data cannot be applied on NGS data. For this reason, in this thesis we face two central problems in two fields: transcriptmics and metagenomics. The first one regards the characterization of the Alternative Splicing (AS) events starting from NGS sequences coming from transcripts (called RNA-Seq reads). To this aim we have modeled the structure of a gene, with respect to the AS variations occurring in it, by using a graph representation (called splicing graph). More specifically, we have identified the conditions for the correct reconstruction of the splicing graph, starting from RNA-Seq data, and we have realized an algorithm for its construction. Moreover, our method is able to correct reconstruct the splicing graph even when the input RNA-Seq reads do not respect the identified conditions. Finally, we have performed an experimental analysis of our procedure in order to validated the obtained results. The second problem we face in this thesis is the assignment of NGS read, coming from a metagenomic sample, to a reference taxonomic tree, in order to assess the composition of the sample and classify the unknown micro-organisms in it. This is done by aligning the reads to the taxonomic tree and then choosing (when there are more valid matches) the node that best represents the read. This choice is based on the calculation of a Penalty Score (PS) function for all the nodes descending from the lowest common ancestor of the valid matches in the tree. We have realized an optimal algorithm for the computation of the PS function, based on the so called skeleton tree, which improve the performances of the taxonomic assignment procedure. We have also implemented the method by using more efficient data structures, with respect to the one used in the previous version of the procedure. Finally, we have offered the possibility to switch among different taxonomies by developing a method to map trees and translate the input alignments.
(2013). Algorithms for next generation sequencing data analysis. (Tesi di dottorato, Università degli Studi di Milano-Bicocca, 2013).
Algorithms for next generation sequencing data analysis
BERETTA, STEFANO
2013
Abstract
Two of the main bioinformatics fields that have been influenced by the introduction of the Next-Generation Sequencing (NGS) techniques are transcriptomics and metagenomics. The adoption of these new methods to sequence DNA/RNA molecules has drastically changed the kind and also the amount of produced data. The effect is that all the developed algorithms and tools working on traditional data cannot be applied on NGS data. For this reason, in this thesis we face two central problems in two fields: transcriptmics and metagenomics. The first one regards the characterization of the Alternative Splicing (AS) events starting from NGS sequences coming from transcripts (called RNA-Seq reads). To this aim we have modeled the structure of a gene, with respect to the AS variations occurring in it, by using a graph representation (called splicing graph). More specifically, we have identified the conditions for the correct reconstruction of the splicing graph, starting from RNA-Seq data, and we have realized an algorithm for its construction. Moreover, our method is able to correct reconstruct the splicing graph even when the input RNA-Seq reads do not respect the identified conditions. Finally, we have performed an experimental analysis of our procedure in order to validated the obtained results. The second problem we face in this thesis is the assignment of NGS read, coming from a metagenomic sample, to a reference taxonomic tree, in order to assess the composition of the sample and classify the unknown micro-organisms in it. This is done by aligning the reads to the taxonomic tree and then choosing (when there are more valid matches) the node that best represents the read. This choice is based on the calculation of a Penalty Score (PS) function for all the nodes descending from the lowest common ancestor of the valid matches in the tree. We have realized an optimal algorithm for the computation of the PS function, based on the so called skeleton tree, which improve the performances of the taxonomic assignment procedure. We have also implemented the method by using more efficient data structures, with respect to the one used in the previous version of the procedure. Finally, we have offered the possibility to switch among different taxonomies by developing a method to map trees and translate the input alignments.File | Dimensione | Formato | |
---|---|---|---|
Phd_unimib_057617.pdf
accesso aperto
Tipologia di allegato:
Doctoral thesis
Dimensione
2.63 MB
Formato
Adobe PDF
|
2.63 MB | Adobe PDF | Visualizza/Apri |
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.