Combinatorial problems in studies of genetic variations: haplotyping and transcript analysis

Pirola, Y

After the first draft of the human genome has been published, the study of genetic variations has attracted increasing attention as a medium to map observable characteristics (phenotypic traits) of individuals to their underlying genes and biological processes. Unfortunately, observing and obtaining directly the genetic data of interest is a long and expensive operation, especially for large populations. Less informative sources of data are instead available at a fraction of cost. Computational methods are then called to recover the original data of interest from their less informative observations. The design of efficient combinatorial algorithms to infer relevant data for genetic variation studies starting from less informative observations is the main aim of the work of this thesis. In particular, we focused on two different kinds of data of crucial importance for genetic variation studies: haplotypes and transcripts. In the first part of the thesis, we studied the Haplotype Inference (HI) problem that requires to infer, for each individual of a population, the genetic sequences inherited from each parent (haplotypes) starting from their conflation (genotype). We formulated and analyzed two different computational problems of haplotype inference: the pure parsimony xor-haplotyping problem (PPXH) and the minimum event haplotype configuration problem (MinEHC). In the first problem, the haplotype inference process is guided by the pure parsimony criterion and it starts from a recently proposed representation of genotypes, called xor-genotypes. For this problem, we presented exact and approximate solutions by providing (i) polynomial time exact algorithms for some restricted cases, (ii) a fixed-parameter algorithm for the general case, and (iii) a polynomial time k-approximation algorithm, where k is the maximum number of heterozygous individuals at a given marker locus. These results are based on some interesting combinatorial properties of a graph representation of the solutions. Furthermore, we proposed a heuristic algorithm that can scale to real-world large instances, as shown in an extensive experimental evaluation. In the second HI problem, we assumed that parental relationships among members of the population are known. In the thesis we formulated and studied the HI problem on such populations under the assumption that the two most frequent kinds of variation events, namely recombinations and mutations, can occur. This formulation significantly extends previous works where only one of them (or none) was allowed. We proved that the problem, called MinEHC, is computationally hard to solve or to approximate even on simple instances. However, we provided a heuristic algorithm that has shown extremely good performances, both in terms of accuracy and running times, on several simulated scenarios. The heuristic algorithm is based on a L-reduction from MinEHC to a well-known coding problem, Decoding of Linear Codes (DLC). By (heuristically) solving DLC, we are able to recover the set of genetic variations that have occurred in the population and, then, to infer the set of haplotypes. In the second part of the thesis, we analyzed the problem of predicting the gene structure by aligning transcript fragments to a reference genomic sequence. Traditional basic approaches focus on computing high-quality transcript alignments that should reveal the gene structure. However, inherent ambiguities in the transcript alignments could heavily affect the prediction quality. In the thesis, instead, we formulated the gene structure prediction problem as the problem of choosing the simplest gene structure that can “explain” ambiguous alignments of several transcript fragments. We proposed two fast algorithmic solutions for the two combinatorial problems that stem from such formulation. The first algorithm efficiently computes all the possible (and biologically meaningful) alignments of a given transcript against a reference genomic sequence. The second algorithm, given all the possible alignments of a set of transcripts, extracts a single factorization for each transcript such that the induced gene structure is composed by the minimum number of expressed regions (exons). We also conducted an experimental evaluation of our strategy on a significant set of genes, and promising preliminary results have emerged even in absence of specific biological criteria.

Bicocca Open Archive