The genome of any individual in the human population is characterized by a unique complement of genomic variants distinguishing its DNA from any other. As such, studying the relationship between genomic variants and observable traits in human individuals is important for many applications in medicine. To identify these variants, current sequencing technologies produce a huge amount of DNA fragments, called reads. Unfortunately, these reads do not offer a complete and exact view of the DNA sequence since they are orders of magnitude shorter than the source, contain errors, and are obtained by considering millions of different cells together. As such, the inference of genomic variants from sequencing reads is one of the main problems in computational biology, a branch of computer science that designs algorithms for answering biological questions. The work of my thesis belongs to this context and is hence focused on the inference of genomic variants from sequencing reads. I adopt an approach based on combinatorial optimization to introduce new problem formulations and related algorithms by exploiting characteristics specific to the context. The first problem concerns the assembly of the two haplotypes contained in each cell of a human individual. A haplotype corresponds to the set of DNA sequences inherited from each parent. Since the two haplotypes comprise different genomic variants, their reconstruction is crucial for characterizing the genome of an individual: Single-Nucleotide Polymorphisms (SNPs) are the most common form of genomic variants between the two haplotypes. Haplotype Assembly is the approach that aims to reconstruct the two haplotypes and their SNPs from sequencing reads. First, in this thesis I study the parameterized tractability and approximability of traditional combinatorial problems for this approach, especially the Minimum Error Correction (MEC). In fact, methods based of these frameworks revealed preliminary good results on real data. Next, I introduce a new combinatorial formulation of MEC by exploiting the characteristics of reads produced by “future-sequencing” technologies, such as the longer reads and the uniform distribution of sequencing errors. For this problem, I design a dynamic-programming algorithm, HapCol, that outperforms current state-of-the-art methods on both simulated and real data. The second problem concerns the quantification of intra-tumor heterogeneity. Cancer results from an evolutionary process where somatic mutations accumulate in different cells during the lifetime of an individual. As such, a tumor comprises distinct subpopulations of cells, or clones, sharing a unique complement of genomic variants. The identification of these clones, their proportions, and their evolutionary history is crucial to both diagnosis and prognosis of cancer. Here, I focus on Copy-Number Aberrations (CNAs) that are mutations amplifying or deleting the copies of genomic segments. In this thesis, I introduce new combinatorial formulations of problems that aim to quantify the distinct clones in terms of CNAs and to infer their evolution from sequencing reads of multiple heterogeneous samples comprising different cells of different clones. I study the computational complexity of these problems and I design algorithms, based on integer-linear programming and coordinate-descent approach. On simulated data, I show that these algorithms scale on simulated instances of practical size, whereas on a prostate-cancer dataset they offer a high-resolution view of the tumor clones in term of CNAs, higher than previous analyses.

Il genoma di ogni individuo nella popolazione umana è caratterizzato univocamente da un insieme di varianti genomiche che distinguono il suo DNA da ogni altro. Perciò, studiare la relazione tra queste varianti e le differenze osservabili negli individui è fondamentale per molte applicazioni mediche. Per identificare queste varianti, le tecnologie di sequenziamento producono un’enorme quantità di frammenti di DNA. Tuttavia, questi frammenti danno solo una visione approssimativa del DNA di un individuo, perché sono molto più corti, contengono diversi errori e sono ottenuti considerando milioni di cellule contemporaneamente. Di conseguenza, l’inferenza di varianti genomiche da questi frammenti è uno dei problemi principali in biologia computazionale, una branca dell’informatica che progetta algoritmi per affrontare problemi derivanti da dati biologici. Il mio lavoro di ricerca in questa tesi si colloca in questo contesto e si focalizza su due problemi che riguardano, difatti, l’inferenza di varianti genomiche da dati di sequenziamento. Per questi problemi ho adottato un approccio basato su ottimizzazione combinatoria per il quale ho introdotto nuove formulazioni combinatorie e relativi algoritmi, sfruttato caratteristiche specifiche del contesto. Il primo problema riguarda la ricostruzione dei due aplotipi di un individuo, ognuno dei quali corrisponde all’insieme di sequenze di DNA ereditate da ognuno dei due genitori. La loro ricostruzione è fondamentale per caratterizzare il genoma umano, visto che i due aplotipi sono distinti da diverse varianti genomiche. La forma piu’ frequente di varianti genomiche negli aplotipi sono i Single-Nucleotice Polymorphisms (SNPs). L’assemblaggio di aplotipi è l’approccio che cerca di ricostruire gli aplotipi e i loro SNPs a partire da frammenti di sequenziamento. Anzitutto, in questa tesi mi sono occupato di studiare la complessità parametrica e l’approssimazione di tradizionali formulazioni combinatorie di questo problema, tra cui la correzione del minimo numero di errori (MEC). Infatti, metodi basati su questi approcci hanno mostrato ottimi risultati preliminari su dati reali. Inoltre, sfruttando le caratteristiche dei frammenti prodotti dalle nuove tecnologie, quali la lunghezza aumentata e la distribuzione uniforme degli errori, ho introdotto una nuova formulazione combinatoria del problema. Per questa, ho progettato un nuovo algoritmo, chiamato HapCol, che si è rivelato in grado di migliorare i risultati ottenuti da metodi tradizionali sia su dati simulati che reali. Il secondo problema riguarda l’identificazione e quantificazione dei cloni di un singolo tumore. Infatti, un tumore deriva da un processo evolutivo in cui nuove mutazioni vengono accumulate durante la vita di un individuo in diverse cellule, risultando in distinte sottopopolazioni di cellule, chiamate cloni, ognuna delle quali è caratterizzata da una combinazione unica di varianti genomiche. L’identificazione di questi cloni, delle loro proporzioni e della loro storia evolutiva si è rivelato di fondamentale importanza sia per la diagnosi che la prognosi di tumori. In particolare, mi sono focalizzato su mutazioni frequenti nei tumori, chiamate Copy-Number Aberrations (CNAs) che amplificano o eliminano copie di segmenti genomici. Perciò, ho introdotto nuove formulazioni combinatorie che mirano ad identificare i diversi cloni distinti da queste mutazioni e la loro evoluzione a partire da frammenti di sequenziamento ottenuti da più campioni dello stesso tumore e contenenti cellule di cloni diversi. Ho studiato la loro complessita’ computazionale e ho disegnato algoritmi, basati su programmazione lineare intera e su strategie di “discesa delle coordinate”, che si sono rivelati in grado di gestire istanze di dimensioni reali su dati simulati, mentre, su dati derivanti da un tumore alla prostata, hanno offerto una visione ad alta risoluzione dei suoi cloni in termini di CNAs.

(2017). Inferring Genomic Variants and their Evolution: Combinatorial Optimization for Haplotype Assembly and Quantification of Intra-Tumor Heterogeneity. (Tesi di dottorato, Università degli Studi di Milano-Bicocca, 2017).

Inferring Genomic Variants and their Evolution: Combinatorial Optimization for Haplotype Assembly and Quantification of Intra-Tumor Heterogeneity

ZACCARIA, SIMONE
2017

Abstract

The genome of any individual in the human population is characterized by a unique complement of genomic variants distinguishing its DNA from any other. As such, studying the relationship between genomic variants and observable traits in human individuals is important for many applications in medicine. To identify these variants, current sequencing technologies produce a huge amount of DNA fragments, called reads. Unfortunately, these reads do not offer a complete and exact view of the DNA sequence since they are orders of magnitude shorter than the source, contain errors, and are obtained by considering millions of different cells together. As such, the inference of genomic variants from sequencing reads is one of the main problems in computational biology, a branch of computer science that designs algorithms for answering biological questions. The work of my thesis belongs to this context and is hence focused on the inference of genomic variants from sequencing reads. I adopt an approach based on combinatorial optimization to introduce new problem formulations and related algorithms by exploiting characteristics specific to the context. The first problem concerns the assembly of the two haplotypes contained in each cell of a human individual. A haplotype corresponds to the set of DNA sequences inherited from each parent. Since the two haplotypes comprise different genomic variants, their reconstruction is crucial for characterizing the genome of an individual: Single-Nucleotide Polymorphisms (SNPs) are the most common form of genomic variants between the two haplotypes. Haplotype Assembly is the approach that aims to reconstruct the two haplotypes and their SNPs from sequencing reads. First, in this thesis I study the parameterized tractability and approximability of traditional combinatorial problems for this approach, especially the Minimum Error Correction (MEC). In fact, methods based of these frameworks revealed preliminary good results on real data. Next, I introduce a new combinatorial formulation of MEC by exploiting the characteristics of reads produced by “future-sequencing” technologies, such as the longer reads and the uniform distribution of sequencing errors. For this problem, I design a dynamic-programming algorithm, HapCol, that outperforms current state-of-the-art methods on both simulated and real data. The second problem concerns the quantification of intra-tumor heterogeneity. Cancer results from an evolutionary process where somatic mutations accumulate in different cells during the lifetime of an individual. As such, a tumor comprises distinct subpopulations of cells, or clones, sharing a unique complement of genomic variants. The identification of these clones, their proportions, and their evolutionary history is crucial to both diagnosis and prognosis of cancer. Here, I focus on Copy-Number Aberrations (CNAs) that are mutations amplifying or deleting the copies of genomic segments. In this thesis, I introduce new combinatorial formulations of problems that aim to quantify the distinct clones in terms of CNAs and to infer their evolution from sequencing reads of multiple heterogeneous samples comprising different cells of different clones. I study the computational complexity of these problems and I design algorithms, based on integer-linear programming and coordinate-descent approach. On simulated data, I show that these algorithms scale on simulated instances of practical size, whereas on a prostate-cancer dataset they offer a high-resolution view of the tumor clones in term of CNAs, higher than previous analyses.
BONIZZONI, PAOLA
LEPORATI, ALBERTO OTTAVIO
RAPHAEL, BENJAMIN
evolution; haplotype; cancer; heterogeneity; copy-number
evolution; haplotype; cancer; heterogeneity; copy-number
INF/01 - INFORMATICA
English
27-mar-2017
INFORMATICA - 87R
29
2015/2016
open
(2017). Inferring Genomic Variants and their Evolution: Combinatorial Optimization for Haplotype Assembly and Quantification of Intra-Tumor Heterogeneity. (Tesi di dottorato, Università degli Studi di Milano-Bicocca, 2017).
File in questo prodotto:
File Dimensione Formato  
phd_unimib_718549.pdf

accesso aperto

Descrizione: tesi di dottorato
Tipologia di allegato: Doctoral thesis
Dimensione 5.89 MB
Formato Adobe PDF
5.89 MB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/10281/151631
Citazioni
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
Social impact