COMBINATORIAL METHODS FOR BIOLOGICAL DATA

Bernardini, G

The main goal of this thesis is to develop new algorithmic frameworks for (i) a convenient representation of a set of similar genomes and (ii) phylogenetic data, with particular attention to the increasingly accurate tumor phylogenies. A “pan-genome” is, in general, any collection of genomic sequences to be analyzed jointly or to be used as a reference for a population. A phylogeny, in turn, describes the evolutionary relationships among a group of items, be they species of living beings, genes, natural languages, ancient manuscripts or cancer cells. With the exception of one result, related to the analysis of tumor phylogenies, the focus of the work is mainly theoretical: the intent is to lay firm algorithmic foundations for the problems by investigating their combinatorial aspects, rather than to provide practical tools for attacking them. Deep theoretical insight into the problems allows a rigorous analysis of existing methods, identifying their strengths and weaknesses, providing details on how they perform and helping to decide which problems need to be addressed further. In addition, new theoretical results (algorithms, data structures and reductions to other well-studied problems) can often be directly applied or adapted to fit the model of a practical problem, or at least serve as inspiration for developing new practical tools. The first part of this thesis is devoted to methods for handling an elastic-degenerate text, a computational object that compactly encodes a collection of similar texts, such as a pan-genome. Specifically, we address the problem of matching a sequence in an elastic-degenerate text, both exactly and allowing a fixed number of errors, and the problem of comparing two degenerate texts.
In the second part we consider both tumor phylogenies, which describe the evolution of a tumor, and “classical” phylogenies, which represent, for instance, the evolutionary history of living beings. In particular, we present new techniques to compare two or more tumor phylogenies, needed to evaluate the results of different inference methods, and we give a new, efficient solution to a long-standing problem on “classical” phylogenies: deciding whether, in the presence of missing data, a set of species can be arranged in a phylogenetic tree that enjoys specific properties.
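
To make the notion of an elastic-degenerate text concrete, the following is a minimal sketch, assuming the common representation of such a text as a sequence of segments, each a set of alternative strings that may include the empty string. The names `expand` and `occurs` and the sample text are hypothetical and chosen only for illustration. The brute-force search enumerates every encoded string, which takes exponential time in general; it is not the method developed in the thesis.

```python
from itertools import product


def expand(ed_text):
    """Yield every plain string encoded by an elastic-degenerate text.

    ed_text is a list of segments; each segment is a list of alternative
    strings (the empty string models a deletion).
    """
    for choice in product(*ed_text):
        yield "".join(choice)


def occurs(ed_text, pattern):
    """Brute-force exact matching: does the pattern occur in any encoded string?"""
    return any(pattern in s for s in expand(ed_text))


# Hypothetical example: the second segment has three variants, one of them empty.
ed_text = [["A"], ["C", "GT", ""], ["T"], ["A", "C"]]

if __name__ == "__main__":
    print(sorted(expand(ed_text)))
    print(occurs(ed_text, "GTT"), occurs(ed_text, "CC"))
```

The sample text has four segments but encodes six distinct strings, which illustrates why a compact representation of a collection of similar sequences is attractive and why matching directly on it, without expansion, is the interesting problem.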

The aim of this thesis is to develop and analyze mathematically rigorous methods for the analysis of two types of biological data: pan-genome data and phylogenies. The term “pan-genome” denotes, in general, a set of closely related genomic sequences (typically belonging to individuals of the same species) that are to be used jointly as reference sequences for an entire population. A phylogeny, by contrast, represents the evolutionary relationships within a group of entities, be they living beings, genes, natural languages, ancient manuscripts or tumor cells. With the exception of one of the results presented in this thesis, concerning the analysis of tumor phylogenies, the approach of the dissertation is predominantly theoretical: the aim is to study the combinatorial aspects of the problems addressed, rather than to provide solutions that are effective in practice. A thorough understanding of the theoretical aspects of a problem, moreover, allows a mathematically rigorous analysis of existing solutions, identifying their weak and strong points, providing valuable details on how they work and helping to decide which problems should be investigated further. Furthermore, new theoretical results (algorithms, data structures or reductions to other, better-known problems) can often be directly applied or adapted as a solution to a practical problem, or at the very least inspire the development of new methods that are effective in practice. The first part of the thesis is devoted to new methods for performing fundamental operations on an elastic-degenerate text, a computational object that compactly encodes a set of mutually similar texts, such as a pan-genome. Specifically, it addresses the problem of searching for a sequence of letters in an elastic-degenerate text, both exactly and tolerating a fixed number of errors, and the problem of comparing two degenerate texts.
The second part considers both tumor phylogenies, which reconstruct the evolution of a tumor, and “classical” phylogenies, which represent, for example, the evolutionary history of living species. In particular, it presents new techniques for comparing two or more tumor phylogenies, needed to evaluate the results of different methods for reconstructing such phylogenies, and a new, more efficient solution to a long-standing problem on “classical” phylogenies: determining whether, in the presence of missing data, a set of species can be arranged in a phylogenetic tree with given properties.

Bernardini, G. (2021). COMBINATORIAL METHODS FOR BIOLOGICAL DATA. (Doctoral thesis, Università degli Studi di Milano-Bicocca, 2021).