Self-indexing for de novo assembly

Previtali, M

For at least three decades computer science approaches have proved to be of utmost importance for extrapolating knowledge from biological data. Indeed, the amount of data produced by sequencing technologies has increased more than exponentially, outpacing Moore's law and making automatic analysis the only viable option. Raw biological data produced by sequencing technologies can easily be represented using computer science techniques. Indeed, for many applications, DNA and RNA can be represented as one-dimensional sequences of nucleobases that can be represented as a word which characters belong to an extremely small alphabet of size 4. The representation, analysis, and storage of words and texts is a widely studied field in computer science in which such sequences are usually referred as strings. Close collaboration between biologists and computer scientists is unavoidable and extremely beneficial for both fields. Indeed, on one end the former pose computational challenges to the latter stimulating and motivating research in the field whereas, on the other end, the latter help the former to extract biological meaning from a huge and unfathomable muddle of data. One of the main problems tackled by computer scientists in this field is the Sequence Assembly Problem (SAP, either with reference or de novo) which goal is to infer the original sequence S from a set of much more shorter substrings of S, called reads. At the core of most of the assemblers lies a graph data structure that represents some kind of relation between reads or part of them. The two most well known and used graphs in this field are de Bruijn graphs and String graphs. This thesis focuses on this field and, more precisely, proposes methods for representing, building, and analyzing such graphs. This thesis proposes: (i) methods to store and query de Bruijn graphs and (ii) methods to analyze String graphs using succinct space. Our first contribution is a new self-index for a de Bruijn graph based on Minimal Perfect Hashing. The main contribution of this method is a fully dynamic data structure that represents a de Bruijn graph in succinct space that allows to add and remove both edges and nodes without rebuilding it. State-of-the-art data structures for storing a de Bruijn are semi-dynamic and do not allow for removing nodes and edges. We present a theoretical analysis of this method and provide a formal analysis of the properties of it. The second contribution of this thesis is a self-index for de Bruijn graphs that can represent multiple de Bruijn graphs and that allows to analyze them efficiently. Such index is loosely inspired by state-of-the-art self-indexes such as the bidirectional Burrows-Wheeler Transform. We present a theoretical analysis of this method and provide possible applications of it in genome assembly. In this thesis we will also focus on efficient methods for analyzing and constructing String graphs. We will formalize a new framework for String graph-based assemblers in which String graphs nodes and edges are represented as intervals of the BWT requiring constant space to be stored improving the state of the art. We will therefore present a theoretical analysis of such framework and we will then study and show how this framework can be used to design String graph construction algorithms that either require low main memory or low time by external memory algorithms and multi-threaded approaches, respectively. We will also provide an implementation of both these approaches and perform an extensive experimental evaluation of the tools in comparison with state-of-the-art assemblers. The experimental evaluation shows that the framework we propose and its implementations allow to design tools that can run efficiently either on simple machines or on extremely powerful server showing the pliability of our contribution.

Durante le ultime tre decadi l'informatica si è dimostrata di fondamentale importanza per estrarre conoscenza dai dati biologici. La quantità di dati prodotta dalle tecnologie di sequenziamento è aumentata più che esponenzialmente, andando più velocemente della legge di Moore e rendendo l'analisi automatica dei dati l'unica via percorribile. I dati prodotti dalle tecnologie di sequenziamento possono essere rappresentati facilmente usando tecniche informatiche. Per molte applicazioni, infatti, il DNA può essere rappresentato come una sequenza di nucleobasi che possono essere a loro volta rappresentate come semplici caratteri. La rappresentazione, l'analisi e l'immagazzinamento di parole e testi è un campo studiato estensivamente in informatica nel quale queste sequenze vengono chiamate stringhe. Una stretta collaborazione tra biologi e informatici è ineluttabile ed estremamente vantaggiosa per entrambi. Da una parte, infatti, i biologi pongono sfide computazioniali agli informatici, stimolandone e motivandone la ricerca, dall'altra gli informatici aiutano i biologi ad estrarre significato biologico da un enorme e imperscrutabile insieme di dati. Uno dei principali problemi affrontati dagli informatici in questo ambito è il Sequence Assembly Problem (SAP) il cui obiettivo è quello di inferire una sequenza S da un insieme di sottosequenze molto più corte di essa. Alla base della maggior parte degli assemblatori risiede una struttura a grafo che rappresenta qualche tipo di relazione tra i read o parte di essi. Le due rappresentazioni a grafo più conosciute ed utilizzate sono i grafi di de Bruijn e gli String graph. Questa tesi si concentra in questo ambito e, più precisamente, propone metodi per rappresentare, costruire e analizzare questi grafi. In particolare, questa tesi propone: (i) metodi per immagazzinare ed interrogare grafi di de Bruijn e (ii) metodi per analizzare String graphs utilizzando poco spazio. Il primo contributo di questa tesi è un nuovo self-index per grafi di de Bruijn basato su Minimal Perfect Hashing. Il contributo principale di questo metodo è una struttura dati dinamica che rappresenta un grafo di de Bruijn in spazio succinto che permette di aggiungere e rimuovere sia nodi che archi. Le strutture dati utilizzate finora sono semi dinamiche poiché permettono di aggiungere solamente nodi ed archi. Di questo contributo viene presentata un'analisi teorica e vengono proposti possibili utilizzi in bioinformatica. Il secondo contributo di questa tesi è un self-index per grafi di de Bruijn che è in grado di rappresentare più grafi contemporaneamente e che permette di analizzarli efficientemente. Questo index è vagamente ispirato ad indici dello stato dell'arte come la bidirectional BWT. Di questo metodo viene presentata un'analisi teorica e vengono proposti possibili utilizzi in bioinformatica. In questa tesi si pone inoltre il focus su metodi efficienti per l'analisi e la costruzione di String graph. In particolare, viene formalizzato un nuovo framework per gli assemblatori basati su String graph nel quale essi vengono rappresentati in spazio succinto utilizzando informazioni relative alla BWT del dataset di riferimento. Questo framework verrà poi utilizzato per mostrare come sia possibile progettare algoritmi di costruzione di String graphs che possono richiedere o poca memoria o poco tempo per essere completati. Forniremo inoltre delle implementazioni di entrambi questi approcci effettuando una comparazione estensiva di questi strumenti con software presenti in letteratura. La comparazione sperimentale mostra che il framework che viene proposto e le sue implementazioni permettono di progettare strumenti che possono essere eseguite efficientemente sia su semplici workstation sia su potenti server, dimostrando la flessibilità della nostra proposta.

(2017). Self-indexing for de novo assembly. (Tesi di dottorato, Università degli Studi di Milano-Bicocca, 2017).