The large amount of short read data that has to be assembled in future applications, such as in metagenomics or cancer genomics, strongly motivates the investigation of disk-based approaches to index next-generation sequencing (NGS) data. Positive results in this direction stimulate the investigation of efficient external memory algorithms for de novo assembly from NGS data. Our article is also motivated by the open problem of designing a space-efficient algorithm to compute a string graph using an indexing procedure based on the Burrows-Wheeler transform (BWT). We have developed a disk-based algorithm for computing string graphs in external memory: the light string graph (LSG). LSG relies on a new representation of the FM-index that is exploited to use an amount of main memory requirement that is independent from the size of the data set. Moreover, we have developed a pipeline for genome assembly from NGS data that integrates LSG with the assembly step of SGA (Simpson and Durbin, 2012), a state-of-the-art string graph-based assembler, and uses BEETL for indexing the input data. LSG is open source software and is available online. We have analyzed our implementation on a 875-million read whole-genome dataset, on which LSG has built the string graph using only 1GB of main memory (reducing the memory occupation by a factor of 50 with respect to SGA), while requiring slightly more than twice the time than SGA. The analysis of the entire pipeline shows an important decrease in memory usage, while managing to have only a moderate increase in the running time.

Bonizzoni, P., DELLA VEDOVA, G., Pirola, Y., Previtali, M., Rizzi, R. (2016). LSG: An External-Memory Tool to Compute String Graphs for Next-Generation Sequencing Data Assembly. JOURNAL OF COMPUTATIONAL BIOLOGY, 23(3), 137-149 [10.1089/cmb.2015.0172].

LSG: An External-Memory Tool to Compute String Graphs for Next-Generation Sequencing Data Assembly

BONIZZONI, PAOLA
;
DELLA VEDOVA, GIANLUCA;PIROLA, YURI;PREVITALI, MARCO;RIZZI, RAFFAELLA
2016

Abstract

The large amount of short read data that has to be assembled in future applications, such as in metagenomics or cancer genomics, strongly motivates the investigation of disk-based approaches to index next-generation sequencing (NGS) data. Positive results in this direction stimulate the investigation of efficient external memory algorithms for de novo assembly from NGS data. Our article is also motivated by the open problem of designing a space-efficient algorithm to compute a string graph using an indexing procedure based on the Burrows-Wheeler transform (BWT). We have developed a disk-based algorithm for computing string graphs in external memory: the light string graph (LSG). LSG relies on a new representation of the FM-index that is exploited to use an amount of main memory requirement that is independent from the size of the data set. Moreover, we have developed a pipeline for genome assembly from NGS data that integrates LSG with the assembly step of SGA (Simpson and Durbin, 2012), a state-of-the-art string graph-based assembler, and uses BEETL for indexing the input data. LSG is open source software and is available online. We have analyzed our implementation on a 875-million read whole-genome dataset, on which LSG has built the string graph using only 1GB of main memory (reducing the memory occupation by a factor of 50 with respect to SGA), while requiring slightly more than twice the time than SGA. The analysis of the entire pipeline shows an important decrease in memory usage, while managing to have only a moderate increase in the running time.
Articolo in rivista - Articolo scientifico
Burrows-Wheeler transform; external-memory algorithms; string graphs;
English
2016
23
3
137
149
reserved
Bonizzoni, P., DELLA VEDOVA, G., Pirola, Y., Previtali, M., Rizzi, R. (2016). LSG: An External-Memory Tool to Compute String Graphs for Next-Generation Sequencing Data Assembly. JOURNAL OF COMPUTATIONAL BIOLOGY, 23(3), 137-149 [10.1089/cmb.2015.0172].
File in questo prodotto:
File Dimensione Formato  
LSG An xternal-Memory Tool to Compute String Gra.pdf

Solo gestori archivio

Tipologia di allegato: Publisher’s Version (Version of Record, VoR)
Dimensione 286.11 kB
Formato Adobe PDF
286.11 kB Adobe PDF   Visualizza/Apri   Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/10281/105592
Citazioni
  • Scopus 11
  • ???jsp.display-item.citation.isi??? 10
Social impact