Bicocca Open Archive

Pirola, Y., Zaccaria, S., Dondi, R., Klau, G., Pisanti, N., Bonizzoni, P. (2016). HapCol: Accurate and memory-efficient haplotype assembly from long reads. BIOINFORMATICS, 32(11), 1610-1617 [10.1093/bioinformatics/btv495].

HapCol: Accurate and memory-efficient haplotype assembly from long reads

Pirola, Y.; Zaccaria, S.; Dondi, R.; Klau, G.; Pisanti, N.; Bonizzoni, P.

2016

Abstract

Motivation: Haplotype assembly is the computational problem of reconstructing haplotypes in diploid organisms and is of fundamental importance for characterizing the effects of single-nucleotide polymorphisms on the expression of phenotypic traits. Haplotype assembly benefits greatly from the advent of 'future-generation' sequencing technologies and their capability to produce long reads at increasing coverage. Existing methods cannot handle such data in a fully satisfactory way, either because accuracy or performance degrades as read length and sequencing coverage increase or because they rely on restrictive assumptions.

Results: By exploiting a feature of future-generation technologies (the uniform distribution of sequencing errors), we designed an exact algorithm, called HapCol, that is exponential in the maximum number of corrections for each single-nucleotide polymorphism position and that minimizes the overall error-correction score. We performed an experimental analysis, comparing HapCol with the current state-of-the-art combinatorial methods on both real and simulated data. On a standard benchmark of real data, we show that HapCol is competitive with state-of-the-art methods, improving the accuracy and the number of phased positions. Furthermore, experiments on realistically simulated datasets revealed that HapCol requires significantly fewer computing resources, especially memory. Thanks to its computational efficiency, HapCol can overcome the limits of previous approaches, making it possible to phase datasets with higher coverage and without the traditional all-heterozygous assumption.

Availability and implementation: Our source code is available under the terms of the GNU General Public License at http://hapcol.algolab.eu/.

Supplementary information: Supplementary data are available at Bioinformatics online.
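To illustrate what an "error-correction score" measures, the following is a minimal sketch, not HapCol's algorithm. It assumes the all-heterozygous setting (the second haplotype is the complement of the first) and exhaustively tries every haplotype, so it is exponential in the number of SNP positions and only usable on toy inputs. The function names and the example reads are hypothetical.

```python
from itertools import product


def correction_cost(fragments, haplotype):
    """Count the allele corrections needed so that every fragment agrees
    with the haplotype or with its complement (all-heterozygous case)."""
    complement = [1 - a for a in haplotype]
    total = 0
    for frag in fragments:
        # None marks a SNP position that the read does not cover.
        d0 = sum(1 for a, h in zip(frag, haplotype) if a is not None and a != h)
        d1 = sum(1 for a, h in zip(frag, complement) if a is not None and a != h)
        # Assign the read to whichever haplotype needs fewer corrections.
        total += min(d0, d1)
    return total


def brute_force_phase(fragments):
    """Return (cost, haplotype) with the minimum total correction count.
    Brute force over all haplotypes: for illustration only."""
    n = len(fragments[0])
    best = None
    for rest in product((0, 1), repeat=n - 1):
        hap = (0,) + rest  # fix the first allele to break the 0/1 symmetry
        cost = correction_cost(fragments, hap)
        if best is None or cost < best[0]:
            best = (cost, hap)
    return best


# Hypothetical reads over four SNP positions (0/1 alleles, None = not covered).
reads = [
    (0, 0, 1, None),
    (0, 0, 1, 1),
    (1, 1, 0, 0),
    (None, 1, 0, 0),
    (0, 1, 1, 1),  # differs from the second read at one position
]
cost, hap = brute_force_phase(reads)
print(cost, hap)
```

The abstract describes an exact algorithm that is instead exponential in the maximum number of corrections per SNP position; the sketch above only shows the quantity being minimized.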

Subtype: Journal article - Scientific article
Keywords: Biochemistry; Molecular Biology; Computational Theory and Mathematics; Computer Science Applications; Computer Vision and Pattern Recognition; Computational Mathematics; Statistics and Probability
Content language: English
Publication date: 2016
Journal: BIOINFORMATICS
Volume: 32
Issue: 11
First page: 1610
Last page: 1617
Article DOI: https://dx.doi.org/10.1093/bioinformatics/btv495
Alternative URL: https://academic.oup.com/bioinformatics/article/32/11/1610/1742594
Full text: open
Appears in types: 01 - Journal article

Files in this item:

File	Size	Format
main.pdf (open access; description: main article; attachment type: Publisher's Version (Version of Record, VoR); license: all rights reserved)	263.61 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/132162
