Bicocca Open Archive

Cozzi, D., Bonizzoni, P., Boucher, C., Langmead, B., Pirola, Y. (2025). Phasing Data from Genotype Queries via the µ-PBWT. In The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday. Schloss Dagstuhl – Leibniz-Zentrum für Informatik GmbH, Dagstuhl Publishing [10.4230/OASIcs.Manzini.10].

Phasing Data from Genotype Queries via the µ-PBWT

Cozzi D.; Bonizzoni P.; Boucher C.; Langmead B.; Pirola Y.

2025

Abstract

Genotype phasing – the process of reconstructing haplotypes from genotype data – is a fundamental problem in genomics with applications in ancestry inference, imputation, and disease association. Traditional phasing methods rely on statistical models or combinatorial approaches, which can be computationally expensive, particularly when applied to large-scale reference panels. In this paper, we present a first exploration of using the μ-PBWT (a run-length encoded Positional Burrows–Wheeler Transform) to solve the genotype phasing problem with a reference panel. Leveraging our previous results on positional substrings, we propose an approach that can explain a query genotype if the corresponding haplotype pair exists in the input panel. We also extend the method to cases where no such pair exists, although regions that cannot be explicitly explained using the reference panel should then remain unphased. We implemented this method and compared it against Beagle, a state-of-the-art phasing tool. In the absence of mutations and recombinations, our approach correctly identifies the haplotype pair that explains a genotype query while using one-seventh of the memory used by Beagle. However, as mutation rates increase, phasing quality decreases, because consistent haplotype pairs become harder to identify in the presence of sequence variation. These findings highlight the potential of μ-PBWT as an efficient alternative for genotype phasing, particularly in settings where computational resources are limited. The source code is publicly available at https://github.com/dlcgold/muPBWT/tree/phase.
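To make the notion of a haplotype pair that "explains" a genotype concrete, the following is a minimal brute-force sketch. It is not the μ-PBWT method described in the abstract and makes no attempt at its space or time efficiency. It assumes biallelic sites, haplotypes encoded as 0/1 alleles, and genotypes encoded as per-site allele counts (0, 1, or 2). The function names `explains` and `find_explaining_pair` are hypothetical, as is the choice to let the same panel row be used twice.

```python
from itertools import combinations_with_replacement
from typing import Optional, Sequence, Tuple

Haplotype = Sequence[int]  # assumed encoding: one 0/1 allele per site
Genotype = Sequence[int]   # assumed encoding: allele count 0, 1, or 2 per site


def explains(h1: Haplotype, h2: Haplotype, genotype: Genotype) -> bool:
    """Return True if the two haplotypes sum to the genotype at every site."""
    if not (len(h1) == len(h2) == len(genotype)):
        return False
    return all(a + b == g for a, b, g in zip(h1, h2, genotype))


def find_explaining_pair(
    panel: Sequence[Haplotype], genotype: Genotype
) -> Optional[Tuple[int, int]]:
    """Brute-force search for panel row indices (i, j), i <= j, that explain
    the genotype. Returns None when no such pair exists in the panel.

    Assumption: the same row may be used twice (i == j), modelling a query
    that carries two copies of one panel haplotype.
    """
    for i, j in combinations_with_replacement(range(len(panel)), 2):
        if explains(panel[i], panel[j], genotype):
            return i, j
    return None


# Toy reference panel: three haplotypes over four sites.
panel = [
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]

# Rows 1 and 2 sum to [1, 1, 1, 1] site by site, so they explain this query.
print(find_explaining_pair(panel, [1, 1, 1, 1]))
# No pair of panel rows sums to 2 at every site, so the result is None.
print(find_explaining_pair(panel, [2, 2, 2, 2]))
```

The search compares every pair of rows at every site, so its cost grows quadratically with the panel size. This is the kind of cost that motivates compressed indexes for large panels.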

Contribution type	paper
Keywords	genotype phasing; minimal position substring cover; Positional Burrows-Wheeler Transform; r-index; set-maximal exact matches
Content language	English
Conference name	Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday - 25 July 2025
Conference year	2025
Volume editors	Ferragina P; Gagie T; Navarro G
Proceedings title	The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday
Proceedings volume ISBN	9783959773904
Series	OPEN ACCESS SERIES IN INFORMATICS
Publication date	2025
Volume number	131
Article number	10
DOI	https://dx.doi.org/10.4230/OASIcs.Manzini.10
Fulltext	open
Citation	Cozzi, D., Bonizzoni, P., Boucher, C., Langmead, B., Pirola, Y. (2025). Phasing Data from Genotype Queries via the µ-PBWT. In The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday. Schloss Dagstuhl – Leibniz-Zentrum für Informatik GmbH, Dagstuhl Publishing [10.4230/OASIcs.Manzini.10].
Appears in types	02 - Conference contribution

Files in this item:

File	Size	Format
Cozzi et al-2025-OpenAccess Series in Informatics-VoR.pdf (open access; attachment type: Publisher’s Version (Version of Record, VoR); license: Creative Commons)	1.11 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/572181
