Genotype phasing – the process of reconstructing haplotypes from genotype data – is a fundamental problem in genomics with applications in ancestry inference, imputation, and disease association. Traditional phasing methods rely on statistical models or combinatorial approaches which can be computationally expensive, particularly when applied to large-scale reference panels. In this paper, we present a first exploration of using the μ-PBWT (a run-length encoded Positional Burrows–Wheeler Transform) to solve the genotype phasing problem with a reference panel. Leveraging our previous results on positional substrings, we propose an approach that can explain a query genotype if the corresponding haplotype pair exists in the input panel. Moreover, our method is extended to cases where such a pair does not exist, even though some regions should remain unphased if they cannot be explicitly explained using the reference panel. We implemented this method and compared it against Beagle, a state-of-the-art phasing tool, demonstrating that, in the absence of mutations and recombinations, our approach correctly identifies the haplotype pair that explains a genotype query while using seven times less memory than Beagle. However, we also observe that as mutation rates increase, the quality of the phasing decreases as a result of the growing difficulty of identifying consistent haplotype pairs in the presence of sequence variation. These findings highlight the potential of μ-PBWT as an efficient alternative for genotype phasing, particularly in settings where computational resources are limited. The source code is publicly available at https://github.com/dlcgold/muPBWT/tree/phase.

Cozzi, D., Bonizzoni, P., Boucher, C., Langmead, B., Pirola, Y. (2025). Phasing Data from Genotype Queries via the µ-PBWT. In The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday. Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing [10.4230/OASIcs.Manzini.10].

Phasing Data from Genotype Queries via the µ-PBWT

Cozzi D.
;
Bonizzoni P.;Pirola Y.
2025

Abstract

Genotype phasing – the process of reconstructing haplotypes from genotype data – is a fundamental problem in genomics with applications in ancestry inference, imputation, and disease association. Traditional phasing methods rely on statistical models or combinatorial approaches which can be computationally expensive, particularly when applied to large-scale reference panels. In this paper, we present a first exploration of using the μ-PBWT (a run-length encoded Positional Burrows–Wheeler Transform) to solve the genotype phasing problem with a reference panel. Leveraging our previous results on positional substrings, we propose an approach that can explain a query genotype if the corresponding haplotype pair exists in the input panel. Moreover, our method is extended to cases where such a pair does not exist, even though some regions should remain unphased if they cannot be explicitly explained using the reference panel. We implemented this method and compared it against Beagle, a state-of-the-art phasing tool, demonstrating that, in the absence of mutations and recombinations, our approach correctly identifies the haplotype pair that explains a genotype query while using seven times less memory than Beagle. However, we also observe that as mutation rates increase, the quality of the phasing decreases as a result of the growing difficulty of identifying consistent haplotype pairs in the presence of sequence variation. These findings highlight the potential of μ-PBWT as an efficient alternative for genotype phasing, particularly in settings where computational resources are limited. The source code is publicly available at https://github.com/dlcgold/muPBWT/tree/phase.
paper
genotype phasing; minimal position substring cover; Positional Burrows-Wheeler Transform; r-index; set-maximal exact matches;
English
Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday - 25 July 2025
2025
Ferragina P; Gagie T; Navarro G
The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday
9783959773904
2025
131
10
open
Cozzi, D., Bonizzoni, P., Boucher, C., Langmead, B., Pirola, Y. (2025). Phasing Data from Genotype Queries via the µ-PBWT. In The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday. Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing [10.4230/OASIcs.Manzini.10].
File in questo prodotto:
File Dimensione Formato  
Cozzi et al-2025-OpenAccess Series in Informatics-VoR.pdf

accesso aperto

Tipologia di allegato: Publisher’s Version (Version of Record, VoR)
Licenza: Creative Commons
Dimensione 1.11 MB
Formato Adobe PDF
1.11 MB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/10281/572181
Citazioni
  • Scopus 0
  • ???jsp.display-item.citation.isi??? ND
Social impact