Cozzi, D., Bonizzoni, P., Boucher, C., Langmead, B., Pirola, Y. (2025). Phasing Data from Genotype Queries via the µ-PBWT. In The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday. Schloss Dagstuhl – Leibniz-Zentrum für Informatik GmbH, Dagstuhl Publishing [10.4230/OASIcs.Manzini.10].
Phasing Data from Genotype Queries via the µ-PBWT
Cozzi D.; Bonizzoni P.; Pirola Y.
2025
Abstract
Genotype phasing – the process of reconstructing haplotypes from genotype data – is a fundamental problem in genomics with applications in ancestry inference, imputation, and disease association. Traditional phasing methods rely on statistical models or combinatorial approaches that can be computationally expensive, particularly when applied to large-scale reference panels. In this paper, we present a first exploration of using the μ-PBWT (a run-length encoded Positional Burrows–Wheeler Transform) to solve the genotype phasing problem with a reference panel. Leveraging our previous results on positional substrings, we propose an approach that can explain a query genotype if the corresponding haplotype pair exists in the input panel. We also extend the method to cases where no such pair exists, although regions that cannot be explicitly explained using the reference panel should remain unphased. We implemented this method and compared it against Beagle, a state-of-the-art phasing tool. In the absence of mutations and recombinations, our approach correctly identifies the haplotype pair that explains a genotype query while using seven times less memory than Beagle. However, we also observe that as mutation rates increase, the quality of the phasing decreases, because consistent haplotype pairs become harder to identify in the presence of sequence variation. These findings highlight the potential of μ-PBWT as an efficient alternative for genotype phasing, particularly in settings where computational resources are limited. The source code is publicly available at https://github.com/dlcgold/muPBWT/tree/phase.

| File | Size | Format | |
|---|---|---|---|
| Cozzi et al-2025-OpenAccess Series in Informatics-VoR.pdf (open access; attachment type: Publisher’s Version (Version of Record, VoR); license: Creative Commons) | 1.11 MB | Adobe PDF | View/Open |
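
As an illustration of what the abstract means by a haplotype pair that "explains" a query genotype, the following minimal Python sketch checks a small panel by direct lookup. It is not the μ-PBWT method of the paper: it is a naive scan that assumes biallelic sites, genotypes encoded as alternate-allele counts (0, 1 or 2), and that the same panel haplotype may be used for both copies. The names `complement` and `explain` are hypothetical.

```python
from typing import Dict, Optional, Sequence, Tuple

Haplotype = Tuple[int, ...]  # one allele (0 or 1) per site
Genotype = Sequence[int]     # alternate-allele count (0, 1 or 2) per site


def complement(genotype: Genotype, hap: Haplotype) -> Optional[Haplotype]:
    """Return the haplotype that, paired with `hap`, sums to `genotype`.

    Returns None when `hap` is incompatible with the genotype at some site.
    """
    other = tuple(g - h for g, h in zip(genotype, hap))
    return other if all(a in (0, 1) for a in other) else None


def explain(genotype: Genotype, panel: Sequence[Haplotype]) -> Optional[Tuple[int, int]]:
    """Return panel indices (i, j) of a pair explaining `genotype`, or None.

    Naive illustration only: one hash lookup per panel haplotype,
    not the compressed-index search described in the paper.
    """
    first_index: Dict[Haplotype, int] = {}
    for idx, hap in enumerate(panel):
        first_index.setdefault(hap, idx)
    for i, hap in enumerate(panel):
        other = complement(genotype, hap)
        if other is not None and other in first_index:
            return i, first_index[other]
    return None


# Hypothetical toy panel with four haplotypes over four sites.
panel = [(0, 0, 1, 1), (0, 1, 0, 1), (1, 1, 0, 0), (0, 1, 1, 0)]
print(explain([1, 2, 0, 1], panel))  # (1, 2): panel[1] + panel[2] matches site by site
print(explain([2, 2, 2, 2], panel))  # None: no pair in the panel explains this query
```

When `explain` returns `None`, the query falls into the case the abstract describes where no explaining pair exists in the panel, which this sketch does not attempt to handle.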
Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.


