Indexing pangenome graphs is a major algorithmic challenge in computational pangenomics, a recent and active research field that seeks to use graphs as representations of multiple genomes. Since these graphs are constructed from whole genome sequences of a species population, they can become very large, making indexing one of the most challenging problems. In this paper, we propose gindex, a novel indexing approach to solve the Graph Pattern Matching Problem based on the multidollar-BWT. Specifically, gindex aims to find all occurrences of a pattern in a sequence-labeled graph by overcoming two main limitations of GCSA2, one of the most widely used graph indexes: handling queries of arbitrary length and scaling to large graphs without pruning any complex regions. Moreover, we show how a smart preprocessing step can optimize the use of multidollar-BWT to skip small redundant sub-patterns and enhance gindex’s querying capabilities. We demonstrate the effectiveness of our approach by comparing it to GCSA2 in terms of index construction and query time, using different preprocessing modes on three pangenome graphs: one built from Drosophila genomes and two produced by the Human Pangenome Reference Consortium. The results show that gindex can scale on human pangenome graphs - which GCSA2 cannot index using large amounts of RAM - with acceptable memory and time requirements. Moreover, gindex achieves fast query times, although not as fast as GCSA2, which may produce false positives.
Cozzi, D., Riccardi, B., Denti, L., Ciccolella, S., Sadakane, K., Bonizzoni, P. (2025). Pangenome Graph Indexing via the Multidollar-BWT. In Leibniz International Proceedings in Informatics, LIPIcs. Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing [10.4230/LIPIcs.SEA.2025.13].
Pangenome Graph Indexing via the Multidollar-BWT
Cozzi D.
;Riccardi B.;Denti L.;Ciccolella S.;Bonizzoni P.
2025
Abstract
Indexing pangenome graphs is a major algorithmic challenge in computational pangenomics, a recent and active research field that seeks to use graphs as representations of multiple genomes. Since these graphs are constructed from whole genome sequences of a species population, they can become very large, making indexing one of the most challenging problems. In this paper, we propose gindex, a novel indexing approach to solve the Graph Pattern Matching Problem based on the multidollar-BWT. Specifically, gindex aims to find all occurrences of a pattern in a sequence-labeled graph by overcoming two main limitations of GCSA2, one of the most widely used graph indexes: handling queries of arbitrary length and scaling to large graphs without pruning any complex regions. Moreover, we show how a smart preprocessing step can optimize the use of multidollar-BWT to skip small redundant sub-patterns and enhance gindex’s querying capabilities. We demonstrate the effectiveness of our approach by comparing it to GCSA2 in terms of index construction and query time, using different preprocessing modes on three pangenome graphs: one built from Drosophila genomes and two produced by the Human Pangenome Reference Consortium. The results show that gindex can scale on human pangenome graphs - which GCSA2 cannot index using large amounts of RAM - with acceptable memory and time requirements. Moreover, gindex achieves fast query times, although not as fast as GCSA2, which may produce false positives.| File | Dimensione | Formato | |
|---|---|---|---|
|
Cozzi et al-2025-SEA-VoR.pdf
accesso aperto
Descrizione: Pangenome Graph Indexing via the Multidollar-BWT
Tipologia di allegato:
Publisher’s Version (Version of Record, VoR)
Licenza:
Creative Commons
Dimensione
882.32 kB
Formato
Adobe PDF
|
882.32 kB | Adobe PDF | Visualizza/Apri |
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.


