Indexing pangenome graphs is a major algorithmic challenge in computational pangenomics, a recent and active research field that seeks to use graphs as representations of multiple genomes. Since these graphs are constructed from whole genome sequences of a species population, they can become very large, making indexing one of the most challenging problems. In this paper, we propose gindex, a novel indexing approach to solve the Graph Pattern Matching Problem based on the multidollar-BWT. Specifically, gindex aims to find all occurrences of a pattern in a sequence-labeled graph by overcoming two main limitations of GCSA2, one of the most widely used graph indexes: handling queries of arbitrary length and scaling to large graphs without pruning any complex regions. Moreover, we show how a smart preprocessing step can optimize the use of multidollar-BWT to skip small redundant sub-patterns and enhance gindex’s querying capabilities. We demonstrate the effectiveness of our approach by comparing it to GCSA2 in terms of index construction and query time, using different preprocessing modes on three pangenome graphs: one built from Drosophila genomes and two produced by the Human Pangenome Reference Consortium. The results show that gindex can scale on human pangenome graphs - which GCSA2 cannot index using large amounts of RAM - with acceptable memory and time requirements. Moreover, gindex achieves fast query times, although not as fast as GCSA2, which may produce false positives.

Cozzi, D., Riccardi, B., Denti, L., Ciccolella, S., Sadakane, K., Bonizzoni, P. (2025). Pangenome Graph Indexing via the Multidollar-BWT. In Leibniz International Proceedings in Informatics, LIPIcs. Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing [10.4230/LIPIcs.SEA.2025.13].

Pangenome Graph Indexing via the Multidollar-BWT

Cozzi D.
;
Riccardi B.;Denti L.;Ciccolella S.;Bonizzoni P.
2025

Abstract

Indexing pangenome graphs is a major algorithmic challenge in computational pangenomics, a recent and active research field that seeks to use graphs as representations of multiple genomes. Since these graphs are constructed from whole genome sequences of a species population, they can become very large, making indexing one of the most challenging problems. In this paper, we propose gindex, a novel indexing approach to solve the Graph Pattern Matching Problem based on the multidollar-BWT. Specifically, gindex aims to find all occurrences of a pattern in a sequence-labeled graph by overcoming two main limitations of GCSA2, one of the most widely used graph indexes: handling queries of arbitrary length and scaling to large graphs without pruning any complex regions. Moreover, we show how a smart preprocessing step can optimize the use of multidollar-BWT to skip small redundant sub-patterns and enhance gindex’s querying capabilities. We demonstrate the effectiveness of our approach by comparing it to GCSA2 in terms of index construction and query time, using different preprocessing modes on three pangenome graphs: one built from Drosophila genomes and two produced by the Human Pangenome Reference Consortium. The results show that gindex can scale on human pangenome graphs - which GCSA2 cannot index using large amounts of RAM - with acceptable memory and time requirements. Moreover, gindex achieves fast query times, although not as fast as GCSA2, which may produce false positives.
paper
Graph Index; Graph Pattern Matching; Multidollar-BWT; Pangenome Graph;
English
23rd International Symposium on Experimental Algorithms, SEA 2025 - 22 July 2025 - 24 July 2025
2025
Mutzel, P; Prezza, N
Leibniz International Proceedings in Informatics, LIPIcs
9783959773751
2025
338
13
open
Cozzi, D., Riccardi, B., Denti, L., Ciccolella, S., Sadakane, K., Bonizzoni, P. (2025). Pangenome Graph Indexing via the Multidollar-BWT. In Leibniz International Proceedings in Informatics, LIPIcs. Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing [10.4230/LIPIcs.SEA.2025.13].
File in questo prodotto:
File Dimensione Formato  
Cozzi et al-2025-SEA-VoR.pdf

accesso aperto

Descrizione: Pangenome Graph Indexing via the Multidollar-BWT
Tipologia di allegato: Publisher’s Version (Version of Record, VoR)
Licenza: Creative Commons
Dimensione 882.32 kB
Formato Adobe PDF
882.32 kB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/10281/568802
Citazioni
  • Scopus 0
  • ???jsp.display-item.citation.isi??? ND
Social impact