Bicocca Open Archive

Indexing pangenome graphs is a major algorithmic challenge in computational pangenomics, a recent and active research field that seeks to use graphs as representations of multiple genomes. Since these graphs are constructed from whole genome sequences of a species population, they can become very large, making indexing one of the most challenging problems. In this paper, we propose gindex, a novel indexing approach to solve the Graph Pattern Matching Problem based on the multidollar-BWT. Specifically, gindex aims to find all occurrences of a pattern in a sequence-labeled graph by overcoming two main limitations of GCSA2, one of the most widely used graph indexes: handling queries of arbitrary length and scaling to large graphs without pruning any complex regions. Moreover, we show how a smart preprocessing step can optimize the use of multidollar-BWT to skip small redundant sub-patterns and enhance gindex’s querying capabilities. We demonstrate the effectiveness of our approach by comparing it to GCSA2 in terms of index construction and query time, using different preprocessing modes on three pangenome graphs: one built from Drosophila genomes and two produced by the Human Pangenome Reference Consortium. The results show that gindex can scale on human pangenome graphs - which GCSA2 cannot index using large amounts of RAM - with acceptable memory and time requirements. Moreover, gindex achieves fast query times, although not as fast as GCSA2, which may produce false positives.

Cozzi, D., Riccardi, B., Denti, L., Ciccolella, S., Sadakane, K., Bonizzoni, P. (2025). Pangenome Graph Indexing via the Multidollar-BWT. In Leibniz International Proceedings in Informatics, LIPIcs. Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing [10.4230/LIPIcs.SEA.2025.13].

Pangenome Graph Indexing via the Multidollar-BWT

Cozzi D.;Riccardi B.;Denti L.;Ciccolella S.;Sadakane K.;Bonizzoni P.

2025

Abstract

Indexing pangenome graphs is a major algorithmic challenge in computational pangenomics, a recent and active research field that seeks to use graphs as representations of multiple genomes. Since these graphs are constructed from whole genome sequences of a species population, they can become very large, making indexing one of the most challenging problems. In this paper, we propose gindex, a novel indexing approach to solve the Graph Pattern Matching Problem based on the multidollar-BWT. Specifically, gindex aims to find all occurrences of a pattern in a sequence-labeled graph by overcoming two main limitations of GCSA2, one of the most widely used graph indexes: handling queries of arbitrary length and scaling to large graphs without pruning any complex regions. Moreover, we show how a smart preprocessing step can optimize the use of multidollar-BWT to skip small redundant sub-patterns and enhance gindex’s querying capabilities. We demonstrate the effectiveness of our approach by comparing it to GCSA2 in terms of index construction and query time, using different preprocessing modes on three pangenome graphs: one built from Drosophila genomes and two produced by the Human Pangenome Reference Consortium. The results show that gindex can scale on human pangenome graphs - which GCSA2 cannot index using large amounts of RAM - with acceptable memory and time requirements. Moreover, gindex achieves fast query times, although not as fast as GCSA2, which may produce false positives.

Scheda breve

Scheda completa

Scheda completa (DC)

	Tipo di intervento
	
				paper
			
	Parole chiave
	
				Graph Index; Graph Pattern Matching; Multidollar-BWT; Pangenome Graph;
			
	Lingua del contenuto
	
				English
			
	Nome del convegno
	
				23rd International Symposium on Experimental Algorithms, SEA 2025 - 22 July 2025 - 24 July 2025
			
	Anno del convegno
	
				2025
			
	Curatori della monografia
	
				Mutzel, P; Prezza, N
			
	Titolo degli atti
	
				Leibniz International Proceedings in Informatics, LIPIcs
			
	ISBN del volume degli atti
	
				9783959773751
			
	Collana o serie
	
				LEIBNIZ INTERNATIONAL PROCEEDINGS IN INFORMATICS
			
	Data di pubblicazione
	
				2025
			
	Numero del volume
	
				338
			
	Article number
	
				13
			
	DOI dell'intervento
	
				https://dx.doi.org/10.4230/LIPIcs.SEA.2025.13
			
	Fulltext
	
				open
			
	Citazione
	
				Cozzi, D., Riccardi, B., Denti, L., Ciccolella, S., Sadakane, K., Bonizzoni, P. (2025). Pangenome Graph Indexing via the Multidollar-BWT. In Leibniz International Proceedings in Informatics, LIPIcs. Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing [10.4230/LIPIcs.SEA.2025.13].
			
	Appare nelle tipologie:
	
				02 - Intervento a convegno

File in questo prodotto:

File	Dimensione	Formato
Cozzi et al-2025-SEA-VoR.pdf accesso aperto Descrizione: Pangenome Graph Indexing via the Multidollar-BWT Tipologia di allegato: Publisher’s Version (Version of Record, VoR) Licenza: Creative Commons Dimensione 882.32 kB Formato Adobe PDF Visualizza/Apri	882.32 kB	Adobe PDF	Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/10281/568802

Citazioni

0

ND

Social impact