Can Formal Languages Help Pangenomics to Represent and Analyze Multiple Genomes?

Bonizzoni, P; De Felice, C; Pirola, Y; Rizzi, R; Zaccagnino, R; Zizza, R

doi:10.1007/978-3-031-05578-2_1

Graph pangenomics is an emerging field in computational biology that is changing the traditional view of a reference genome from a linear sequence to a new paradigm: a sequence graph (pangenome graph, or simply pangenome) that represents the main similarities and differences among multiple evolutionarily related genomes. The speed at which large amounts of genome data are produced, driven by advances in sequencing technologies, far outpaces the slow progress in developing new methods for constructing and analyzing a pangenome. Most recent advances in the field are still based on notions rooted in the established, and quite old, literature on combinatorics on words, formal languages, and space-efficient data structures. In this paper we discuss two novel notions that may help in managing and analyzing multiple genomes by addressing a relevant question: how can we summarize sequence similarities and dissimilarities in large sequence data? The first notion is related to variants of the Lyndon factorization and makes it possible to represent sequence similarities for a sample of reads, while the second is that of a sample-specific string, a tool for detecting differences in a sample of reads. New perspectives opened by these two notions are discussed.
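
As background for the first notion: the abstract does not spell out which variants of the Lyndon factorization the paper uses, so the sketch below shows only the standard factorization, in which a string is written as a lexicographically non-increasing sequence of Lyndon words (strings strictly smaller than all of their proper rotations). It uses Duval's well-known linear-time algorithm; the function name `lyndon_factorization` and the example strings are illustrative assumptions, not taken from the paper.

```python
def lyndon_factorization(s: str) -> list[str]:
    """Standard Lyndon factorization of s via Duval's algorithm.

    Illustrative sketch only: the paper discusses *variants* of this
    factorization, which are not reproduced here.
    """
    factors = []
    n = len(s)
    i = 0  # start of the part of s not yet factorized
    while i < n:
        j, k = i + 1, i
        # Extend the current run while it stays a prefix of a power
        # of a Lyndon word (a "pre-simple" string).
        while j < n and s[k] <= s[j]:
            if s[k] < s[j]:
                k = i       # run is itself one longer Lyndon word
            else:
                k += 1      # run keeps repeating the current period
            j += 1
        # The run consists of copies of a Lyndon word of length j - k.
        period = j - k
        while i <= k:
            factors.append(s[i:i + period])
            i += period
    return factors


if __name__ == "__main__":
    # A short DNA-like string, chosen only for illustration.
    print(lyndon_factorization("GATTACA"))  # ['G', 'ATT', 'AC', 'A']
```

The factors are non-increasing in lexicographic order and concatenate back to the input, so reads that share a substring will often share factors. That shared-factor behavior is the intuition behind using such factorizations to summarize sequence similarity, though the exact way the paper exploits it is not stated in the abstract.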

Bonizzoni, P., De Felice, C., Pirola, Y., Rizzi, R., Zaccagnino, R., Zizza, R. (2022). Can Formal Languages Help Pangenomics to Represent and Analyze Multiple Genomes? In Developments in Language Theory (pp. 3-12). Cham: Springer [10.1007/978-3-031-05578-2_1].



Contribution type: paper
Keywords: Lyndon factorization, pangenomics, bioinformatics, formal languages
Content language: English
Conference name: 26th International Conference on Developments in Language Theory, DLT 2022 - 9 May 2022 through 13 May 2022
Conference year: 2022
Proceedings title: Developments in Language Theory
Proceedings volume ISBN: 978-3-031-05577-5
Series: Lecture Notes in Computer Science
Publication date: 2022
Volume number: 13257
First page: 3
Last page: 12
Contribution DOI: https://dx.doi.org/10.1007/978-3-031-05578-2_1
Fulltext: open
Appears in types: 02 - Conference contribution

Files in this item:

File	Size	Format
main.pdf (open access; description: Author submitted version; attachment type: Submitted Version (Pre-print))	282.32 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/378700

