Can Formal Languages Help Pangenomics to Represent and Analyze Multiple Genomes?

Bonizzoni, P; De Felice, C; Pirola, Y; Rizzi, R; Zaccagnino, R; Zizza, R

doi:10.1007/978-3-031-05578-2_1

Graph pangenomics is an emerging field in computational biology that is changing the traditional view of a reference genome from a linear sequence to a new paradigm: a sequence graph (pangenome graph, or simply pangenome) that represents the main similarities and differences among multiple evolutionarily related genomes. The speed at which genome data are produced, driven by advances in sequencing technologies, far outpaces the slow progress in developing new methods for constructing and analyzing a pangenome. Most recent advances in the field are still based on notions rooted in established, long-standing literature on combinatorics on words, formal languages, and space-efficient data structures. In this paper we discuss two novel notions that may help in managing and analyzing multiple genomes by addressing a relevant question: how can we summarize sequence similarities and dissimilarities in large sequence data? The first notion is related to variants of the Lyndon factorization and makes it possible to represent sequence similarities for a sample of reads, while the second is that of sample-specific string as a tool to detect differences in a sample of reads. New perspectives opened by these two notions are discussed.
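For readers unfamiliar with the first notion, the following is a minimal sketch of the standard Lyndon factorization, computed with Duval's algorithm. It is not the authors' method: the abstract refers to variants of the Lyndon factorization, which are not reproduced here. The function name `lyndon_factorization` and the example strings are illustrative assumptions.

```python
def lyndon_factorization(s: str) -> list[str]:
    """Standard Lyndon factorization via Duval's algorithm (linear time).

    Returns the unique list of Lyndon words w1 >= w2 >= ... >= wk
    (lexicographic order) whose concatenation is s.
    Illustrative sketch only; the paper discusses variants of this notion.
    """
    factors = []
    i, n = 0, len(s)
    while i < n:
        # j scans ahead; k tracks the position being compared against s[j]
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            if s[k] < s[j]:
                k = i        # the current candidate is still a single Lyndon word
            else:
                k += 1       # the candidate keeps repeating its own prefix
            j += 1
        # emit every full repetition of the Lyndon word of length j - k
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


if __name__ == "__main__":
    print(lyndon_factorization("banana"))  # ['b', 'an', 'an', 'a']
```

Concatenating the returned factors always gives back the input, and the factors are lexicographically non-increasing. Those two properties make the factorization unique.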

Bonizzoni, P., De Felice, C., Pirola, Y., Rizzi, R., Zaccagnino, R., Zizza, R. (2022). Can Formal Languages Help Pangenomics to Represent and Analyze Multiple Genomes? In Developments in Language Theory (pp. 3-12). Cham: Springer [10.1007/978-3-031-05578-2_1].

Contribution type: paper
Keywords: Lyndon factorization, pangenomics, bioinformatics, formal languages
Content language: English
Conference name: 26th International Conference on Developments in Language Theory, DLT 2022 - 9 May 2022 through 13 May 2022
Conference year: 2022
Proceedings title: Developments in Language Theory
Proceedings volume ISBN: 978-3-031-05577-5
Series: Lecture Notes in Computer Science
Publication date: 2022
Volume number: 13257
First page: 3
Last page: 12
DOI: https://dx.doi.org/10.1007/978-3-031-05578-2_1
Full text: open
Appears in types: 02 - Conference contribution

Files in this item:

File	Size	Format
main.pdf (open access; Description: Author submitted version; Attachment type: Submitted Version (Pre-print))	282.32 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/378700

Bicocca Open Archive
