Bicocca Open Archive

Motivation: The size of current protein databases is a challenge for many bioinformatics applications, in terms of both processing speed and information redundancy. It may therefore be desirable to efficiently reduce the database of interest to a maximally representative subset.

Results: The MinSet method combines a Suffix Tree and a Genetic Algorithm for the generation, selection and assessment of database subsets. The approach is generally applicable to any type of string-encoded data and allows a drastic reduction of the database size whilst retaining most of the information contained in the original set. We demonstrate the performance of the method on a database of protein domain structures encoded as strings. Using the SCOP40 domain database, we translated protein structures into character strings by means of a structural alphabet and extracted optimized subsets according to an entropy score based on a constant-length fragment dictionary. The optimized subsets are therefore maximally representative of the distribution and range of local structures. Subsets containing only 10% of the SCOP structure classes show a coverage of >90% for fragments of length 1-4.

Availability: http://mathbio.nimr.mrc.ac.uk/~jkleinj/MinSet
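The notions of a constant-length fragment dictionary, an entropy score and fragment coverage can be illustrated with a minimal Python sketch. It is not the MinSet implementation: the abstract does not give the exact score, so Shannon entropy over fragment frequencies is assumed here, the function names are hypothetical, and the toy strings are invented rather than taken from SCOP. The Suffix Tree and Genetic Algorithm are replaced by plain counting and a hand-picked subset.

```python
from collections import Counter
from math import log2


def fragment_counts(strings, k):
    """Count every overlapping fragment of length k across all strings."""
    counts = Counter()
    for s in strings:
        # Strings shorter than k yield an empty range and contribute nothing.
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def shannon_entropy(counts):
    """Shannon entropy (in bits) of the fragment frequency distribution.

    Assumption: used here as a stand-in for the entropy score of the paper.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * log2(n / total) for n in counts.values())


def coverage(subset, full_set, k):
    """Fraction of distinct length-k fragments of full_set found in subset."""
    full = set(fragment_counts(full_set, k))
    if not full:
        return 0.0
    found = set(fragment_counts(subset, k))
    return len(full & found) / len(full)


# Toy data: hypothetical structural-alphabet strings, not taken from SCOP.
database = ["ABAB", "ABAB", "BABA", "AABB"]
subset = ["ABAB", "AABB"]

# Report coverage and entropy of the subset for fragment lengths 1 to 4.
for k in range(1, 5):
    cov = coverage(subset, database, k)
    ent = shannon_entropy(fragment_counts(subset, k))
    print(f"k={k} coverage={cov:.2f} entropy={ent:.3f}")
```

In a full pipeline, a search procedure such as a genetic algorithm would propose many candidate subsets and keep those with the best score; the sketch only shows how a single candidate could be scored.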

Pandini, A., Bonati, L., Fraternali, F., Kleinjung, J. (2007). MinSet: a general approach to derive maximally representative database subsets by using fragment dictionaries and its application to the SCOP database. BIOINFORMATICS, 23(4), 515-516 [10.1093/bioinformatics/btl637].

MinSet: a general approach to derive maximally representative database subsets by using fragment dictionaries and its application to the SCOP database

Pandini, A.; Bonati, L.; Fraternali, F.; Kleinjung, J.

2007



Subtype: Journal article - Scientific article
Keywords: protein structures; structural alphabet; database subset
Content language: English
Publication date: Dec 2007
Journal: BIOINFORMATICS
Volume: 23
Issue: 4
First page: 515
Last page: 516
Article DOI: https://dx.doi.org/10.1093/bioinformatics/btl637
Full text: none
Appears in types: 01 - Journal article

Files in this item:

There are no files associated with this item.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/3233
