Bicocca Open Archive

Chicco, D., Oneto, L., Tavazzi, E. (2022). Eleven quick tips for data cleaning and feature engineering. PLOS COMPUTATIONAL BIOLOGY, 18(12), 1-21 [10.1371/journal.pcbi.1010718].

Eleven quick tips for data cleaning and feature engineering

Chicco, D; Oneto, L; Tavazzi, E

2022

Abstract

Applying computational statistics or machine learning methods to data is a key component of many scientific studies in any field, but on its own it might not be sufficient to generate robust and reliable results. Before applying any discovery method, preprocessing steps are necessary to prepare the data for computational analysis. In this framework, data cleaning and feature engineering are key pillars of any scientific study involving data analysis, and they should be adequately designed and performed from the earliest phases of the project. We call a “feature” a variable describing a particular trait of a person or an observation, usually recorded as a column in a dataset. Although pivotal, these data cleaning and feature engineering steps are sometimes done poorly or inefficiently, especially by beginners and inexperienced researchers. For this reason, we propose here our quick tips on how to carry out these important preprocessing steps correctly, avoiding common mistakes and pitfalls. Although we designed these guidelines with bioinformatics and health informatics scenarios in mind, we believe they can be applied more generally to any scientific area. We therefore target these guidelines at any researcher or practitioner wanting to perform data cleaning or feature engineering. We believe our simple recommendations can help researchers and scholars perform better computational analyses that can lead, in turn, to more solid outcomes and more reliable discoveries.

Subtype	Journal article - Scientific article
Keywords	Computational Biology; Engineering; Humans; Machine Learning
Content language	English
Publication date	2022
Journal	PLOS COMPUTATIONAL BIOLOGY
Volume	18
Issue	12
First page	1
Last page	21
Article number	e1010718
Article DOI	https://dx.doi.org/10.1371/journal.pcbi.1010718
Fulltext	open
Citation	Chicco, D., Oneto, L., Tavazzi, E. (2022). Eleven quick tips for data cleaning and feature engineering. PLOS COMPUTATIONAL BIOLOGY, 18(12), 1-21 [10.1371/journal.pcbi.1010718].
Appears in types	01 - Journal article

Files in this item:

File	Access	Description	Attachment type	License	Size	Format
Chicco-2022-Plos Computat Biol-VoR.pdf	Open access	Article	Publisher’s Version (Version of Record, VoR)	Creative Commons	507.09 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/430240
