Clerici, F., Nobani, N. (2026). Categorical variable encoding methods for tabular data: a benchmarking study. International Journal of Data Science and Analytics, 22(1). https://doi.org/10.1007/s41060-025-00886-w
Categorical variable encoding methods for tabular data: a benchmarking study
Clerici, F.; Nobani, N.
2026
Abstract
Machine learning models often require numerical inputs, making the encoding of categorical features a critical step in the data preprocessing pipeline. A wide range of encoding methods, such as the commonly used one-hot encoding, is available, but these may not always be optimal because they can increase dimensionality and ignore the inherent relationships between categories. This paper presents a comprehensive evaluation of 26 categorical encoding techniques, benchmarked across 13 real-world datasets and 7 machine learning algorithms. Our study categorizes these methods by predictive task type, model performance, and computational efficiency, offering a taxonomy for selecting encoders. In addition, we illustrate how Safe AI metrics can be applied to encoding pipelines, showing that they provide complementary insights into model robustness and fairness. Finally, we provide a Python tool called EncodeHero that enables researchers and practitioners to (1) extend the results by augmenting the benchmark with their own data and (2) choose the best encoding methodology based on their data and technical constraints.
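To make the abstract's dimensionality point concrete, the following is a minimal plain-Python sketch (not the paper's EncodeHero tool or benchmark) contrasting one-hot encoding, which produces one binary column per distinct category, with a simple ordinal encoding, which keeps a single integer column but imposes an artificial order:

```python
# Minimal illustration of the trade-off mentioned in the abstract:
# one-hot encoding expands a single categorical column into one binary
# column per distinct category, while ordinal encoding keeps one column
# but encodes an arbitrary (alphabetical) order between categories.

def one_hot_encode(values):
    """Map each value to a binary indicator vector over sorted categories."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

def ordinal_encode(values):
    """Map each value to the integer rank of its category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return categories, [index[v] for v in values]

colors = ["red", "green", "blue", "green", "red"]

cats, onehot = one_hot_encode(colors)
_, ordinal = ordinal_encode(colors)

print(cats)       # ['blue', 'green', 'red']
print(onehot[0])  # 'red' -> [0, 0, 1]  (3 columns for 3 categories)
print(ordinal)    # [2, 1, 0, 1, 2]     (1 column, but implies blue < green < red)
```

With a high-cardinality feature (e.g. thousands of distinct values), the one-hot representation grows to thousands of columns, which is one reason the benchmarked alternatives (target, hashing, or similar encoders) can be preferable for some datasets and models.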


