Clerici, F., Nobani, N. (2026). Categorical variable encoding methods for tabular data: a benchmarking study. International Journal of Data Science and Analytics, 22(1). https://doi.org/10.1007/s41060-025-00886-w
Categorical variable encoding methods for tabular data: a benchmarking study
Clerici, F.; Nobani, N.
2026
Abstract
Machine learning models often require numerical inputs, making the encoding of categorical features a critical step in the data preprocessing pipeline. A wide range of encoding methods, such as the commonly used one-hot encoding, is available, but these may not always be optimal because they can increase dimensionality and ignore the inherent relationships between categories. This paper presents a comprehensive evaluation of 26 categorical encoding techniques, benchmarked across 13 real-world datasets and 7 machine learning algorithms. Our study categorizes these methods by predictive task type, model performance, and computational efficiency, offering a taxonomy for selecting encoders. In addition, we illustrate how Safe AI metrics can be applied to encoding pipelines, showing that they provide complementary insights into model robustness and fairness. Finally, we provide a Python tool called EncodeHero that enables researchers and practitioners to (1) extend the results by augmenting the benchmark with their own data and (2) choose the best encoding methodology based on their data and technical constraints.
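To make the abstract's dimensionality point concrete, the following is a minimal plain-Python sketch (not the paper's EncodeHero tool or benchmark) contrasting one-hot encoding, which produces one binary column per distinct category, with a simple ordinal encoding, which keeps a single integer column but imposes an artificial order:

```python
# Minimal illustration of the trade-off mentioned in the abstract:
# one-hot encoding expands a single categorical column into one binary
# column per distinct category, while ordinal encoding keeps one column
# but encodes an arbitrary (alphabetical) order between categories.

def one_hot_encode(values):
    """Map each value to a binary indicator vector over sorted categories."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

def ordinal_encode(values):
    """Map each value to the integer rank of its category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return categories, [index[v] for v in values]

colors = ["red", "green", "blue", "green", "red"]

cats, onehot = one_hot_encode(colors)
_, ordinal = ordinal_encode(colors)

print(cats)       # ['blue', 'green', 'red']
print(onehot[0])  # 'red' -> [0, 0, 1]  (3 columns for 3 categories)
print(ordinal)    # [2, 1, 0, 1, 2]     (1 column, but implies blue < green < red)
```

With a high-cardinality feature (e.g. thousands of distinct values), the one-hot representation grows to thousands of columns, which is one reason the benchmarked alternatives (target, hashing, or similar encoders) can be preferable for some datasets and models.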


