
Campagner, A. (2026). Missing but Not Missed: On Learnability Under Imputation. In Machine Learning and Knowledge Discovery in Databases. Research Track European Conference, ECML PKDD 2025, Porto, Portugal, September 15–19, 2025, Proceedings, Part IV (pp. 344–361). Springer Science and Business Media Deutschland GmbH [10.1007/978-3-032-06078-5_20].

Missing but Not Missed: On Learnability Under Imputation

Campagner A.
First author
2026

Abstract

Missing data is one of the most ubiquitous data quality issues, and also one of the most impactful on machine learning (ML) pipelines. Indeed, not only can most commonly applied ML methods not directly handle incomplete data, but the techniques used to manage this issue can also affect the performance and evaluation of ML models. Among these techniques, imputation, that is, filling in the missing values using information from the observed data, remains among the most popular and effective in practice. Yet, from a theoretical point of view, it is still not clear under which conditions it is possible to learn effectively after imputation. In this article we address this gap by studying learnability under imputation in the framework of statistical learning theory. After giving a general definition of learnability under imputation, we present three main contributions: 1) we introduce a novel stability condition, called noise risk stability, which we prove to be both sufficient and, under weak assumptions, necessary for learnability under imputation; 2) we show that a large class of ML models (including linear and kernel methods) satisfies noise risk stability; 3) we characterize the learning-theoretic properties of two common imputation methods (constant and regression imputation). Our results set the stage for a rigorous study of imputation and missing data management in the framework of statistical learning theory, and also describe relevant open questions.
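The abstract names the two imputation methods the paper analyzes, constant and regression imputation. As a purely illustrative sketch (this is not the paper's formal construction, and the function names and NumPy-based setup here are assumptions for illustration), the two schemes can be written as follows: constant imputation replaces every missing entry with a fixed value, while regression imputation fits a least-squares model on the fully observed columns to predict the missing ones.

```python
import numpy as np

def constant_impute(X, value=0.0):
    """Replace every missing entry (NaN) with a fixed constant."""
    Xc = X.copy()
    Xc[np.isnan(Xc)] = value
    return Xc

def regression_impute(X, col):
    """Impute missing entries of column `col` via least-squares
    regression on the remaining columns (illustrative sketch;
    assumes the other columns are fully observed)."""
    Xc = X.copy()
    miss = np.isnan(Xc[:, col])
    others = [j for j in range(Xc.shape[1]) if j != col]
    # fit a linear model with intercept on the complete rows
    A = np.column_stack([Xc[~miss][:, others], np.ones((~miss).sum())])
    b = Xc[~miss, col]
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    # predict the missing entries
    Ap = np.column_stack([Xc[miss][:, others], np.ones(miss.sum())])
    Xc[miss, col] = Ap @ w
    return Xc
```

For example, on a matrix whose second column is exactly twice the first, regression imputation recovers the missing entry from that linear relation, whereas constant imputation simply inserts the chosen default.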
paper
Imputation; Learnability; Missing Data; Statistical Learning Theory;
English
European Conference, ECML PKDD 2025 - September 15–19, 2025
2025
Ribeiro, RP; Pfahringer, B; Japkowicz, N; Larrañaga, P; Jorge, AM; Soares, C; Abreu, PH; Gama, J
Machine Learning and Knowledge Discovery in Databases. Research Track European Conference, ECML PKDD 2025, Porto, Portugal, September 15–19, 2025, Proceedings, Part IV
9783032060778
30-Sep-2025
2026
16016 LNCS
344
361
reserved
Files in this product:
File: Campagner-2026-ECML PKDD 2025-VoR.pdf
Attachment type: Publisher's Version (Version of Record, VoR)
License: All rights reserved
Size: 688.27 kB
Format: Adobe PDF
Access: restricted (archive managers only)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/574841
Citations
  • Scopus: 0