Campagner, A. (2026). Missing but Not Missed: On Learnability Under Imputation. In Machine Learning and Knowledge Discovery in Databases. Research Track: European Conference, ECML PKDD 2025, Porto, Portugal, September 15–19, 2025, Proceedings, Part IV (pp. 344–361). Springer Science and Business Media Deutschland GmbH. doi:10.1007/978-3-032-06078-5_20
Missing but Not Missed: On Learnability Under Imputation
Campagner A.
First author
2026
Abstract
Missing data is one of the most ubiquitous data quality issues, and also one of the most impactful on machine learning (ML) pipelines: not only can most commonly applied ML methods not directly handle incomplete data, but the techniques used to manage this issue can also affect the performance and evaluation of ML models. Among these techniques, imputation, that is, filling in the missing values using information from the observed data, remains among the most popular and effective in practice. Yet, from a theoretical point of view, it is still not clear under which conditions it is possible to learn effectively after imputation. In this article we address this gap by studying learnability under imputation in the framework of statistical learning theory. After giving a general definition of learnability under imputation, we present three main contributions: 1) we introduce a novel stability condition, called noise risk stability, which we prove to be both sufficient and, under weak assumptions, necessary for learnability under imputation; 2) we show that a large class of ML models (including linear and kernel methods) satisfies noise risk stability; 3) we characterize the learning-theoretic properties of two common imputation methods (constant and regression imputation). Our results set the stage for a rigorous study of imputation and missing data management within statistical learning theory, and we describe relevant open questions.
File: Campagner-2026-ECML PKDD 2025-VoR.pdf
Attachment type: Publisher's Version (Version of Record, VoR)
License: All rights reserved
Access: Archive managers only (request a copy)
Size: 688.27 kB
Format: Adobe PDF
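The abstract above characterizes two common imputation methods, constant and regression imputation. As a purely illustrative sketch (not the paper's code), the following NumPy implementation shows both: `impute_constant` replaces every missing entry with a fixed value, while `impute_regression` fills the missing entries of one column via least-squares regression on the remaining columns, fit on the fully observed rows.

```python
import numpy as np

def impute_constant(X, value=0.0):
    """Constant imputation: replace every NaN in X with a fixed value."""
    X = X.copy()
    X[np.isnan(X)] = value
    return X

def impute_regression(X, col):
    """Regression imputation: predict the missing entries of column `col`
    from the other columns via ordinary least squares."""
    X = X.copy()
    missing = np.isnan(X[:, col])
    other = np.delete(np.arange(X.shape[1]), col)
    # fit on rows where the target column and all predictors are observed
    train = ~missing & ~np.isnan(X[:, other]).any(axis=1)
    A = np.column_stack([np.ones(train.sum()), X[np.ix_(train, other)]])
    coef, *_ = np.linalg.lstsq(A, X[train, col], rcond=None)
    # predict the missing entries from rows with complete predictors
    pred = missing & ~np.isnan(X[:, other]).any(axis=1)
    B = np.column_stack([np.ones(pred.sum()), X[np.ix_(pred, other)]])
    X[pred, col] = B @ coef
    return X
```

For example, on `X = [[1, 2], [2, 4], [3, NaN]]` (second column twice the first), `impute_constant(X)` fills the NaN with 0, while `impute_regression(X, 1)` recovers the value 6 from the linear relation.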
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


