Mezzanzanica, M., Boselli, R., Cesarini, M., Mercorio, F. (2015). A model-based evaluation of data quality activities in KDD. INFORMATION PROCESSING & MANAGEMENT, 51(2), 144-166 [10.1016/j.ipm.2014.07.007].

A model-based evaluation of data quality activities in KDD

MEZZANZANICA, MARIO (first author);
BOSELLI, ROBERTO (second author);
CESARINI, MIRKO (penultimate author);
MERCORIO, FABIO (last author)
2015

Abstract

We live in the Information Age, where most personal, business, and administrative data are collected and managed electronically. However, poor data quality may undermine the effectiveness of knowledge discovery processes, which makes the development of data improvement steps a significant concern. In this paper we propose the Multidimensional Robust Data Quality Analysis, a domain-independent technique aimed at improving data quality by evaluating the effectiveness of a black-box cleansing function. The proposed approach has been realized through model checking techniques and then applied to a weakly structured dataset describing the working careers of millions of people. Our experimental outcomes show the effectiveness of our model-based approach to data quality: they provide a fine-grained analysis of both the source dataset and the cleansing procedures, enabling domain experts to identify the most relevant quality issues as well as the action points for improving the cleansing activities. Finally, an anonymized version of the dataset and the analysis results have been made publicly available to the community.
Type: Journal article - Scientific article
Keywords: Data cleansing; Data quality; Model checking; Real-life application
Language: English
Year: 2015
Volume: 51
Issue: 2
Pages: 144-166
Access: partially open
Files in this item:

IPM2015.pdf
Access: archive managers only
Attachment type: Publisher’s Version (Version of Record, VoR)
Size: 3.18 MB
Format: Adobe PDF

IPM_post.pdf
Access: open access
Attachment type: Author’s Accepted Manuscript, AAM (Post-print)
Size: 2.82 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/58799
Citations
  • Scopus: 33
  • ISI (Web of Science): 23