Improving Data Cleansing Techniques on Administrative Databases

Boselli, Roberto; Cesarini, Mirko; Mercorio, Fabio; Mezzanzanica, Mario
2013

Abstract

Business and governmental applications, web applications, and the ongoing interactions between citizens and public administrations generate a great deal of data, a relevant subset of which can be considered longitudinal data. Such data are often used in decision-making activities such as active policy design and implementation, resource allocation, and service design and improvement. Unfortunately, the lower the quality of the data, the lower the reliability of the information derived from them. Hence, data cleansing activities play a key role in ensuring the effectiveness of the decision-making process. In the last decade a great effort has been made by both industrial and academic communities to develop algorithms and tools for assessing data quality, dealing with a wide range of dimensions (e.g., consistency, accuracy, believability) across several fields (e.g., government, statistics, computer science). Nevertheless, scalability issues often affect theoretical methods, since real-world datasets are frequently huge, while the lack of formality of many data cleansing techniques may undermine the reliability of the cleansed data. Therefore, applying such approaches to real-world domains remains a challenging issue. This work aims to combine empirical and theoretical approaches, exploiting the capabilities of both to assess and improve data cleansing procedures, and provides experimental results in a motivating application domain. We focus on a scenario where the well-known ETL (Extract, Transform, Load) technique has been used to generate a new (cleansed) dataset from the original one. We then enhance the ETL process by assessing its results through Robust Data Quality Analysis (RDQA), a model-checking-based technique that evaluates the consistency of both the source dataset and the cleansed one, providing useful insights into how the ETL procedures can be improved. We applied this methodology to a real application domain, namely the "Mandatory Notification System" designed by the Italian Ministry of Labour and Welfare, which stores data concerning employment and active labour market policies for Italian citizens. Such data are stored in several databases managed at the territorial level. In this context, the data used for decision making by policy makers and civil servants should be carefully and effectively managed, given the social relevance of labour market dynamics. We evaluated our approach on a database containing the career data of more than 5.5 million people, i.e., the citizens living in an Italian Region. Thanks to the joint exploitation of the ETL and RDQA techniques, we performed a fine-grained evaluation of the data cleansing results.
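
The abstract describes RDQA as a model-checking-based consistency check over longitudinal career data. As a rough illustration of that idea, the minimal Python sketch below checks a toy consistency model in which a career is a time-ordered sequence of job start and cessation events; the event vocabulary, the function name career_is_consistent, and the two consistency rules are assumptions made for illustration here, not the paper's actual RDQA model or implementation.

```python
# Toy consistency check inspired by the RDQA idea in the abstract:
# a career is a time-ordered sequence of events, and a small model
# (here: two simple rules) decides whether the sequence is consistent.
# Event names and rules are illustrative assumptions.
from typing import Iterable, Tuple

def career_is_consistent(events: Iterable[Tuple[str, str]]) -> bool:
    """events: time-ordered (event_type, contract_id) pairs."""
    active = set()  # contract ids currently open
    for event_type, contract_id in events:
        if event_type == "start":
            # Rule 1: a contract must not be opened twice.
            if contract_id in active:
                return False
            active.add(contract_id)
        elif event_type == "cessation":
            # Rule 2: a cessation must close a previously opened contract.
            if contract_id not in active:
                return False
            active.remove(contract_id)
    return True

# A start followed by its cessation is consistent ...
assert career_is_consistent([("start", "c1"), ("cessation", "c1")])
# ... while a cessation with no matching start is flagged as inconsistent.
assert not career_is_consistent([("cessation", "c9")])
```

Running such a check on both the source dataset and the cleansed one, as the abstract suggests, makes it possible to measure how many inconsistent careers the ETL step actually repaired.
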
Type: paper
Keywords: Data Cleansing, Data Quality, Administrative Databases, Decision Making
Language: English
Conference: 13th European Conference on eGovernment, ECEG 2013, 13-14 June 2013
Editors: Castelnovo, W.; Ferrari, E.
Published in: Proceedings of the 13th European Conference on e-Government
ISBN: 978-190950722-7
Year: 2013
Pages: 85-93
Citation: Boselli, R., Cesarini, M., Mercorio, F., Mezzanzanica, M. (2013). Improving Data Cleansing Techniques on Administrative Databases. In Proceedings of the 13th European Conference on e-Government (pp. 85-93). Academic Conferences and Publishing International.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/48712
Citations
  • Scopus: 3
  • Web of Science (ISI): 0