Profiling Linked Data

Spahiu, B

The increasing diffusion of Linked Data (LD) as a standard way to publish and structure data on the Web has recently received growing attention from researchers and data publishers. LD adoption is reflected in different domains such as government, media, and life science, building a powerful Web available to anyone. Despite the high number of datasets published as LD, they remain underused because they lack comprehensive metadata. Data consumers need information about a dataset's content in a fast and summarized form to decide whether it is useful for the use case at hand. Data profiling techniques offer an efficient solution to this problem, as they generate metadata and statistics that describe the content of a dataset. Existing profiling techniques do not cover a wide range of use cases, and many challenges due to the heterogeneous nature of Linked Data are still to be overcome. This thesis presents doctoral research that tackles the problems related to profiling Linked Data. Even though data profiling is an umbrella term for diverse descriptive information about a dataset, this thesis covers three aspects of profiling: topic-based, schema-based, and linkage-based. The profile provided in this thesis is fundamental for the decision-making process and is a basic requirement for understanding a dataset. We present an approach to automatically classify datasets into one of the topical categories used in the LD cloud, and we also investigate the problem of multi-topic profiling. For schema-based profiling, we propose a schema-based summarization approach that provides an overview of the relations in the data. Our summaries are concise yet informative enough to summarize the whole dataset. Moreover, they reveal quality issues and can help users in query formulation tasks. Many datasets in the LD cloud contain similar information for the same entity. To fully exploit its potential, LD should make this information explicit. Linkage profiling provides information about the number of equivalent entities between datasets and reveals possible errors. The profiling techniques developed during this work are automatic and can be applied to different datasets independently of the domain.

Despite the large amount of data published as LD, its use has not yet reached its potential because comprehensive metadata is lacking. Data consumers need to obtain information about datasets in a fast and condensed form in order to decide whether they are useful for their problem. Data profiling techniques offer an effective solution to this problem, as they are used to generate metadata and statistics that describe the content of datasets. This thesis presents research that addresses the problems related to profiling Linked Data. Although the term data profiling is used generically for diverse information that describes datasets, this thesis covers three aspects of profiling: topic-based, schema-based, and linkage-based. The profile proposed in this thesis is fundamental for the decision-making process and is the basic requirement for understanding datasets. In this thesis we present an approach to automatically classify datasets into one of the categories used in the LD world. In addition, we investigate the problem of multi-topic profiling. For schema-based profiling, we propose a schema-based summarization approach that provides an overview of the relations in the data. Our summaries are concise and clear enough to summarize the entire dataset. Moreover, they reveal quality issues and can help users in query formulation tasks. Many datasets in the LD cloud contain similar information for the same entity. To fully exploit its potential, LD needs to make this information explicit. Linkage profiling provides information about the number of equivalent entities between datasets and reveals possible errors. The profiling techniques developed during this work are automatic and can be applied to different datasets independently of the domain.

Spahiu, B. (2017). Profiling Linked Data. (Doctoral thesis, Università degli Studi di Milano-Bicocca, 2017).