Semantic Enrichment of Tabular Data with Machine Learning Techniques

Avogadro, R

Semantic Table Interpretation (STI) is one of the most widely used methods for identifying entities in tabular data. This work describes a methodology for performing entity linking (EL) at scale with machine learning techniques, focusing on the challenges of handling very large volumes of data and on their solutions. The methodology builds on table-to-Knowledge Graph (KG) matching, a key step in enriching and extending KGs from semi-structured data. The work also discusses the intricacies of STI and EL, along with the emerging landscape of Large Language Models (LLMs) and their potential application to both tasks. It further demonstrates the impact of using distinct Knowledge Graphs, such as Wikidata and Wikipedia, for EL with ChatGPT 4, and explores the role of Human-in-the-Loop (HITL) techniques in improving model performance. Finally, it lays the groundwork for a scalable approach that adapts the existing STI framework to large-scale data scenarios, improving its operational efficiency and applicability.

Avogadro, R. (2024). Semantic Enrichment of Tabular Data with Machine Learning Techniques. (Doctoral thesis, Università degli Studi di Milano-Bicocca, 2024).