Towards Adaptation of Named Entity Recognition and Linking Frameworks

Manchanda, P

Natural language processing and knowledge base experts are actively involved in extracting structured information from the unstructured Web in order to realize the Semantic Web vision. Diverse forms of unstructured information are readily available to researchers today, in real time, from social media platforms such as Twitter and Facebook. The widespread use of such platforms has led to a continuous stream of evolving information accompanied by constant noise and ambiguity, which makes extracting structured information difficult. An essential step is therefore to identify the information that is relevant for knowledge base enrichment. As a result, research on information extraction and natural language processing frameworks has increased significantly over the past decade, with Named Entity Extraction and Linking (NEEL) frameworks among the most prevalent. Numerous NEEL frameworks exist today, but most were built for commercial purposes. Orchestrating the components of a NEEL framework, i.e., the named entity recognition, named entity disambiguation, and named entity linking components, is particularly difficult for microblogging platforms such as Twitter and Facebook because of the type of text involved. Consequently, there is little research on the use and improvement of such components towards a more robust framework that can adapt to emerging information in real time. This thesis discusses the challenges that conventional NEEL frameworks face with textual formats such as tweets and investigates several approaches to improve the performance of the individual components and of the NEEL framework as a whole.
A key hypothesis is that the performance of such a framework degrades when dealing with social media platforms, and that if one component can be used to improve the performance of another, the overall performance can be improved as well. To this end, this thesis investigates supervised and unsupervised techniques, which prove effective in increasing the overall accuracy of the framework on noisy and ambiguous text from the microblogging platform Twitter.

The extraction of structured information from the "unstructured web" has attracted considerable interest from the scientific communities working on natural language processing and knowledge-based systems, with the aim of fully realizing the vision of the "semantic web". In the modern era, the pervasive and widespread use of social networks has produced a continuous stream of information on platforms such as Twitter or Facebook, also known as microblogging platforms. These information sources, accessible in real time, produce information characterized by constant noise and linguistic ambiguity, which makes the task of extracting structured information particularly difficult. Such extraction is nevertheless crucial for enriching large knowledge bases, now used in many industrial and research applications, with new and relevant information. As a result, over the last decade research efforts in natural language processing for information extraction from microblogging platforms have increased significantly, with particular attention to the extraction and identification of named entities (also called Named Entity Extraction and Linking, or NEEL). Numerous NEEL systems exist today, but most of them were created for commercial purposes. Calibrating the components of a NEEL system, that is, the components for the detection, disambiguation, and identification of named entities, is difficult for microblogging platforms such as Twitter and Facebook, in particular because of the types of text involved.
Systematic research approaches to guide the use and improvement of such components are lacking; such approaches are needed to build more robust systems that adapt better to the emergence of new information and new interests (for example, extracting entity types that differ from those considered in the past). This thesis discusses the challenges that traditional NEEL systems face when confronted with text formats such as tweets, and explores various approaches to improve the performance of the individual components and of a NEEL system as a whole. The key hypothesis of this thesis is that robust systems can be built by reusing existing components wherever possible, and that the performance of a system as a whole can be improved by developing feedback mechanisms through which some components are used to improve the performance of others. To this end, this thesis investigates supervised and unsupervised techniques that proved effective in increasing the accuracy of a system as a whole, through feedback mechanisms and adaptation to new domains, for ambiguous and noisy text formats from the microblogging platform Twitter.

Manchanda, P. (2017). Towards Adaptation of Named Entity Recognition and Linking Frameworks. (Doctoral thesis, Università degli Studi di Milano-Bicocca, 2017).