
Rossetti, M., ARCELLI FONTANA, F., Pareschi, R., Stella, F. (2014). Integrating Concepts and Knowledge in Large Content Networks. NEW GENERATION COMPUTING, 32(3-4), 309-330 [10.1007/s00354-014-0407-4].

Integrating Concepts and Knowledge in Large Content Networks

ROSSETTI, MARCO; ARCELLI FONTANA, FRANCESCA; STELLA, FABIO ANTONIO
2014

Abstract

In recent years there has been a widespread effort in the computer science community to extend the vast information existing on the Internet with “semantic structure”, in the form of ontologies and their associated inference systems, which can be exploited to support complex processes and services carried out by dedicated software agents, as well as the structured querying of information as in database management systems. The Semantic Web [2] and its evolution into the Web of Data [3] represent the most noticeable contributions to this trend. At the core of these approaches is the idea of replacing the simple Web publishing that has characterized the growth of the WWW since its inception with a more sophisticated form of publishing where, besides the normal content, ontologies and other semantic information are added as meta-content. The theoretical merits of this enterprise notwithstanding, its practical implementation still meets considerable hurdles, as identified, for instance, by Halevy et al. [4]: namely, the cost of ontology writing and maintenance, the technical challenges of Semantic Web publishing, the possible unwillingness of the communities involved in the evolution of an ontology to cooperate in that evolution, and the lack of resilience of standard inference systems in coping with mistaken premises and incorrect data. Our purpose here is to fulfill the objective of a Web-based information space that becomes more usable and meaningful as a consequence of making its underlying semantic structure explicit, but we do so by reversing the direction taken by frameworks such as the Semantic Web and the Web of Data. Indeed, we stick to the general criterion, proposed and argued for by Halevy et al. [4], according to which semantic interpretation, based on techniques derived from artificial intelligence and statistical inference, is easy compared to the daunting task of semantic publishing.
Thus, we make a kind of semantic structure dynamically emerge from the available unstructured information, rather than superimposing it in a pre-defined format. In this we also follow the lines of the program described in Arcelli Fontana et al. [1], which aims at generating and discovering concepts on content-based networks, such as the Web, by abstracting and integrating information existing within independent units (such as Web pages, documents and other kinds of textual objects) distributed across a content network. The main tools that we use to implement our approach are taken from the area of Probabilistic Topic Modeling [6]. Techniques from this area provide a sound background for developing an effective methodology for extracting concepts in the form of topics from existing information networks and then connecting them into new networks that highlight the semantic structure and relationships of the information thus interpreted. Indeed, the two main contributions of our methodology are in the construction of two networks, overlaid one on the other:
- a higher-level topic-topic network, where topics are related in terms of their semantic proximity;
- a finer-grained object-object network, where the textual objects that make up the topics and account for the relationships among them are analyzed in terms of their specific relationships, with links that cross topic boundaries, again using semantic proximity as the criterion for establishing object-object links.
This methodology has been tested on “knowledge sets” obtained from the public YAGO2 knowledge base with the following approach: starting from specific thematic sectors of the knowledge base, we have created a number of corresponding knowledge kernels in terms of the Wikipedia pages associated with the chosen sectors, and we have then expanded such kernels by following their Web hyperlinks until suitable sets for experimentation were created.
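The abstract does not specify how semantic proximity is computed; a minimal sketch, assuming topics are represented as topic-word probability vectors (as produced by a topic model such as LDA) and that proximity is measured by cosine similarity with a fixed threshold, could look as follows. The function names (`cosine`, `proximity_network`), the toy topic vectors, and the threshold are all illustrative assumptions, not the paper's actual method or data:

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-negative vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy topic-word distributions over a 4-word vocabulary (rows sum to 1),
# standing in for the output of a probabilistic topic model.
topics = {
    "transport": [0.5, 0.3, 0.1, 0.1],
    "terrorism": [0.4, 0.3, 0.2, 0.1],
    "economy":   [0.1, 0.1, 0.4, 0.4],
}

def proximity_network(items, threshold):
    """Link every pair of items whose semantic proximity exceeds the threshold."""
    names = list(items)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            sim = cosine(items[a], items[b])
            if sim >= threshold:
                edges.append((a, b, round(sim, 3)))
    return edges

print(proximity_network(topics, 0.8))
# → [('transport', 'terrorism', 0.974)]
```

The same `proximity_network` function would apply unchanged at the finer object-object level, with each textual object represented by its document-topic (or term-weight) vector instead of a topic-word vector.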
In this way we have been able to carry out experiments, described here, on knowledge sets for “Green Economy”, “Touchscreen Mobile Phones”, “City Cars” and “Terrorism”. From the practical point of view of exploring and analyzing the various knowledge sets, this has had the result of highlighting, for instance, the diverse relationships between the domain of terrorism and the domain of public transportation networks, where many terrorist attacks have been, more or less successfully, carried out. Clearly, aside from their conceptual clustering through the creation of topic-topic networks, the effect of creating object-object networks on the Web pages that constitute the various knowledge sets is to connect, through “semantic links”, Web pages that were previously unlinked by the existing hyperlinks. Given that adding links on a semantic basis to formerly unlinked Web objects is also one of the goals of the Web of Data, we show how our approach might contribute for all practical purposes to attaining this goal, even if through different methodological and technical premises. We also highlight differences and relationships with work on community detection in complex networks and systems, as exemplified by the well-known Newman-Girvan algorithm [5].
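For reference, the Newman-Girvan algorithm mentioned above detects communities by repeatedly removing the edge with the highest shortest-path betweenness. A self-contained sketch of one such removal step, using Brandes-style betweenness accumulation on a toy graph (two triangles joined by a bridge), is shown below; the graph and function names are illustrative and not taken from the paper:

```python
from collections import deque, defaultdict

def edge_betweenness(adj):
    """Shortest-path edge betweenness, accumulated Brandes-style over all sources."""
    bet = defaultdict(float)
    for s in adj:
        dist = {s: 0}                       # BFS distances from s
        sigma = defaultdict(float)          # number of shortest paths from s
        sigma[s] = 1.0
        preds = defaultdict(list)           # shortest-path predecessors
        order = []
        q = deque([s])
        while q:
            v = q.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = defaultdict(float)          # dependency accumulation, leaves first
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1 + delta[w])
                bet[frozenset((v, w))] += c
                delta[v] += c
    return bet  # each undirected edge is counted from both endpoints

def girvan_newman_step(adj):
    """One Girvan-Newman iteration: remove the highest-betweenness edge."""
    bet = edge_betweenness(adj)
    u, v = max(bet, key=bet.get)
    adj[u].remove(v)
    adj[v].remove(u)
    return u, v

# Two triangles joined by a bridge: the bridge carries all cross-community
# shortest paths, so it is the first edge removed.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(sorted(girvan_newman_step(graph)))  # → [2, 3]
```

Iterating this step until the graph splits into disconnected components yields the community structure; the contrast the paper draws is that such communities arise purely from link topology, whereas the topic-topic and object-object networks are built from semantic proximity.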
Journal article - Scientific article
Topic models; complex networks
English
2014
32
3-4
309
330
none


Use this identifier to cite or link to this document: https://hdl.handle.net/10281/51508
Citations
  • Scopus 4
  • Web of Science (ISI) 3