Enhanced XML Retrieval with Flexible Constraints Evaluation

Panzeri, E

Since its standardization by the World Wide Web Consortium (W3C) in 1998, XML (eXtensible Markup Language) has been acknowledged as the de facto standard format for data, and it is employed by a wide and increasing number of application domains. XML allows data and textual contents to be structured; the structural elements are specified in plain text, using strings of characters that can be easily read by computer programs while remaining human-readable. XPath and XQuery are the two main standard languages defined to query XML data; they make it possible to select a subset of elements from an XML document, to manipulate its contents, and to restructure the document tree. Both XPath and XQuery are based on a database perspective of XML documents, where query clauses are evaluated as in the database query language SQL, from which both languages took inspiration. The data-centric perspective adopted by XQuery and XPath has recently been extended by an Information Retrieval (IR) oriented approach, in which a new set of content-based constraints allows IR-style full-text search with the computation of an element relevance score. This extension is called XQuery/XPath Full-Text and has been standardized by the W3C. In the Information Retrieval community, other approaches have appeared that take the document structure into account and propose a set of approximate structural matching techniques, where the standard XQuery and XPath structural constraints are evaluated by path relaxation algorithms. Such approaches, however, do not let the user express vague structural constraints whose approximate evaluation produces a set of weighted fragments, where the weight expresses the relevance of the fragment with respect to the structural constraints.
This thesis describes the definition and the implementation of a formal XQuery Full-Text extension named FleXy, which takes the user perspective into account in the formulation of structure-based constraints by allowing vagueness to be associated with their specification. FleXy has been designed as an extension of the XQuery Full-Text language so that it inherits both the full-text search features of the Full-Text extension and the standard element selection provided by XQuery. The evaluation of the two new vague structural constraints defined in the FleXy language, named Below and Near, produces a set of weighted elements, where a structural score is computed by taking into account the distance between the target element required by the user and the element actually retrieved. Threshold variants of the Below and Near constraints have also been defined, which make it possible to specify the extent to which the vague structural constraints are applied. The formal definition of the FleXy language is provided here through its syntax, its semantics, and the algorithms that define the Below and Near axes. The language has been implemented on top of an open source XQuery engine named BaseX, a fully featured XQuery and XPath engine with complete adherence to the Full-Text language specification. Performance evaluations were then carried out to compare the FleXy constraints with their standard XQuery counterparts, when available. Finally, a patent search application has been developed by leveraging the FleXy implementation on top of the BaseX engine: the XML structure of the US Patent Collection (USPTO) has been exploited in conjunction with the textual contents of the patents to help non-expert users retrieve relevant patents effectively, also by offering a result categorization strategy.
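The idea of a distance-based structural score can be illustrated with a minimal Python sketch. This is not the FleXy algorithm: the abstract does not give the scoring formula, so the function name `below`, the 1/distance decay, and the `max_distance` parameter (standing in for a threshold variant) are assumptions made only for illustration, and the sample document is invented.

```python
import xml.etree.ElementTree as ET


def below(context, tag, max_distance=None):
    """Return (element, score) pairs for descendants of `context` named `tag`.

    Hypothetical scoring, not taken from the thesis: a direct child
    (distance 1) scores 1.0 and the score decays as 1/distance.
    `max_distance` mimics a threshold variant by cutting off deeper matches.
    """
    results = []
    # Breadth-first traversal, tracking the distance from the context node.
    frontier = [(child, 1) for child in context]
    while frontier:
        element, distance = frontier.pop(0)
        if max_distance is not None and distance > max_distance:
            continue
        if element.tag == tag:
            results.append((element, 1.0 / distance))
        frontier.extend((child, distance + 1) for child in element)
    # Rank the weighted elements, highest structural score first.
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results


# Invented sample document: one title is a child of the root, one is nested.
doc = ET.fromstring(
    "<patent>"
    "<title>Widget</title>"
    "<description><section><title>Background</title></section></description>"
    "</patent>"
)

ranked = [(el.text, score) for el, score in below(doc, "title")]
for text, score in ranked:
    print(f"{text}: {score:.2f}")
```

In this sketch the title that is a direct child of the context node ranks above the one nested three levels down, whereas a strict child-axis match would return only the former and a descendant-axis match would return both without distinguishing them. Calling `below(doc, "title", max_distance=2)` drops the nested title entirely.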

Panzeri, E. (2014). Enhanced XML Retrieval with Flexible Constraints Evaluation (Doctoral thesis, Università degli Studi di Milano-Bicocca, 2014).