One of the most important tasks in Information Retrieval (IR) is the extraction and processing of web page information. A common approach is to treat a web page as an atomic unit and to model its textual content as a "bag-of-words". However, this kind of representation does not reflect how people perceive a web page. A granular document representation, in terms of semantic objects, can help to identify the semantic areas of a web page and to use them for different IR goals. In this paper we use a granular representation to define a new metric for evaluating the importance of semantic objects and to enhance the performance of IR systems. In particular, we show that this new metric can be used not only for classification, in which instances are assumed to be independent and identically distributed, but also to gauge the strength of the relationships between hypertextual documents and to exploit this information to improve page ranking performance.
Fersini, E., Messina, V., Archetti, F. (2008). Granular modeling of web document: impact on information retrieval systems. In Proceeding of the 10th ACM workshop on Web information and data management - WIDM 2008 (pp. 111-124). Napa Valley, California, USA: ACM [10.1145/1458502.1458520].
Granular modeling of web document: impact on information retrieval systems
FERSINI, ELISABETTA; MESSINA, VINCENZINA; ARCHETTI, FRANCESCO ANTONIO
2008