We present a term weighting approach for improving web page classification, based on the assumption that the images of a web page are the elements that mainly attract the user's attention. This assumption implies that the text contained in the visual block in which an image is located, called an image-block, should contain significant information about the page contents. In this paper we propose a new metric, called the Inverse Term Importance Metric, aimed at assigning higher weights to important terms contained in important image-blocks, which are identified by performing a visual layout analysis. We propose different methods to estimate the importance of visual image-blocks, so as to smooth the term weight according to the importance of the blocks in which the term is located. The traditional TFxIDF model is modified accordingly and used in the classification task. The effectiveness of this new metric and of the proposed block evaluation methods has been validated using different classification algorithms.

Fersini, E., Messina, V., & Archetti, F. (2008). Enhancing web page classification through image-block importance analysis. INFORMATION PROCESSING & MANAGEMENT, 44(4), 1431-1447 [10.1016/j.ipm.2007.11.003].

Enhancing web page classification through image-block importance analysis

Fersini, Elisabetta; Messina, Vincenzina; Archetti, Francesco Antonio
2008

Journal article - Scientific article
term weighting; vector space model; visual layout analysis; document classification
English
pp. 1431-1447

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/4241