In this contribution we propose a hierarchical fuzzy clustering algorithm for dynamically supporting information filtering. The idea is that document filtering can draw advantages from a dynamic hierarchical fuzzy clustering of the documents into overlapping topic categories corresponding with different levels of granularity of the categorisation. Users can have either general interests or specific ones depending on their profile and thus they must be feed with documents belonging to the categories of interest that can correspond with either a high level topic, such as sport news, or a subtopics, such as football news, or even a very specific topics such as football matches of their favourite team. The hierarchical structure of the automatically identified clusters is built so that each level corresponds with a distinct level of overlapping of the clusters in it, so that in climbing the hierarchy this value increases since the topics represented in the upper levels are more general, i.e., fuzzier. The hierarchy of fuzzy clusters is used to support the filtering criteria that are personalized based on user profiles. Since a filter monitors one or more continuously feed document streams, the clustering must be able both to generate a fuzzy hierarchical classification of a collection of documents and to update the hierarchy of existing categories by either including newly found documents or detecting new categories when such new documents have contents that are different from those represented by the existing clusters. The fuzzy clustering algorithm is based on a generalization of the fuzzy C-means algorithm that is iteratively applied to each hierarchical level to identify clusters of the higher level. In order to apply this algorithm in document filtering it has been extended so as to use a cosine similarity instead of the usual Euclidean distance, and to automatically estimate the number of the clusters to detect at each hierarchical level. This number is identified based either on an explicit input that specifies the minimum percentage of common index terms that the clusters of the level can share (that is equivalent to indicate a tolerance for overlapping between the topics dealt with in each fuzzy cluster) or on a statistical analysis of the cumulative curve of overlapping degrees between all pairs of clusters of the level. This way the problem of application of the fuzzy C means that requires the specification of the desired number of the clusters is overcome. © 2006 Springer-Verlag Berlin Heidelberg.

Bordogna, G., Pagani, M., Pasi, G. (2006). A dynamical Hierarchical fuzzy clustering algorithm for document filtering. In E. Herrera-Viedma, G. Pasi, F. Crestani (a cura di), Soft Computing in Web Information Retrieval (pp. 3-23). Springer [10.1007/3-540-31590-X_1].

A dynamical Hierarchical fuzzy clustering algorithm for document filtering

PASI, GABRIELLA
2006

Abstract

In this contribution we propose a hierarchical fuzzy clustering algorithm for dynamically supporting information filtering. The idea is that document filtering can draw advantages from a dynamic hierarchical fuzzy clustering of the documents into overlapping topic categories corresponding with different levels of granularity of the categorisation. Users can have either general interests or specific ones depending on their profile and thus they must be feed with documents belonging to the categories of interest that can correspond with either a high level topic, such as sport news, or a subtopics, such as football news, or even a very specific topics such as football matches of their favourite team. The hierarchical structure of the automatically identified clusters is built so that each level corresponds with a distinct level of overlapping of the clusters in it, so that in climbing the hierarchy this value increases since the topics represented in the upper levels are more general, i.e., fuzzier. The hierarchy of fuzzy clusters is used to support the filtering criteria that are personalized based on user profiles. Since a filter monitors one or more continuously feed document streams, the clustering must be able both to generate a fuzzy hierarchical classification of a collection of documents and to update the hierarchy of existing categories by either including newly found documents or detecting new categories when such new documents have contents that are different from those represented by the existing clusters. The fuzzy clustering algorithm is based on a generalization of the fuzzy C-means algorithm that is iteratively applied to each hierarchical level to identify clusters of the higher level. In order to apply this algorithm in document filtering it has been extended so as to use a cosine similarity instead of the usual Euclidean distance, and to automatically estimate the number of the clusters to detect at each hierarchical level. This number is identified based either on an explicit input that specifies the minimum percentage of common index terms that the clusters of the level can share (that is equivalent to indicate a tolerance for overlapping between the topics dealt with in each fuzzy cluster) or on a statistical analysis of the cumulative curve of overlapping degrees between all pairs of clusters of the level. This way the problem of application of the fuzzy C means that requires the specification of the desired number of the clusters is overcome. © 2006 Springer-Verlag Berlin Heidelberg.
Capitolo o saggio
dynamical, hierarchical, fuzzy, clustering, algorithm, document, filtering
English
Soft Computing in Web Information Retrieval
Herrera-Viedma, E; Pasi, G; Crestani, F
2006
9783540315889
197
Springer
3
23
Bordogna, G., Pagani, M., Pasi, G. (2006). A dynamical Hierarchical fuzzy clustering algorithm for document filtering. In E. Herrera-Viedma, G. Pasi, F. Crestani (a cura di), Soft Computing in Web Information Retrieval (pp. 3-23). Springer [10.1007/3-540-31590-X_1].
none
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/10281/20813
Citazioni
  • Scopus 15
  • ???jsp.display-item.citation.isi??? ND
Social impact