Bicocca Open Archive

Bordogna, G., Pagani, M., Pasi, G. (2006). A dynamical Hierarchical fuzzy clustering algorithm for document filtering. In E. Herrera-Viedma, G. Pasi, F. Crestani (Eds.), Soft Computing in Web Information Retrieval (pp. 3-23). Springer [10.1007/3-540-31590-X_1].

A dynamical Hierarchical fuzzy clustering algorithm for document filtering

Bordogna, G; Pagani, M; Pasi, Gabriella

2006

Abstract

In this contribution we propose a hierarchical fuzzy clustering algorithm that dynamically supports information filtering. The idea is that document filtering can benefit from a dynamic hierarchical fuzzy clustering of documents into overlapping topic categories at different levels of granularity. Users can have either general or specific interests, depending on their profile, and must therefore be fed documents belonging to their categories of interest, which can correspond to a high-level topic such as sports news, a subtopic such as football news, or even a very specific topic such as the football matches of their favourite team. The hierarchy of automatically identified clusters is built so that each level corresponds to a distinct degree of overlap among its clusters; this degree increases when climbing the hierarchy, since the topics represented at the upper levels are more general, i.e., fuzzier. The hierarchy of fuzzy clusters supports filtering criteria that are personalized on the basis of user profiles. Since a filter monitors one or more continuously fed document streams, the clustering must be able both to generate a fuzzy hierarchical classification of a collection of documents and to update the hierarchy of existing categories, either by including newly found documents or by detecting new categories when the contents of such documents differ from those represented by the existing clusters. The fuzzy clustering algorithm is based on a generalization of the fuzzy C-means algorithm that is applied iteratively at each hierarchical level to identify the clusters of the next higher level. To apply this algorithm to document filtering, it has been extended to use cosine similarity instead of the usual Euclidean distance and to automatically estimate the number of clusters to detect at each hierarchical level. This number is determined either from an explicit input that specifies the minimum percentage of common index terms that the clusters of a level can share (which is equivalent to indicating a tolerance for overlap between the topics dealt with in each fuzzy cluster) or from a statistical analysis of the cumulative curve of overlapping degrees between all pairs of clusters of the level. This overcomes the requirement of fuzzy C-means that the desired number of clusters be specified in advance. © 2006 Springer-Verlag Berlin Heidelberg.
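
As a rough illustration of one ingredient named in the abstract, fuzzy C-means with cosine similarity in place of Euclidean distance, the following is a minimal single-level sketch. It is not the authors' algorithm: the function name `cosine_fuzzy_c_means`, the fuzzifier `m`, the farthest-first initialisation, and the stopping tolerance are all assumptions made for this example, and the number of clusters is passed in by hand rather than estimated from cluster overlap as the chapter describes.

```python
import numpy as np

def cosine_fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-6):
    """Generic fuzzy C-means on unit-normalised rows (illustrative only).

    X is a documents-by-terms matrix with no all-zero rows.
    Returns (U, centers): U[i, j] is the membership of document i in cluster j.
    """
    # Normalise rows so that a dot product equals cosine similarity.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)

    # Assumed initialisation (not from the source): start with document 0,
    # then repeatedly add the document least similar to those already chosen.
    chosen = [0]
    while len(chosen) < n_clusters:
        sim_to_chosen = (X @ X[chosen].T).max(axis=1)
        chosen.append(int(sim_to_chosen.argmin()))
    centers = X[chosen]

    U = None
    for _ in range(n_iter):
        # Cosine dissimilarity, floored to avoid division by zero.
        D = np.maximum(1.0 - X @ centers.T, 1e-12)
        # Membership update: u_ij proportional to d_ij^(-1/(m-1)).
        inv = D ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        # Centroid update: membership-weighted mean, projected to unit length.
        centers = (U_new ** m).T @ X
        centers /= np.linalg.norm(centers, axis=1, keepdims=True)
        # Stop once memberships no longer change noticeably.
        converged = U is not None and np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, centers
```

Each row of the returned membership matrix sums to one, so a document that mixes two topics receives a partial membership in both clusters, which is the kind of overlap the abstract refers to.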

Subtype: Chapter or essay
Keywords: dynamical, hierarchical, fuzzy, clustering, algorithm, document, filtering
Content language: English
Volume title: Soft Computing in Web Information Retrieval
Volume editors: Herrera-Viedma, E; Pasi, G; Crestani, F
Publication date: 2006
Volume ISBN: 9783540315889
Series: Studies in Fuzziness and Soft Computing
Volume number: 197
Publisher: Springer
First page: 3
Last page: 23
Contribution DOI: https://dx.doi.org/10.1007/3-540-31590-X_1
Full text: none
Appears in types: 03 - Book contribution

Files in this item:

There are no files associated with this item.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/20813
