In data mining, neighborhood classifiers are valid not only for numeric data but also for symbolic data. The key issue for a neighborhood classifier is how to measure the similarity between two instances. In this paper, we compare six similarity measures for symbolic data, namely Overlap, Eskin, occurrence frequency (OF), inverse OF (IOF), Goodall3, and Goodall4, under the framework of a covering-based neighborhood classifier. In the training stage, a covering of the universe is built based on the given similarity measure. A covering reduction algorithm is then used to remove some of the covering blocks and determine the representatives. In the testing stage, the similarities between each unlabeled instance and the representatives are computed. The closest representative, or a few of the closest representatives, determine the predicted class label of the unlabeled instance. We compared the six similarity measures in experiments on 15 University of California, Irvine (UCI) datasets. The results demonstrate that although no measure dominated the others in all scenarios, some measures had consistently high performance. The covering-based neighborhood classifier with appropriate similarity measures, such as Overlap, IOF, and OF, performed better than the ID3, C4.5, and Naïve Bayes classifiers.
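As a rough illustration of the testing stage described above, the following Python sketch computes the Overlap similarity (taken here as the fraction of attributes on which two symbolic instances agree) and assigns an unlabeled instance the label of its most similar representative. The function names, the toy data, and the hand-picked representatives are assumptions made for illustration only; the covering construction and covering reduction steps of the paper are not reproduced.

```python
from typing import Hashable, List, Sequence, Tuple

Instance = Sequence[Hashable]


def overlap_similarity(x: Instance, y: Instance) -> float:
    """Fraction of attributes on which two symbolic instances agree."""
    if len(x) != len(y):
        raise ValueError("instances must have the same number of attributes")
    if len(x) == 0:
        raise ValueError("instances must have at least one attribute")
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)


def predict(instance: Instance,
            representatives: List[Tuple[Instance, str]]) -> str:
    """Return the label of the most similar representative.

    `representatives` is a hypothetical list of (instance, label) pairs
    standing in for the output of the covering reduction step.
    Ties are resolved in favor of the earliest representative.
    """
    if not representatives:
        raise ValueError("at least one representative is required")
    best_label = representatives[0][1]
    best_sim = -1.0
    for rep, label in representatives:
        sim = overlap_similarity(instance, rep)
        if sim > best_sim:
            best_sim, best_label = sim, label
    return best_label


# Toy representatives over three symbolic attributes (illustrative only).
representatives = [
    (("sunny", "hot", "high"), "no"),
    (("rainy", "mild", "normal"), "yes"),
]

print(overlap_similarity(("sunny", "hot", "high"), ("sunny", "mild", "high")))
print(predict(("rainy", "cool", "normal"), representatives))
```

Swapping in another measure only requires replacing `overlap_similarity` with a function of the same shape. Voting among a few of the closest representatives, which the abstract also mentions, would replace the single best match with a majority vote over the top matches.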
Liu, F., Zhang, B., Ciucci, D., Wu, W., Min, F. (2018). A comparison study of similarity measures for covering-based neighborhood classifiers. Information Sciences, 448-449, 1-17. doi:10.1016/j.ins.2018.03.030.
A comparison study of similarity measures for covering-based neighborhood classifiers
Ciucci, Davide
2018
File | Description | Attachment type | License | Access | Size | Format
---|---|---|---|---|---|---
Liu-2018-Informat Sci-VoR.pdf | Research Article | Publisher’s Version (Version of Record, VoR) | All rights reserved | Repository managers only | 1.78 MB | Adobe PDF
Liu-2018-Informat Sci-AAM.pdf | Research Article | Author’s Accepted Manuscript, AAM (Post-print) | Creative Commons | Open access | 990.44 kB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.