Merigo, E., Anfossi, A., Chicco, D. (2026). The Gap Statistic can be misleading when used to evaluate near box shaped clusters in the Euclidean space. DISCOVER ARTIFICIAL INTELLIGENCE, 6(1) [10.1007/s44163-026-01195-2].

The Gap Statistic can be misleading when used to evaluate near box shaped clusters in the Euclidean space

Chicco D.
Last
2026

Abstract

The Gap Statistic is a metric designed for the internal assessment of clustering results. Despite its popularity, we noticed unexpected behaviors of this coefficient in some specific contexts. We therefore designed this study to understand why the Gap Statistic can take on negative values and under what circumstances this occurs. To this end, we introduce the concept of cages (box-shaped rectangular clusters in the Euclidean space) and calculate the Gap Statistic on the results obtained by applying k-Means to them, using the open source R programming language. We provide a mathematical explanation of how rectangular clusters were used and the reasoning behind their choice, starting from the original formula for the Gap Statistic. The results we obtained are inconsistent with the usual interpretation of the Gap Statistic, according to which a negative value indicates overlapping clusters or the presence of outliers around them. In contrast, we built well-separated groupings with no data points in between and still obtained a negative value for this metric. Considering these results, the Gap Statistic cannot be considered a reliable and standard assessment score for clustering experiments, because in some specific cases its value alone does not provide a clear and universal understanding of the data distribution. In fact, negative values of the statistic may arise for well-separated rectangular clusters that are close to each other, reflecting sensitivity to the geometry of the reference distribution. We therefore advise readers not to trust the Gap Statistic when it yields negative values, and not to employ it alone but rather alongside more reliable metrics, such as the Silhouette coefficient, the Davies-Bouldin index, and the DBCV index.
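As a rough illustration of the quantity discussed in the abstract, the sketch below computes a Gap Statistic for two box-shaped clusters. It assumes the commonly cited definition, Gap(k) = mean over B reference datasets of log(W*_k) minus log(W_k), where W_k is the within-cluster sum of squared distances to the centroids and the reference datasets are drawn uniformly over the bounding box of the data. The abstract states that the study used R; this sketch is in Python, is not the authors' code, and does not attempt to reproduce their experiments or their negative values. All function names, box coordinates, sample sizes, and seeds are hypothetical choices made for the example.

```python
import math
import random


def within_dispersion(points, labels, k):
    """W_k: sum over clusters of squared distances to the cluster centroid."""
    total = 0.0
    for r in range(k):
        members = [p for p, lab in zip(points, labels) if lab == r]
        if not members:
            continue  # an empty cluster contributes nothing
        centroid = [sum(c) / len(members) for c in zip(*members)]
        total += sum(
            sum((x - m) ** 2 for x, m in zip(p, centroid)) for p in members
        )
    return total


def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's algorithm with random initial centers; returns labels."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        labels = [
            min(range(k), key=lambda r: math.dist(p, centers[r])) for p in points
        ]
        # Update step: move each center to the mean of its members.
        for r in range(k):
            members = [p for p, lab in zip(points, labels) if lab == r]
            if members:
                centers[r] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def gap_statistic(points, k, n_refs=10, seed=0):
    """Gap(k) with uniform reference data over the bounding box (assumed)."""
    rng = random.Random(seed)
    log_wk = math.log(within_dispersion(points, kmeans(points, k, rng), k))
    lows = [min(c) for c in zip(*points)]
    highs = [max(c) for c in zip(*points)]
    ref_logs = []
    for _ in range(n_refs):
        ref = [
            tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
            for _ in points
        ]
        ref_logs.append(math.log(within_dispersion(ref, kmeans(ref, k, rng), k)))
    return sum(ref_logs) / n_refs - log_wk


def sample_box(rng, x_range, y_range, n):
    """Draw n points uniformly inside an axis-aligned rectangle (a 'cage')."""
    return [(rng.uniform(*x_range), rng.uniform(*y_range)) for _ in range(n)]


# Two hypothetical cages: separated by a small empty strip, with no overlap.
data_rng = random.Random(42)
cages = sample_box(data_rng, (0.0, 1.0), (0.0, 1.0), 100) + sample_box(
    data_rng, (1.2, 2.2), (0.0, 1.0), 100
)
gap_k2 = gap_statistic(cages, k=2)
print(f"Gap(k=2) = {gap_k2:.3f}")
```

Changing the box sizes, the width of the empty strip, or k gives a feel for how the statistic depends on the geometry of the bounding-box reference distribution. The sign and magnitude of the result depend on those choices and on the k-Means initialization.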
Journal article - Scientific article
Clustering; Clustering internal results evaluation; Gap Statistic; Internal clustering metrics; Unsupervised machine learning;
English
15-Apr-2026
2026
6
1
334
open
Files in this item:
Merigo et al-2026-Discov Artif Intell-VoR.pdf

open access

Attachment type: Publisher’s Version (Version of Record, VoR)
License: Creative Commons
Size: 1.76 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/604925