Merigo, E., Anfossi, A., Chicco, D. (2026). The Gap Statistic can be misleading when used to evaluate near box shaped clusters in the Euclidean space. Discover Artificial Intelligence, 6(1). doi:10.1007/s44163-026-01195-2.
The Gap Statistic can be misleading when used to evaluate near box shaped clusters in the Euclidean space
Chicco D.
Last author
2026
Abstract
The Gap Statistic is a metric designed and employed for the internal assessment of clustering results. Despite its popularity, we noticed a series of unexpected behaviors of this coefficient in some specific contexts. We therefore designed this study to understand why the Gap Statistic can take on negative values and under what circumstances this occurs. To this end, we introduce the concept of cages (box-shaped rectangular clusters in the Euclidean space) and calculate the Gap Statistic on the results obtained by applying k-Means to them, using the open-source R programming language. Starting from the original formula for the Gap Statistic, we provide a mathematical explanation of how rectangular clusters were used and the reasoning behind their choice. The results we obtained are inconsistent with the usual interpretation of the Gap Statistic, according to which a negative value indicates overlapping clusters or the presence of outliers around them. In contrast, we implemented well-separated groupings with no data points in between and still obtained a negative value for this metric. Considering these results, the Gap Statistic cannot be considered a reliable and standard assessment score for clustering experiments, because in some specific cases its value alone does not provide a clear and universal understanding of the data distribution. In fact, negative values of the statistic may arise for well-separated rectangular clusters that are close to each other, reflecting sensitivity to the geometry of the reference distribution. We therefore advise readers not to trust the Gap Statistic when it yields negative values, and not to employ it alone but rather alongside more reliable metrics, such as the Silhouette coefficient, the Davies-Bouldin index, and the DBCV index.

| File | Size | Format | |
|---|---|---|---|
| Merigo et al-2026-Discov Artif Intell-VoR.pdf (open access; attachment type: Publisher’s Version (Version of Record, VoR); license: Creative Commons) | 1.76 MB | Adobe PDF | View/Open |
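
The kind of experiment the abstract describes can be sketched roughly as follows. This is not the authors' code: the paper used R, while this sketch uses Python's standard library only, and every function name here (`kmeans_labels`, `within_dispersion`, `best_dispersion`, `gap_statistic`, `make_box`) is hypothetical. It assumes the usual form of the Gap Statistic, the mean of log W_k over reference datasets minus log W_k of the observed data, with reference data drawn uniformly from the bounding box of the observations. The box coordinates and sample sizes are arbitrary illustrative choices, not values from the paper.

```python
import math
import random


def kmeans_labels(points, k, rng, n_iter=30):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def within_dispersion(points, labels, k):
    """W_k: pooled within-cluster sum of squared distances to the cluster mean."""
    total = 0.0
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            continue
        mean = tuple(sum(c) / len(members) for c in zip(*members))
        total += sum(math.dist(p, mean) ** 2 for p in members)
    return total


def best_dispersion(points, k, rng, n_init=3):
    """Smallest W_k over a few random k-means restarts."""
    return min(within_dispersion(points, kmeans_labels(points, k, rng), k)
               for _ in range(n_init))


def gap_statistic(points, k, n_refs=10, seed=0):
    """Gap(k) = mean(log W_k of reference data) - log W_k of observed data.

    Assumption: reference datasets are uniform over the bounding box of
    the observed points, with the same number of points.
    """
    rng = random.Random(seed)
    log_wk = math.log(best_dispersion(points, k, rng))
    lows = [min(c) for c in zip(*points)]
    highs = [max(c) for c in zip(*points)]
    ref_logs = []
    for _ in range(n_refs):
        ref = [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
               for _ in points]
        ref_logs.append(math.log(best_dispersion(ref, k, rng)))
    return sum(ref_logs) / n_refs - log_wk


def make_box(rng, x_range, y_range, n):
    """Sample n points uniformly inside an axis-aligned rectangle."""
    return [(rng.uniform(*x_range), rng.uniform(*y_range)) for _ in range(n)]


# Two separated, box-shaped clusters placed close to each other
# (illustrative coordinates only).
rng = random.Random(42)
boxes = (make_box(rng, (0.0, 1.0), (0.0, 1.0), 100)
         + make_box(rng, (1.5, 2.5), (0.0, 1.0), 100))
gap_k2 = gap_statistic(boxes, k=2)
print(f"Gap(k=2) = {gap_k2:.3f}")
```

Changing the gap between the two boxes, their aspect ratio, or the number of reference datasets is a simple way to explore how the sign and size of the statistic respond to the geometry. The values this sketch prints should not be read as reproducing the paper's results.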
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


