The Gap Statistic can be misleading when used to evaluate near box shaped clusters in the Euclidean space

Merigo, E. M.; Anfossi, A.; Chicco, D.

doi:10.1007/s44163-026-01195-2

The Gap Statistic is a metric designed and employed for the internal assessment of clustering results. Despite its popularity, we noticed a series of unexpected behaviors of this coefficient in some specific contexts. We therefore designed this study to understand why the Gap Statistic can take on negative values and under what circumstances this occurs. To this end, we introduce the concept of cages (box-shaped rectangular clusters in the Euclidean space), and calculate the Gap Statistic on the results obtained by applying k-Means to them, using the R open source programming language. We provide a mathematical explanation of how rectangular clusters were used and the reasoning behind their choice, starting with the original formula for the Gap Statistic. The results we obtained are inconsistent with the usual interpretation of the Gap Statistic, according to which a negative value indicates overlapping clusters or the presence of outliers around them. In contrast, we built well-separated groupings with no data points in between and still obtained a negative value for this metric. Considering these results, the Gap Statistic cannot be considered a reliable and standard assessment score for clustering experiments, because in some specific cases its value alone does not provide a clear and universal understanding of the data distribution. In fact, negative values of the statistic may arise for well-separated rectangular clusters that are close to each other, reflecting sensitivity to the geometry of the reference distribution. We therefore advise readers not to trust the Gap Statistic when it yields negative values, and not to employ it alone but alongside more reliable metrics, such as the Silhouette coefficient, the Davies-Bouldin index, and the DBCV index.
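
For readers who want to see what the abstract describes in concrete terms, the following is a minimal sketch and not the authors' code. It is written in Python rather than the R used in the study, and every function name in it (`make_cage`, `kmeans`, `log_wk`, `gap_statistic`) is hypothetical. It assumes the standard definition Gap(k) = mean over reference sets of log(W_k) minus log(W_k) on the observed data, where W_k is the pooled within-cluster sum of squares, and it assumes reference sets drawn uniformly over the bounding box of the data. The cage sizes, spacing, and number of reference sets are illustrative choices, not values taken from the paper.

```python
import math
import random


def make_cage(x0, y0, width, height, n, rng):
    """Draw n points uniformly inside an axis-aligned rectangle (a 'cage')."""
    return [(rng.uniform(x0, x0 + width), rng.uniform(y0, y0 + height))
            for _ in range(n)]


def kmeans(points, k, rng, iters=25):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def log_wk(points, labels, k):
    """Log of the pooled within-cluster sum of squares W_k."""
    total = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            continue
        mean = [sum(x) / len(members) for x in zip(*members)]
        total += sum(math.dist(p, mean) ** 2 for p in members)
    return math.log(total)


def gap_statistic(points, k, n_refs=10, seed=0):
    """Gap(k): mean reference log(W_k) minus observed log(W_k)."""
    rng = random.Random(seed)
    observed = log_wk(points, kmeans(points, k, rng), k)
    lows = [min(x) for x in zip(*points)]
    highs = [max(x) for x in zip(*points)]
    ref_logs = []
    for _ in range(n_refs):
        # Reference set: uniform over the bounding box of the data (assumption).
        ref = [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
               for _ in points]
        ref_logs.append(log_wk(ref, kmeans(ref, k, rng), k))
    return sum(ref_logs) / n_refs - observed


# Two well-separated unit cages with no points in between (illustrative sizes).
rng = random.Random(42)
data = make_cage(0, 0, 1, 1, 100, rng) + make_cage(1.5, 0, 1, 1, 100, rng)
gaps = {k: gap_statistic(data, k) for k in (1, 2, 3)}
```

The sign and size of each entry in `gaps` depend on the cage geometry, the spacing between cages, and the reference sets, so changing the arguments to `make_cage` is one way to explore the sensitivity the abstract refers to.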

Merigo, E., Anfossi, A., Chicco, D. (2026). The Gap Statistic can be misleading when used to evaluate near box shaped clusters in the Euclidean space. DISCOVER ARTIFICIAL INTELLIGENCE, 6(1) [10.1007/s44163-026-01195-2].

Subtype: Journal article - Scientific article
Keywords: Clustering; Clustering internal results evaluation; Gap Statistic; Internal clustering metrics; Unsupervised machine learning
Content language: English
Ahead-of-print date or first online publication date: 15 Apr 2026
Publication date: 2026
Journal: DISCOVER ARTIFICIAL INTELLIGENCE
Volume: 6
Issue: 1
Article number: 334
Article DOI: https://dx.doi.org/10.1007/s44163-026-01195-2
Full text: open
Appears in types: 01 - Journal article

Files in this record:

File	Access	Attachment type	License	Size	Format
Merigo et al-2026-Discov Artif Intell-VoR.pdf	Open access	Publisher’s Version (Version of Record, VoR)	Creative Commons	1.76 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/604925


Bicocca Open Archive
