Understanding the structure of a dataset is an easy task when the dimensions are two or three, but it can become extremely difficult when a dataset consists of tens, hundreds, or thousands of variables. Dimensionality reduction methods are computational techniques with solid mathematical foundations that allow for the projection of high-dimensional datasets into smaller data spaces. These low-dimensional representations of the original data, usually consisting of two variables, can then be plotted and inspected by researchers to gain an understanding of the original data structure. Uniform Manifold Approximation and Projection (UMAP) is one of the most effective and popular algorithms for dimensionality reduction, and has been proven effective on biomedical datasets, in particular. Even though UMAP is commonly utilized by thousands of researchers worldwide, no consensus has been reached on how to assess the output of dimensionality reduction informatively: to date, researchers often evaluate UMAP's outcomes by eyeballing its two-dimensional plots each time. Of course, this approach is rather arbitrary, as different individuals might interpret a 2D plot in a different way. Some numerical coefficients for assessing UMAP's conservation of global and local structure exist (continuity and trustworthiness, respectively), but they suffer from several flaws and can be misleading in multiple cases. To address these issues, we present here our Saturn coefficient, a new simple statistical metric that expresses the conservation of local structure and the conservation of global structure in UMAP through a real value ranging from 0 (no preservation) to 1 (complete preservation). In this study, we describe the rationale behind our Saturn coefficient and validate its results compared to continuity and trustworthiness on four artificial datasets and ten real-world biomedical datasets. 
Additionally, we propose a novel validation procedure based on the preservation of the clusters found by HDBSCAN (hierarchical density-based spatial clustering of applications with noise) in the original dataset within its dimensionality reduction representation (HDBSCANess). Our results demonstrate the validity of our Saturn coefficient across all artificial datasets and in seven out of fifteen real-world biomedical datasets. We therefore recommend the use of our Saturn coefficient to anyone wishing to assess UMAP results: our statistic, for example, can be used to test several sets of UMAP hyperparameters and to select the best configuration among them. Moreover, we also provide the software implementation of our Saturn coefficient as a standalone R package openly available on CRAN at https://doi.org/10.32614/CRAN.package.SaturnCoefficient and as a standalone Python package openly available on PyPI at https://pypi.org/project/SaturnScore.
Chicco, D., Melzi, S., Gasparini, F., Jurman, G. (2026). The advantages of our proposed Saturn coefficient over continuity and trustworthiness for UMAP dimensionality reduction evaluation. PeerJ Computer Science, 12 [10.7717/peerj-cs.3424].
The advantages of our proposed Saturn coefficient over continuity and trustworthiness for UMAP dimensionality reduction evaluation
Chicco D. (first author); Melzi S.; Gasparini F.
2026
| File | Access | Attachment type | License | Size | Format |
|---|---|---|---|---|---|
| Chicco et al-2026-PeerJ Computer Science-VoR.pdf | Open access | Publisher’s Version (Version of Record, VoR) | Creative Commons | 6.37 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


