Bicocca Open Archive

DBSCAN applied to EHRs data from patients with glioblastoma clusters patients based on cytosolic Hsp70 protein, sex, and brain subventricular zone

Chicco D. (first author); Dora S.; Oneto L.

2026

Abstract

Glioblastoma is an aggressive brain cancer that kills approximately one hundred thousand people worldwide every year. Unfortunately, treatment for patients with this disease is complicated and has limited efficacy in improving their chances of survival. Electronic health records (EHRs) contain patient information collected routinely at hospitals through medical visits and laboratory tests, providing a useful source of data for computational analyses. Clustering is an area of unsupervised machine learning in which an algorithm partitions data according to certain statistical properties or rules, thereby identifying hidden patterns and correlations that would otherwise be difficult to notice. In this study, we applied several clustering techniques to three open datasets (Munich2019, Tainan2020, and Utrecht2019) derived from electronic health records, which included clinical, genetic, and administrative features of patients diagnosed with glioblastoma, considering two possible clusters. We evaluated our clustering results with the Density-Based Clustering Validation (DBCV) index, a relatively new score capable of accurately assessing both convex-shaped and concave-shaped clusters. Among the methods tested, Density-based Spatial Clustering of Applications with Noise (DBSCAN) yielded the best results across all three datasets. We then analyzed the features of the clusters identified by DBSCAN and found that cytosolic Hsp70 protein in the Munich2019 dataset, sex in the Tainan2020 dataset, and brain subventricular zone in the Utrecht2019 dataset significantly distinguished the two clusters.
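As a rough illustration of the density-based clustering the abstract describes, the sketch below runs scikit-learn's DBSCAN on synthetic data. It is not the authors' pipeline: the data are generated two-dimensional points standing in for a patient-by-feature matrix, and the `eps` and `min_samples` values are illustrative assumptions rather than settings from the study. The DBCV index is not computed here.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a patient-by-feature matrix (NOT the study's data):
# two interleaved half-moons, i.e. concave-shaped groups of points.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Put both features on a comparable scale before measuring distances.
X = StandardScaler().fit_transform(X)

# eps and min_samples are illustrative guesses, not values from the paper.
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# DBSCAN assigns the label -1 to noise points; exclude it when counting clusters.
n_clusters = len(set(labels) - {-1})
n_noise = int(np.sum(labels == -1))
print(f"clusters found: {n_clusters}, noise points: {n_noise}")
```

Unlike k-means, DBSCAN does not take the number of clusters as an input. The number it finds depends on `eps` and `min_samples`, so obtaining a fixed number of clusters on real data would require tuning those two parameters.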

Subtype: Journal article - Scientific article
Keywords: Clustering; EHRs; Electronic health records; Glioblastoma; Machine learning; Unsupervised machine learning
Content language: English
Ahead-of-print date or first online publication date: 27 Mar 2026
Publication date: 2026
Journal: BIODATA MINING
Volume: 19
Issue: 1
Article number: 32
Article DOI: https://dx.doi.org/10.1186/s13040-026-00549-x
Full text: open
Citation: Chicco, D., Dora, S., Oneto, L. (2026). DBSCAN applied to EHRs data from patients with glioblastoma clusters patients based on cytosolic Hsp70 protein, sex, and brain subventricular zone. BIODATA MINING, 19(1) [10.1186/s13040-026-00549-x].
Appears in types: 01 - Journal article

Files in this item:

File	Access	Attachment type	License	Size	Format
Chicco et al-2026-BioData Mining-VoR.pdf	Open access	Publisher’s Version (Version of Record, VoR)	Creative Commons	1.87 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/611521
