Mariani, G., Raganato, A., Melzi, S., Pasi, G. (2026). GLUE3D: General language understanding evaluation for 3D point clouds. INFORMATION FUSION, 129(May 2026) [10.1016/j.inffus.2025.104007].

GLUE3D: General language understanding evaluation for 3D point clouds

Mariani, G.; Raganato, A.; Melzi, S.; Pasi, G.
2026

Abstract

Multimodal Large Language Models have achieved impressive results on text and image benchmarks, yet their capacity to ground language in 3D geometry remains largely unexplored. Existing 3D evaluations are either confined to specialised domains, such as indoor scans, or hampered by poor texture fidelity, and none allows a fair, modality-aligned comparison with their 2D counterparts. Without a rigorous benchmark, it remains unclear whether current 3D-aware models genuinely grasp shape, colour, pose, and quantity, or merely echo memorised textual priors. We address this gap with GLUE3D (General Language Understanding Evaluation for 3D Point Clouds), a benchmark built around 128 richly textured meshes spanning creatures, objects, architecture, and transport. Each asset is provided both as a 50k-point RGB point cloud and as a matched 512 × 512 rendering, enabling point-for-point evaluation across modalities. Over these assets we manually curate 1,024 binary probes, 256 multiple-choice questions, 256 open-ended questions, and 128 caption prompts that jointly evaluate sub-entity recognition, physical state, colour attribution, and counting, giving a fine-grained picture of 3D geometry and its semantics. A comprehensive study of twelve recent systems reveals a pronounced modality gap: the image-conditioned Qwen-2.5-VL achieves 79% accuracy on binary probes and 74% on multiple-choice questions, while the best point-cloud model reaches only 55% and 33%, respectively. Caption-quality assessments follow the same pattern, underscoring the distance that remains to genuine 3D understanding. We make GLUE3D publicly available, along with its evaluation scripts and baseline scores, to advance progress in geometry-aware multimodal language understanding.
Journal article - Scientific article
3D understanding; Benchmark; Multimodal large language models; Point cloud;
English
29 Nov 2025
2026
129
May 2026
104007
open
Files in this item:
File | Size | Format
Mariani-2026-Information Fusion-VoR.pdf

open access

Attachment type: Publisher's Version (Version of Record, VoR)
License: Creative Commons
Size: 7.33 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this item: https://hdl.handle.net/10281/588397
Citations
  • Scopus 0
  • Web of Science 0