Mariani, G., Raganato, A., Melzi, S., Pasi, G. (2026). GLUE3D: General language understanding evaluation for 3D point clouds. INFORMATION FUSION, 129(May 2026) [10.1016/j.inffus.2025.104007].

GLUE3D: General language understanding evaluation for 3D point clouds

Mariani, G.; Raganato, A.; Melzi, S.; Pasi, G.
2026

Abstract

Multimodal Large Language Models have achieved impressive results on text and image benchmarks, yet their capacity to ground language in 3D geometry remains largely unexplored. Existing 3D evaluations are either confined to specialised domains, such as indoor scans, or hampered by poor texture fidelity, and none allows a fair, modality-aligned comparison with their 2D counterparts. Without a rigorous benchmark, it remains unclear whether current 3D-aware models genuinely grasp shape, colour, pose, and quantity, or merely echo memorised textual priors. We address this gap with GLUE3D (General Language Understanding Evaluation for 3D Point Clouds), a benchmark built around 128 richly textured meshes spanning creatures, objects, architecture, and transport. Each asset is provided both as a 50k-point RGB point cloud and as a matched 512 × 512 rendering, enabling point-for-point evaluation across modalities. Over these assets we manually curate 1,024 binary probes, 256 multiple-choice questions, 256 open-ended questions, and 128 caption prompts that jointly evaluate sub-entity recognition, physical state, colour attribution, and counting, giving a fine-grained picture of 3D geometry and its semantics. A comprehensive study of twelve recent systems reveals a pronounced modality gap: the image-conditioned Qwen-2.5-VL achieves 79% accuracy on binary probes and 74% on multiple-choice questions, while the best point-cloud model reaches only 55% and 33%, respectively. Caption-quality assessments follow the same pattern, underscoring the distance that remains to genuine 3D understanding. We make GLUE3D publicly available, along with its evaluation scripts and baseline scores, to advance progress in geometry-aware multimodal language understanding.
Journal article - Scientific article
3D understanding; Benchmark; Multimodal large language models; Point cloud;
English
29 Nov 2025
2026
129
May 2026
104007
open
Files in this item:
File | Size | Format
Mariani-2026-Information Fusion-VoR.pdf

open access

Attachment type: Publisher's Version (Version of Record, VoR)
License: Creative Commons
Size: 7.33 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this item: https://hdl.handle.net/10281/588397
Citations
  • Scopus 0
  • Web of Science 0