Celona, L., Ciocca, G., Schettini, R. (2026). Leveraging foundation model DINO-v2 for image complexity estimation. NEURAL COMPUTING & APPLICATIONS, 38(1) [10.1007/s00521-025-11786-2].
Leveraging foundation model DINO-v2 for image complexity estimation
Celona, Luigi; Ciocca, Gianluigi; Schettini, Raimondo
2026
Abstract
Automated complexity estimation can be used for efficiently analyzing large image datasets, improving image compression, and enhancing tasks like image recognition, segmentation, and crowd counting. However, traditional methods often lack integration flexibility for broader applications, as image complexity estimation is carried out with single-use, ad hoc models. To mitigate this problem, we propose to exploit DINO-v2, a self-supervised vision transformer with strong generalization capabilities, as a backbone for image feature extraction. This is the first work that leverages and evaluates different features extracted from a foundation model for image complexity estimation. Here, we study two kinds of features that can be leveraged in different tasks: global features (context information) in the form of a single complexity score, and local features (detailed information) in the form of a pixel-wise complexity map. The features are extracted from separate branches specifically incorporated into the model. In model training, we demonstrate that a criterion based on linear and rank correlation between predictions and labels outperforms the more commonly used MSE. Our model performs comparably to state-of-the-art methods in both intra- and cross-dataset experiments. By using a pre-trained model, we simplify the training process of our method, and we demonstrate that the general-purpose features of DINO-v2 can be effectively used for complexity estimation.
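The abstract mentions a training criterion combining linear (Pearson) and rank (Spearman) correlation between predicted and ground-truth complexity scores. The paper's exact formulation is not given here, so the following is a minimal illustrative sketch of such a criterion in NumPy; the function names (`correlation_criterion`, etc.) are hypothetical, the rank term uses plain (non-differentiable) ranks without tie handling, and an actual training loss would require a differentiable surrogate for the rank component.

```python
import numpy as np

def pearson_corr(x, y):
    # Linear (Pearson) correlation between predictions and labels.
    x = x - x.mean()
    y = y - y.mean()
    return float((x * y).sum() / (np.sqrt((x ** 2).sum() * (y ** 2).sum()) + 1e-8))

def spearman_corr(x, y):
    # Rank (Spearman) correlation: Pearson correlation of the ranks.
    # Double argsort yields each element's rank; ties are not handled here.
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return pearson_corr(rx, ry)

def correlation_criterion(preds, labels):
    # Combined criterion: lower is better, reaching 0 when both
    # correlations with the labels equal 1.
    return (1.0 - pearson_corr(preds, labels)) + (1.0 - spearman_corr(preds, labels))
```

Unlike MSE, this criterion is invariant to a positive affine rescaling of the predictions: a model whose scores are a shifted and scaled version of the labels still achieves the minimum value of 0.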


