Bianco, S., Ferrario, G., Napoletano, P. (2022). Image Captioning using Pretrained Language Models and Image Segmentation. In IEEE International Conference on Consumer Electronics - Berlin, ICCE-Berlin (pp. 1-6). IEEE Computer Society. doi: 10.1109/ICCE-Berlin56473.2022.9937098
Image Captioning using Pretrained Language Models and Image Segmentation
Bianco S.; Napoletano P.
2022
Abstract
Large-scale pre-trained language models that have learned cross-modal representations from image-text pairs are becoming popular for vision-language tasks, because fine-tuning them on a specific task yields state-of-the-art results. Existing methods take features of image regions as input, but these regions are extracted with an object detection model that does not handle overlapping, noisy, and ambiguous regions; this inevitably results in less meaningful features. In this paper we propose a new way to extract region features based on image segmentation, with the goal of reducing overlap and noise. Our method is motivated by the observation that image segmentation can remove useless pixels: a binary mask isolates only the object of interest.
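The core idea in the abstract, using a binary segmentation mask so that only object pixels contribute to a region feature, can be sketched as follows. This is a minimal illustration with NumPy, not the authors' actual pipeline; the function name, the mean-pooling step, and the toy inputs are all assumptions made for clarity.

```python
import numpy as np

def masked_region_feature(image, mask):
    """Hypothetical sketch: zero out pixels outside a binary
    segmentation mask, then mean-pool the remaining (object)
    pixels into one feature vector per region."""
    # image: (H, W, C) float array; mask: (H, W) boolean array
    masked = image * mask[..., None]   # background pixels become 0
    n = mask.sum()                     # number of object pixels
    if n == 0:
        return np.zeros(image.shape[-1])
    # average only over object pixels, so background noise
    # does not dilute the region representation
    return masked.sum(axis=(0, 1)) / n

# toy example: 4x4 RGB image, object occupies the top-left 2x2 block
img = np.ones((4, 4, 3))
msk = np.zeros((4, 4), dtype=bool)
msk[:2, :2] = True
feat = masked_region_feature(img, msk)
```

Compared with a rectangular detection box, which always includes some background (and may overlap neighboring boxes), the mask ensures that pixels outside the object contribute nothing to the pooled feature.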