Bianco, S., Ferrario, G., Napoletano, P. (2022). Image Captioning using Pretrained Language Models and Image Segmentation. In IEEE International Conference on Consumer Electronics - Berlin, ICCE-Berlin (pp.1-6). IEEE Computer Society [10.1109/ICCE-Berlin56473.2022.9937098].

Image Captioning using Pretrained Language Models and Image Segmentation

Bianco S.; Napoletano P.
2022

Abstract

Large-scale pre-trained language models that have learned cross-modal representations from image-text pairs are becoming popular for vision-language tasks, because fine-tuning them on a specific task yields state-of-the-art results. Existing methods take features of image regions as input, but these regions are extracted with an object detection model that does not handle overlapping, noisy, and ambiguous regions, which inevitably results in less meaningful features. In this paper we propose a new way to extract region features based on image segmentation, with the goal of reducing overlap and noise. Our method is motivated by the observation that image segmentation can use a binary mask to discard irrelevant pixels and keep only the object of interest.
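
The masking idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes NumPy, an image stored as a height x width x channels array, and a binary mask already produced by some segmentation model. The function name `masked_region` is hypothetical. It crops the tight bounding box around the mask and zeroes every pixel outside the mask, so that a downstream feature extractor would see only the object of interest.

```python
import numpy as np


def masked_region(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Crop the tight bounding box of `mask` and zero out background pixels.

    image: array of shape (H, W, C)
    mask:  array of shape (H, W); non-zero entries mark the object
    (Hypothetical helper for illustration only.)
    """
    mask = mask.astype(bool)
    if mask.shape != image.shape[:2]:
        raise ValueError("mask and image must share height and width")
    if not mask.any():
        raise ValueError("mask is empty")

    # Rows and columns that contain at least one object pixel.
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    r0, r1 = rows[0], rows[-1] + 1
    c0, c1 = cols[0], cols[-1] + 1

    # Copy the crop so the input image is left untouched.
    crop = image[r0:r1, c0:c1].copy()
    # Zero all channels of every pixel that lies outside the mask.
    crop[~mask[r0:r1, c0:c1]] = 0
    return crop


if __name__ == "__main__":
    # Toy 4x4 RGB image and an L-shaped mask of three pixels.
    img = np.arange(4 * 4 * 3, dtype=np.uint8).reshape(4, 4, 3)
    msk = np.zeros((4, 4), dtype=np.uint8)
    msk[1, 1] = msk[2, 1] = msk[2, 2] = 1
    print(masked_region(img, msk).shape)
```

A plain bounding-box crop from an object detector would keep the background pixel inside the box; the masked crop replaces it with zeros, which is the kind of noise reduction the abstract describes.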
slide + paper
Keywords: Computer Vision; Image Captioning; Image Segmentation; Large-scale Language Model; Natural Language Processing
Language: English
Conference: 12th IEEE International Conference on Consumer Electronics, ICCE-Berlin 2022 - 2 September 2022 through 6 September 2022
Proceedings: IEEE International Conference on Consumer Electronics - Berlin, ICCE-Berlin
ISBN: 978-1-6654-5676-0
Pages: 1-6
none

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/398213