Predicting Video Memorability Using a Model Pretrained with Natural Language Supervision

Mirko Agarla; Luigi Celona; Raimondo Schettini
In press

Abstract

Video memorability prediction aims to quantify how well a given video will be remembered over time. The main attributes affecting memorability are not yet fully understood, and many methods in the literature rely on features extracted from content-recognition models. In this paper we demonstrate that features extracted from a model trained with natural language supervision are effective for estimating video memorability. The proposed method exploits a Vision Transformer pretrained with Contrastive Language-Image Pretraining (CLIP) to encode video frames. A temporal attention mechanism then selects and aggregates the relevant frame representations into a video-level feature vector. Finally, a multi-layer perceptron maps the video-level features to a memorability score. We test several types of encoding and temporal aggregation modules and submit our best solution to the MediaEval 2022 Predicting Media Memorability task. We achieve a correlation of 0.707 in subtask 1 (i.e., the Memento10k dataset). In subtask 2 we obtain a Pearson correlation of 0.487 when training on Memento10k and testing on VideoMem, and of 0.529 when training on VideoMem and testing on Memento10k.
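
For concreteness, below is a minimal PyTorch sketch of the pipeline the abstract describes: frozen CLIP frame embeddings are pooled by a learned temporal attention module and regressed to a score by a multi-layer perceptron. The embedding dimension (512, matching CLIP ViT-B/32), the hidden width, the sigmoid output, and the single-linear attention scorer are illustrative assumptions, not details taken from the paper.

    import torch
    import torch.nn as nn

    class TemporalAttentionPooling(nn.Module):
        """Softmax-weighted pooling of per-frame embeddings into one video-level vector."""
        def __init__(self, dim: int):
            super().__init__()
            self.score = nn.Linear(dim, 1)  # one relevance score per frame

        def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
            # frame_feats: (batch, n_frames, dim)
            weights = torch.softmax(self.score(frame_feats), dim=1)  # (batch, n_frames, 1)
            return (weights * frame_feats).sum(dim=1)                # (batch, dim)

    class MemorabilityRegressor(nn.Module):
        """Temporal attention pooling followed by an MLP that outputs a score in [0, 1]."""
        def __init__(self, dim: int = 512, hidden: int = 256):
            super().__init__()
            self.pool = TemporalAttentionPooling(dim)
            self.mlp = nn.Sequential(
                nn.Linear(dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, 1),
                nn.Sigmoid(),  # memorability annotations are typically normalized to [0, 1]
            )

        def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
            return self.mlp(self.pool(frame_feats)).squeeze(-1)  # (batch,)

    # Hypothetical usage with OpenAI's clip package (the frame count is an assumption):
    #   import clip
    #   encoder, preprocess = clip.load("ViT-B/32")
    #   with torch.no_grad():
    #       feats = encoder.encode_image(frames).float()  # frames: (n_frames, 3, 224, 224)
    #   score = MemorabilityRegressor()(feats.unsqueeze(0))

Softmax attention lets the regressor down-weight uninformative frames while remaining differentiable end to end, which is the role the abstract assigns to the temporal attention module.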
slide + paper
Video memorability; CLIP; Temporal attention module
English
MediaEval 2022: Multimedia Evaluation Workshop
2023
In press
https://2022.multimediaeval.com/paper2382.pdf
none
Agarla, M., Celona, L., Schettini, R. (In press). Predicting Video Memorability Using a Model Pretrained with Natural Language Supervision. Presented at: MediaEval 2022: Multimedia Evaluation Workshop, Bergen, Norway.
Files for this item:
There are no files associated with this item.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this item: https://hdl.handle.net/10281/403112