Predicting Video Memorability Using a Model Pretrained with Natural Language Supervision

Mirko Agarla; Luigi Celona; Raimondo Schettini
In press

Abstract

Video memorability prediction aims to quantify how well a given video will be remembered over time. The main attributes affecting memorability are not yet fully understood, and many methods in the literature rely on features extracted from content-recognition models. In this paper we demonstrate that features extracted from a model trained with natural language supervision are effective for estimating video memorability. The proposed method exploits a Vision Transformer pretrained with Contrastive Language-Image Pretraining (CLIP) to encode video frames. A temporal attention mechanism then selects and aggregates the relevant frame representations into a video-level feature vector. Finally, a multi-layer perceptron maps the video-level features to a memorability score. We test several types of encoding and temporal aggregation modules and submit our best solution to the MediaEval 2022 Predicting Media Memorability task. We achieve a correlation of 0.707 in subtask 1 (i.e., the Memento10k dataset). In subtask 2 we obtain a Pearson correlation of 0.487 when training on Memento10k and testing on VideoMem, and of 0.529 when training on VideoMem and testing on Memento10k.
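
For concreteness, below is a minimal PyTorch sketch of the pipeline the abstract describes: frozen CLIP frame embeddings are pooled by a learned temporal attention module and regressed to a score by a multi-layer perceptron. The embedding dimension (512, matching CLIP ViT-B/32), the hidden width, the sigmoid output, and the single-linear attention scorer are illustrative assumptions, not details taken from the paper.

    import torch
    import torch.nn as nn

    class TemporalAttentionPooling(nn.Module):
        """Softmax-weighted pooling of per-frame embeddings into one video-level vector."""
        def __init__(self, dim: int):
            super().__init__()
            self.score = nn.Linear(dim, 1)  # one relevance score per frame

        def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
            # frame_feats: (batch, n_frames, dim)
            weights = torch.softmax(self.score(frame_feats), dim=1)  # (batch, n_frames, 1)
            return (weights * frame_feats).sum(dim=1)                # (batch, dim)

    class MemorabilityRegressor(nn.Module):
        """Temporal attention pooling followed by an MLP that outputs a score in [0, 1]."""
        def __init__(self, dim: int = 512, hidden: int = 256):
            super().__init__()
            self.pool = TemporalAttentionPooling(dim)
            self.mlp = nn.Sequential(
                nn.Linear(dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, 1),
                nn.Sigmoid(),  # memorability annotations are typically normalized to [0, 1]
            )

        def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
            return self.mlp(self.pool(frame_feats)).squeeze(-1)  # (batch,)

    # Hypothetical usage with OpenAI's clip package (the frame count is an assumption):
    #   import clip
    #   encoder, preprocess = clip.load("ViT-B/32")
    #   with torch.no_grad():
    #       feats = encoder.encode_image(frames).float()  # frames: (n_frames, 3, 224, 224)
    #   score = MemorabilityRegressor()(feats.unsqueeze(0))

Softmax attention lets the regressor down-weight uninformative frames while remaining differentiable end to end, which is the role the abstract assigns to the temporal attention module.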
slide + paper
Video memorability; CLIP; Temporal attention module
English
MediaEval 2022: Multimedia Evaluation Workshop
2023
In press
https://2022.multimediaeval.com/paper2382.pdf
none
Agarla, M., Celona, L., Schettini, R. (In press). Predicting Video Memorability Using a Model Pretrained with Natural Language Supervision. Presented at: MediaEval 2022: Multimedia Evaluation Workshop, Bergen, Norway.
Files for this item:
There are no files associated with this item.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this item: https://hdl.handle.net/10281/403112