Bicocca Open Archive

The ability to correctly model distinct meanings of a word is crucial for the effectiveness of semantic representation techniques. However, most existing evaluation benchmarks for assessing this criterion are tied to sense inventories (usually WordNet), restricting their usage to a small subset of knowledge-based representation techniques. The Word-in-Context dataset (WiC) addresses the dependence on sense inventories by reformulating the standard disambiguation task as a binary classification problem; however, it is limited to the English language. We put forward a large multilingual benchmark, XL-WiC, featuring gold standards in 12 new languages from varied language families and with different degrees of resource availability, opening room for evaluation scenarios such as zero-shot cross-lingual transfer. We perform a series of experiments to determine the reliability of the datasets and to set performance baselines for several recent contextualized multilingual models. Experimental results show that even when no tagged instances are available for a target language, models trained solely on the English data can attain competitive performance in the task of distinguishing different meanings of a word, even for distant languages. XL-WiC is available at https://pilehvar.github.io/xlwic/.
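The binary classification framing mentioned in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `WiCInstance` type, its field names, the two example sentences, and the `always_same` baseline are hypothetical and are not taken from the XL-WiC data files or the authors' code. The sketch only shows the shape of the task: given a target word in two contexts, predict whether it carries the same meaning, then score predictions by accuracy.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class WiCInstance:
    """One hypothetical WiC-style example: a target word in two contexts."""
    target: str
    context_a: str
    context_b: str
    same_meaning: bool  # gold label: True if the sense is the same in both contexts


def accuracy(instances: Iterable[WiCInstance],
             predict: Callable[[WiCInstance], bool]) -> float:
    """Fraction of instances whose predicted label matches the gold label."""
    items = list(instances)
    if not items:
        raise ValueError("no instances to evaluate")
    correct = sum(predict(x) == x.same_meaning for x in items)
    return correct / len(items)


def always_same(instance: WiCInstance) -> bool:
    """Trivial baseline that ignores the contexts and always answers True."""
    return True


# Illustrative examples only; they are not drawn from the dataset.
examples = [
    WiCInstance("bank", "She sat on the river bank.",
                "He deposited cash at the bank.", False),
    WiCInstance("bank", "The bank raised its rates.",
                "I opened an account at the bank.", True),
]
score = accuracy(examples, always_same)
```

Because the label is a plain yes/no decision, any model that can compare two contextualized representations of the target word can be plugged in as `predict`, without reference to a sense inventory.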

Raganato, A., Pasini, T., Camacho-Collados, J., Pilehvar, M. (2020). XL-WiC: A multilingual benchmark for evaluating semantic contextualization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 7193-7206). Association for Computational Linguistics (ACL) [10.18653/v1/2020.emnlp-main.584].

XL-WiC: A multilingual benchmark for evaluating semantic contextualization

Raganato, A; Pasini, T; Camacho-Collados, J; Pilehvar, MT

2020

Contribution type: paper
Keywords: word sense disambiguation; neural networks; deep learning; multilinguality
Content language: English
Conference name: EMNLP 2020. The 2020 Conference on Empirical Methods in Natural Language Processing
Conference year: 2020
Proceedings title: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Proceedings volume ISBN: 978-1-952148-60-6
Publication date: 2020
First page: 7193
Last page: 7206
DOI: https://dx.doi.org/10.18653/v1/2020.emnlp-main.584
Full text: partially open
Appears in types: 02 - Conference contribution

Files in this item:

File	Size	Format
2020.emnlp-main.584.pdf (archive managers only)	606.47 kB	Adobe PDF
10281-361586_VoR.pdf (open access; attachment type: Publisher’s Version (Version of Record, VoR); license: Creative Commons)	606.47 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/361586
