Malandri, L., Mercorio, F., Mezzanzanica, M., Pallucchini, F. (2024). SeNSe: embedding alignment via semantic anchors selection. INTERNATIONAL JOURNAL OF DATA SCIENCE AND ANALYTICS [10.1007/s41060-024-00522-z].
SeNSe: embedding alignment via semantic anchors selection
Malandri, L.; Mercorio, F.; Mezzanzanica, M.; Pallucchini, F.
2024
Abstract
Word embeddings have proven extremely useful across many NLP applications in recent years. Several key linguistic tasks, such as machine translation and transfer learning, require distributed representations of words belonging to different vector spaces, within or across domains and languages, to be aligned, a task known as embedding alignment. To this end, several existing methods exploit words that are supposed to have the same meaning in the two corpora, called the seed lexicon or anchors, as reference points to map one embedding space into the other. All of those methods consider only the word that is supposed to have the same meaning in the two spaces when choosing anchors, while its neighbours and similar words are neglected. We propose SeNSe, an unsupervised method for aligning monolingual embeddings that generates a bilingual dictionary composed of the words with the most similar meanings across word vector spaces. Our approach selects a seed lexicon of words used in the same context in both corpora, without assuming a priori semantic similarities. Comparing our method with well-established benchmarks, we show that SeNSe outperforms state-of-the-art (SOTA) methods for embedding alignment on bilingual lexicon extraction in most cases.
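The mapping step the abstract refers to, aligning one embedding space to another using anchor pairs as reference points, is commonly solved as an orthogonal Procrustes problem. The sketch below illustrates that standard step on toy data; it is not the paper's anchor-selection method (SeNSe's contribution), and the matrices `X` and `Y` are synthetic stand-ins for anchor vectors in the source and target spaces.

```python
import numpy as np

# Hedged illustration of the generic alignment step, not SeNSe itself:
# given anchor vectors X (source space) and Y (target space), find an
# orthogonal map R minimising ||X R - Y||_F (orthogonal Procrustes).
rng = np.random.default_rng(0)
d = 4                                   # embedding dimension (toy value)
X = rng.normal(size=(10, d))            # 10 anchor vectors, source space
R_true = np.linalg.qr(rng.normal(size=(d, d)))[0]  # hidden rotation
Y = X @ R_true                          # target-space anchors (rotated copy)

# Closed-form solution via SVD of the cross-covariance X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
R = U @ Vt                              # optimal orthogonal map

aligned = X @ R                         # source vectors mapped into target space
print(np.allclose(aligned, Y))          # True on this noiseless toy example
```

With real corpora the anchor pairs are noisy, so the residual is non-zero and the quality of the alignment hinges on how the seed lexicon is chosen, which is exactly the part SeNSe addresses.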
File | Type of attachment | License | Size | Format
---|---|---|---|---
Malandri-2024-International Journal of Data Science and Analytics-VoR.pdf | Publisher's Version (Version of Record, VoR) | All rights reserved | 1.3 MB | Adobe PDF

Access: archive administrators only; a copy may be requested.
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.