Word embeddings have proven extremely useful across many NLP applications in recent years. Several key linguistic tasks, such as machine translation and transfer learning, require distributed representations of words from different vector spaces, within or across domains and languages, to be aligned so that they can be compared; this task is known as embedding alignment. To this end, several existing methods exploit words that are supposed to have the same meaning in the two corpora, called a seed lexicon or anchors, as reference points to map one embedding into the other. When choosing anchors, all of these methods consider only the word that is supposed to have the same meaning in the two spaces, neglecting its neighbours and similar words. We propose SeNSe, an unsupervised method for aligning monolingual embeddings that generates a bilingual dictionary composed of the words with the most similar meaning across word vector spaces. Our approach selects a seed lexicon of words used in the same context in both corpora without assuming a priori semantic similarities. We compare our method against well-established benchmarks, showing that SeNSe outperforms state-of-the-art (SOTA) methods for embedding alignment on bilingual lexicon extraction in most cases.
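
As a rough illustration of the anchor-based mapping the abstract refers to, the sketch below aligns two toy embedding spaces with an orthogonal Procrustes map learned from a handful of anchor pairs, then induces a lexicon by cosine nearest neighbour. This is a generic baseline under stated assumptions, not the SeNSe anchor-selection procedure: the anchor indices are simply given, the data are synthetic, and the names `procrustes_align` and `nearest_neighbours` are hypothetical helpers introduced here.

```python
import numpy as np


def procrustes_align(anchor_src, anchor_tgt):
    """Return the orthogonal matrix W minimising ||anchor_src @ W - anchor_tgt||_F."""
    u, _, vt = np.linalg.svd(anchor_src.T @ anchor_tgt)
    return u @ vt


def nearest_neighbours(mapped, target):
    """For each mapped source vector, return the index of the most cosine-similar target vector."""
    mapped_n = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    target_n = target / np.linalg.norm(target, axis=1, keepdims=True)
    return np.argmax(mapped_n @ target_n.T, axis=1)


# Synthetic data (assumption): the target space is an exact rotation of the source space.
rng = np.random.default_rng(0)
dim, vocab = 5, 20
source = rng.normal(size=(vocab, dim))
true_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
target = source @ true_rotation

# Assumed seed lexicon: the first 8 words are known to correspond across spaces.
anchors = np.arange(8)
W = procrustes_align(source[anchors], target[anchors])

# Induce a lexicon for the whole vocabulary from the learned map.
induced = nearest_neighbours(source @ W, target)
```

In this idealised setting the map learned from the anchors recovers the rotation, so every source word retrieves its own counterpart. With real embeddings the spaces are only approximately related, which is why the quality of the anchors matters.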

Malandri, L., Mercorio, F., Mezzanzanica, M., Pallucchini, F. (2024). SeNSe: embedding alignment via semantic anchors selection. International Journal of Data Science and Analytics [10.1007/s41060-024-00522-z].

SeNSe: embedding alignment via semantic anchors selection

Malandri, L; Mercorio, F; Mezzanzanica, M; Pallucchini, F
2024

Journal article - Scientific article
Bilingual lexicon induction; Embedding alignment; Information retrieval; Word embedding
English
20-Mar-2024
2024
reserved
Files in this item:
Malandri-2024-International Journal of Data Science and Analytics-VoR.pdf
Access: archive managers only
Attachment type: Publisher’s Version (Version of Record, VoR)
License: All rights reserved
Size: 1.3 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/468278
Citations
  • Scopus 1
  • ISI (Web of Science) 0