VEUCTOR: Training and selecting best vector space models from online job ads for European countries

D'Amico, Simone; Mercorio, Fabio; Mezzanzanica, Mario
2026

Abstract

Over the last decade, word embeddings have allowed machines to represent words and sentences as vectors, letting researchers reason over text in tasks such as semantic similarity, contextual understanding, and machine translation. However, the construction of embeddings involves domain-specific parameters that affect semantic accuracy and contextual relevance, often leading to unpredictable biases and inconsistent comparisons. This issue is particularly relevant in labor market analysis, where different embeddings yield different results, making the choice of the most appropriate model a key step. This paper addresses these challenges by (i) proposing a methodology to train, select, and align vector space models for a target taxonomy, ensuring comparability across dimensions and languages; (ii) applying this approach to 4.5 million job ads in 28 languages, aligning country-specific embeddings using the ESCO taxonomy; (iii) generating more than 3,000 models in 142 machine days and making the best-performing ones publicly available via VEUCTOR; and (iv) showing that model choice significantly affects labor market analysis, revealing substantial variation in occupational skill bundles across embeddings.
Journal article - Scientific article
Labor market; Machine learning; NLP; Word embedding
English
21 Feb 2026
2026
741
15 June 2026
123274
open
Colombo, E., D'Amico, S., Mercorio, F., Mezzanzanica, M. (2026). VEUCTOR: Training and selecting best vector space models from online job ads for European countries. INFORMATION SCIENCES, 741(15 June 2026) [10.1016/j.ins.2026.123274].
Files in this item:

File: Colombo et al-2026-Information Sciences-VoR.pdf
Open access
Attachment type: Publisher’s Version (Version of Record, VoR)
License: Creative Commons
Size: 6.74 MB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/594429