Bicocca Open Archive

VEUCTOR: Training and selecting best vector space models from online job ads for European countries

Colombo, Emilio; D'Amico, Simone; Mercorio, Fabio; Mezzanzanica, Mario

2026

Abstract

Over the last decade, word embeddings have allowed machines to represent words and sentences as vectors, enabling researchers to reason about text in tasks such as semantic similarity, contextual understanding, and machine translation. However, building embeddings involves domain-specific parameters that affect semantic accuracy and contextual relevance, often leading to unpredictable biases and inconsistent comparisons. This issue is particularly relevant in labor market analysis, where different embeddings yield different results, so selecting the most appropriate model becomes a key step. This paper addresses these challenges by (i) proposing a methodology to train, select, and align vector space models for a target taxonomy, ensuring comparability across dimensions and languages; (ii) applying this approach to 4.5 million job ads in 28 languages, aligning country-specific embeddings using the ESCO taxonomy; (iii) generating more than 3,000 models in 142 machine days and making the best-performing ones publicly available via VEUCTOR; and (iv) showing that model choice significantly affects labor market analysis, revealing substantial variation in occupational skill bundles across embeddings.

Subtype: Journal article - Scientific article
Keywords: Labor market; Machine learning; NLP; Word embedding
Content language: English
Ahead-of-print or first online publication date: 21 Feb 2026
Publication date: 2026
Journal: INFORMATION SCIENCES
Volume: 741
Issue: 15 June 2026
Article number: 123274
Article DOI: https://dx.doi.org/10.1016/j.ins.2026.123274
Fulltext: open
Citation: Colombo, E., D'Amico, S., Mercorio, F., Mezzanzanica, M. (2026). VEUCTOR: Training and selecting best vector space models from online job ads for European countries. INFORMATION SCIENCES, 741(15 June 2026) [10.1016/j.ins.2026.123274].
Appears in types: 01 - Journal article

Files in this item:

File	Size	Format
Colombo et al-2026-Information Sciences-VoR.pdf (open access; attachment type: Publisher’s Version (Version of Record, VoR); license: Creative Commons)	6.74 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/594429
