Colombo, S., D'Amico, S., Malandri, L., Mercorio, F., Seveso, A. (2025). JobSet: Synthetic Job Advertisements Dataset for Labour Market Intelligence. In SAC '25: Proceedings of the 40th ACM/SIGAPP Symposium on Applied Computing (pp. 928-935) [10.1145/3672608.3707718].
JobSet: Synthetic Job Advertisements Dataset for Labour Market Intelligence
D'Amico, Simone;Malandri, Lorenzo;Mercorio, Fabio;Seveso, Andrea
2025
Abstract
The use of online services for advertising job positions has grown over the last decade, thanks to the ability of Online Job Advertisements (OJAs) to observe the labour market in near real-time, predict new occupation trends, identify relevant skills, and support policy and decision-making activities. Unsurprisingly, 2023 was declared the Year of Skills by the EU, as skill mismatch is a key challenge for European economies. In such a scenario, machine learning (ML) approaches have played a key role in classifying job ads and extracting skills according to well-established taxonomies. However, the effectiveness of ML depends on access to annotated job advertisement datasets, which are often limited and require time-consuming manual annotation. The lack of annotated OJA benchmarks representative of real online OJA and skill distributions is currently limiting advances in skill intelligence. To address this, we propose JobGen, which leverages Large Language Models (LLMs) to generate synthetic OJAs. We use real OJAs collected from an EU project and the ESCO taxonomy to represent job market distributions accurately. JobGen enhances data diversity and semantic alignment, addressing common issues in synthetic data generation. The resulting dataset, JobSet, provides a valuable resource for tasks like skill extraction and job matching and is openly available to the community.