This paper provides a general framework for addressing the non-representativeness and non-random selectivity of online job ads data, using generalized sample selection models and Eurostat benchmark data. We jointly model the outcome intensity (the number of online job ads in observed profiles, whose levels are defined by auxiliary variables) and the probability of endogenous selection (the likelihood that online job ads are not missing in a given profile). This allows us to model the missing-data mechanism without an a priori justification of missingness at random, which is generally assumed by multilevel regression and post-stratification, a popular benchmark technique in this field. Moreover, we offer new post-stratification strategies to calibrate the unconditional predictions on benchmark/reference samples. We use data from Cedefop's Skill Ovate platform, with online job advertisements collected for all EU regions in 2022, and from an Italian web platform covering 2013Q2-2018Q2. As reference samples we use aggregated Labour Force Survey (LFS) recent job starters and LFS new hires from microdata, which represent reasonable lower bounds for job advertisements. Online job ads are strongly overrepresented relative to the benchmark data (+40% with respect to LFS recent job starters and +400% over new hires from LFS microdata). Generalized sample selection models reduced this bias by half, unlike multilevel post-stratification and other univariate approaches, which, moreover, resulted in bias.
Lovaglio, P., Mezzanzanica, M. (2026). Analyzing Non‐Random Selectivity in Online Job Advertisements Using Eurostat Benchmark Data and Generalized Sample Selection Models: An Application to EU Regional Labor Markets. LABOUR [10.1111/labr.70008].
Analyzing Non‐Random Selectivity in Online Job Advertisements Using Eurostat Benchmark Data and Generalized Sample Selection Models: An Application to EU Regional Labor Markets
Lovaglio, Pietro Giorgio; Mezzanzanica, Mario
2026
| File | Size | Format |
|---|---|---|
| Lovaglio-2026-Labour-VoR.pdf (open access; attachment type: Publisher's Version (Version of Record, VoR); license: Creative Commons) | 747.77 kB | Adobe PDF |


