(2025). Zero-Shot Hierarchical Short Text Classification. (Doctoral thesis, , 2025).

Zero-Shot Hierarchical Short Text Classification

MOIRAGHI MOTTA, FEDERICO
2025

Abstract

Classifying public tenders is useful both for companies that are invited to participate and for detecting fraudulent activity. To make the task easier for participants and public administrations alike, the European Union introduced a common taxonomy, the Common Procurement Vocabulary (CPV), which is mandatory for tenders of a certain importance; however, contracts for which a CPV label is mandatory are a minority of all public administrations' activities. Classifying over a real-world taxonomy introduces difficulties that cannot be ignored. In particular, some fine-grained classes have few or no observations in the training set, while other classes are far more frequent than the average, in some cases by a factor of thousands. To overcome these difficulties, we present a zero-shot approach called Hierarchical Cross Encoder (HCE), based on a pre-trained language model, that relies only on label descriptions and respects the label taxonomy. To test the proposed model, we used both state-of-the-art datasets and an industrial dataset from contrattipubblici.org (a service by SpazioDati s.r.l. that collects public contracts stipulated in Italy over the last 25 years).
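
The abstract does not spell out how HCE scores labels or traverses the taxonomy, so the following is only a minimal sketch of the general idea it names: classify a short text using label descriptions alone, descending the hierarchy one level at a time. Everything in it is an assumption made for illustration: the `Node` structure, the `classify` function, the label codes `A`, `A1`, `B`, `B1` (which are not real CPV codes), and above all `overlap_score`, a toy token-overlap function standing in for the pre-trained cross-encoder that would score each (text, label description) pair.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Node:
    """One taxonomy label: a code, its description, and its child labels."""
    code: str
    description: str
    children: List["Node"] = field(default_factory=list)


def overlap_score(text: str, description: str) -> float:
    """Toy stand-in for a cross-encoder: Jaccard overlap of lowercase tokens."""
    a, b = set(text.lower().split()), set(description.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def classify(
    text: str,
    root: Node,
    score: Callable[[str, str], float] = overlap_score,
) -> List[Tuple[str, float]]:
    """Walk the taxonomy top-down, keeping the best-scoring child at each level.

    No labelled training examples are used: only label descriptions are scored.
    """
    path: List[Tuple[str, float]] = []
    node = root
    while node.children:
        scored = [(score(text, child.description), child) for child in node.children]
        best_score, best = max(scored, key=lambda pair: pair[0])
        path.append((best.code, best_score))
        node = best
    return path


# Hypothetical two-level taxonomy; the codes are invented, not CPV codes.
TAXONOMY = Node("root", "", [
    Node("A", "construction work on buildings and roads", [
        Node("A1", "road construction and repair work"),
        Node("A2", "building construction work"),
    ]),
    Node("B", "office and computer equipment", [
        Node("B1", "laptop and desktop computers"),
        Node("B2", "printer paper and office stationery"),
    ]),
])

path = classify("supply of laptop computers for the municipal office", TAXONOMY)
print([code for code, _ in path])  # → ['B', 'B1']
```

Because the predicted label at each level is always a child of the label chosen one level above, the output is consistent with the taxonomy by construction. The trade-off of this greedy descent is that a wrong choice near the root cannot be corrected further down.
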
PALMONARI, MATTEO LUIGI
nlp; taxonomy; hierarchical; language model; bert
INF/01 - COMPUTER SCIENCE
English
27-Feb-2025
37
2023/2024
open
Files in this item:
File: phd_unimib_799735.pdf (open access)
Description: Zero-Shot Hierarchical Short Text Classification
Attachment type: Doctoral thesis
Size: 2.47 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/543822