Alva Principe, R., Chiarini, N., Viviani, M. (2025). Long Document Classification in the Transformer Era: A Survey on Challenges, Advances, and Open Issues. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 15(2), June 2025. https://doi.org/10.1002/widm.70019
Long Document Classification in the Transformer Era: A Survey on Challenges, Advances, and Open Issues
Alva Principe, R.; Viviani, M.
2025
Abstract
Automatic Document Classification (ADC) is the task of automatically categorizing or labeling documents into predefined classes. Its effectiveness depends on several factors, including the models used for the formal representation of documents, the classification techniques applied, or a combination of both. Transformer models have recently gained popularity thanks to their pre-training on large corpora, which allows flexible knowledge transfer to downstream tasks such as ADC. However, these models struggle with “long” documents, mainly because of input sequence length constraints, which in turn affect the task we refer to as Automatic Long Document Classification (ALDC). Several models addressing this limitation of Transformers have been proposed in recent years and applied to ALDC; however, they have yielded inconsistent results, often fail to surpass simple baselines, and generalize poorly across datasets and scenarios. This survey examines these limitations by: (i) presenting current long-document representation issues and the solutions proposed in the literature; (ii) providing, based on such solutions, a comprehensive analysis of their application to ALDC and of their effectiveness; and (iii) discussing current evaluation strategies in ALDC, with particular reference to suitable baselines and genuinely long-document benchmark datasets.
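The input-length constraint mentioned in the abstract can be made concrete with a small sketch. The example below is not taken from the survey: the model name (`bert-base-uncased`), the 512-token limit, the 64-token stride, and the mean-pooled chunk aggregation are illustrative assumptions. It shows how a vanilla BERT-style classifier silently truncates a long document, and how a naive chunk-and-aggregate strategy processes the full text instead.

```python
# Minimal sketch of the sequence-length constraint behind ALDC (assumptions noted above).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

long_document = " ".join(["word"] * 5000)  # stand-in for a "long" document

# 1) Plain truncation: anything beyond max_length is silently discarded.
truncated = tokenizer(long_document, truncation=True, max_length=512, return_tensors="pt")
print(truncated["input_ids"].shape)  # -> torch.Size([1, 512])

# 2) Naive chunk-and-aggregate baseline: split the document into overlapping
#    512-token chunks, classify each chunk, and average the logits.
chunks = tokenizer(
    long_document,
    truncation=True,
    max_length=512,
    stride=64,                        # overlap between consecutive chunks
    return_overflowing_tokens=True,
    padding="max_length",
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(
        input_ids=chunks["input_ids"],
        attention_mask=chunks["attention_mask"],
    ).logits                          # shape: (num_chunks, num_labels)
doc_probs = torch.softmax(logits.mean(dim=0), dim=-1)
print(doc_probs)                      # document-level prediction from the chunk average
```

Truncation and chunk aggregation of this kind are commonly used as simple baselines in long-document classification, which makes them a natural reference point for the dedicated long-document models discussed in the survey.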
| File | Description | Attachment type | License | Size | Format |
|---|---|---|---|---|---|
| Alva Principe-2025-WIREs Data Mining and Knowledge Discovery-VoR.pdf (open access) | This is an open access article under the terms of the Creative Commons Attribution License | Publisher's Version (Version of Record, VoR) | Creative Commons | 1.45 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


