Alva Principe, R., Chiarini, N., Viviani, M. (2025). Long Document Classification in the Transformer Era: A Survey on Challenges, Advances, and Open Issues. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 15(2), June 2025. https://doi.org/10.1002/widm.70019
Long Document Classification in the Transformer Era: A Survey on Challenges, Advances, and Open Issues
Alva Principe, R.; Viviani, M.
2025
Abstract
Automatic Document Classification (ADC) is the task of automatically categorizing or labeling documents into predefined classes. Its effectiveness depends on several factors, including the models used for the formal representation of documents, the classification techniques applied, or a combination of both. Transformer models have recently gained popularity thanks to their pre-training on large corpora, which allows flexible knowledge transfer to downstream tasks such as ADC. However, these models struggle with “long” documents, mainly because of input sequence length constraints, which in turn affect the task we refer to as Automatic Long Document Classification (ALDC). Several models addressing this limitation of Transformers have been proposed in recent years and applied to ALDC; however, they have yielded inconsistent results, often fail to surpass simple baselines, and generalize poorly across datasets and scenarios. This survey examines these limitations by: (i) presenting current long-document representation issues and the solutions proposed in the literature; (ii) providing, based on such solutions, a comprehensive analysis of their application to ALDC and of their effectiveness; and (iii) discussing current evaluation strategies in ALDC, with particular reference to suitable baselines and genuinely long-document benchmark datasets.
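The input-length constraint mentioned in the abstract can be made concrete with a small sketch. The example below is not taken from the survey: the model name (`bert-base-uncased`), the 512-token limit, the 64-token stride, and the mean-pooled chunk aggregation are illustrative assumptions. It shows how a vanilla BERT-style classifier silently truncates a long document, and how a naive chunk-and-aggregate strategy processes the full text instead.

```python
# Minimal sketch of the sequence-length constraint behind ALDC (assumptions noted above).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

long_document = " ".join(["word"] * 5000)  # stand-in for a "long" document

# 1) Plain truncation: anything beyond max_length is silently discarded.
truncated = tokenizer(long_document, truncation=True, max_length=512, return_tensors="pt")
print(truncated["input_ids"].shape)  # -> torch.Size([1, 512])

# 2) Naive chunk-and-aggregate baseline: split the document into overlapping
#    512-token chunks, classify each chunk, and average the logits.
chunks = tokenizer(
    long_document,
    truncation=True,
    max_length=512,
    stride=64,                        # overlap between consecutive chunks
    return_overflowing_tokens=True,
    padding="max_length",
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(
        input_ids=chunks["input_ids"],
        attention_mask=chunks["attention_mask"],
    ).logits                          # shape: (num_chunks, num_labels)
doc_probs = torch.softmax(logits.mean(dim=0), dim=-1)
print(doc_probs)                      # document-level prediction from the chunk average
```

Truncation and chunk aggregation of this kind are commonly used as simple baselines in long-document classification, which makes them a natural reference point for the dedicated long-document models discussed in the survey.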
| File | Description | Attachment type | License | Size | Format |
|---|---|---|---|---|---|
| Alva Principe-2025-WIREs Data Mining and Knowledge Discovery-VoR.pdf (open access) | This is an open access article under the terms of the Creative Commons Attribution License | Publisher's Version (Version of Record, VoR) | Creative Commons | 1.45 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


