Bicocca Open Archive

This paper addresses noise caused by word omissions in a corpus of school writings, with the aim of facilitating their subsequent automatic processing. Although a normalization step makes these texts easier to process, some linguistic expressions remain difficult to interpret, particularly where the writer has omitted words. The present contribution proposes three automatic or semi-automatic solutions to this problem. The first inserts a "mask" token of the form xxx. The second is a semi-automatic approach in which each morpho-syntactic category proposed during normalization is replaced by the corresponding "prototypical word." The third uses the FlauBERT language model to "reconstruct" the most probable token in the text. The three methods are evaluated quantitatively; the results of method 3, which proved the most effective in the context of our research, are also presented qualitatively.
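
As a rough illustration of the first two methods, the sketch below fills an omission slot either with the xxx mask or with a per-category placeholder word. Everything concrete here is an assumption made for illustration and is not taken from the article: the token representation (a `(form, category)` pair whose form is `None` for an omitted word), the category labels, the `PROTOTYPES` table, and the function names. Method 3 would instead take the filler from a masked language model such as FlauBERT, which is not sketched here.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical representation: (surface form, morpho-syntactic category).
# An omitted word has form None and carries the category proposed
# during normalization.
Token = Tuple[Optional[str], str]

MASK = "xxx"  # mask token of method 1

# Hypothetical prototypical words, one per category (illustrative only).
PROTOTYPES: Dict[str, str] = {
    "DET": "le",
    "NOUN": "chose",
    "VERB": "faire",
    "PREP": "de",
}


def fill_with_mask(tokens: List[Token]) -> List[str]:
    """Method 1 (sketch): replace every omitted word with the mask token."""
    return [MASK if form is None else form for form, _ in tokens]


def fill_with_prototype(
    tokens: List[Token], prototypes: Dict[str, str] = PROTOTYPES
) -> List[str]:
    """Method 2 (sketch): replace every omitted word with the prototypical
    word of its category, falling back to the mask for unknown categories."""
    return [
        prototypes.get(category, MASK) if form is None else form
        for form, category in tokens
    ]


# "il voit chat": a determiner is missing before "chat".
example: List[Token] = [
    ("il", "PRON"),
    ("voit", "VERB"),
    (None, "DET"),
    ("chat", "NOUN"),
]

masked = fill_with_mask(example)
prototyped = fill_with_prototype(example)
```

With these assumed inputs, `masked` keeps the slot visible as xxx, while `prototyped` yields a sequence that a downstream tagger can read as an ordinary sentence.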

Barletta, M., Ponton, C. (2025). La question de la normalisation des écrits scolaires pour leur traitement automatique. Le cas de l’omission de mots. CORPUS, 26 [10.4000/1364v].

La question de la normalisation des écrits scolaires pour leur traitement automatique. Le cas de l’omission de mots

Barletta, Martina (first author); Ponton, Claude (second author)

2025

Subtype: Journal article - Scientific article
Keywords: children corpus, NLP, normalization, morphosyntactic analysis
Keywords (French): écrits scolaires, TAL, normalisation, analyse morphosyntaxique
Content language: French
Publication date: 2025
Journal: CORPUS
Volume number: 26
Article DOI: https://dx.doi.org/10.4000/1364v
Alternative URL: https://journals.openedition.org/corpus/9322
Full text: open
Citation: Barletta, M., Ponton, C. (2025). La question de la normalisation des écrits scolaires pour leur traitement automatique. Le cas de l’omission de mots. CORPUS, 26 [10.4000/1364v].
Appears in types: 01 - Journal article

Files in this item:

File	Access	Attachment type	License	Size	Format
corpus-9322.pdf	open access	Publisher’s Version (Version of Record, VoR)	Publisher-specific open access license	340.66 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/539081
