This paper addresses noise caused by word omissions in a corpus of school writings, with the aim of facilitating their subsequent automatic processing. Although a normalization step makes these texts easier to process, some expressions remain hard to interpret, particularly where the writer has omitted words. The paper proposes three automatic or semi-automatic solutions to this problem. The first inserts a "mask" token of the form xxx. The second is a semi-automatic approach in which each morpho-syntactic category proposed during normalization is replaced by the corresponding "prototypical word." The third uses the FlauBERT language model to "reconstruct" the most probable token in the text. The three methods are evaluated quantitatively, and the results of the third method, which proved the most effective in the context of this research, are also presented qualitatively.
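As a rough illustration of the first two methods described in the abstract, the sketch below fills omission sites either with the xxx mask token or with a per-category "prototypical word." It is not the authors' implementation: the `<omit:CAT>` marker format, the `PROTOTYPES` table, and the function names are all hypothetical, chosen only to make the idea concrete.

```python
import re

# Hypothetical omission marker: "<omit:CAT>", where CAT is the morpho-syntactic
# category assigned during normalization. The real annotation format may differ.
OMISSION = re.compile(r"<omit:([A-Z]+)>")

# Illustrative "prototypical word" per category; these entries are assumptions,
# not the table used in the paper.
PROTOTYPES = {"DET": "le", "PREP": "de", "VERB": "est", "PRON": "il"}

MASK = "xxx"  # method 1: the same mask token for every omission


def fill_with_mask(text: str) -> str:
    """Method 1: replace every omission marker with the mask token."""
    return OMISSION.sub(MASK, text)


def fill_with_prototype(text: str) -> str:
    """Method 2: replace each marker with the prototypical word of its category.

    Categories missing from PROTOTYPES fall back to the mask token.
    """
    return OMISSION.sub(lambda m: PROTOTYPES.get(m.group(1), MASK), text)


if __name__ == "__main__":
    sample = "<omit:DET> chat dort sur <omit:DET> lit"
    print(fill_with_mask(sample))
    print(fill_with_prototype(sample))
```

The third method would replace the fixed substitution with a prediction from the FlauBERT language model at each omission site. That step is left out here because it depends on model loading details the abstract does not specify.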

Barletta, M., Ponton, C. (2025). La question de la normalisation des écrits scolaires pour leur traitement automatique. Le cas de l’omission de mots. CORPUS, 26 [10.4000/1364v].

La question de la normalisation des écrits scolaires pour leur traitement automatique. Le cas de l’omission de mots

Barletta, Martina (first author); Ponton, C.
2025

Type: Journal article - Scientific article
Keywords (English): children corpus, NLP, normalization, morphosyntactic analysis
Keywords (French): écrits scolaires, TAL, normalisation, analyse morphosyntaxique
Language: French
Year: 2025
Volume: 26
Access: open
Files in this record:
File: corpus-9322.pdf
Access: open access
Attachment type: Publisher’s Version (Version of Record, VoR)
License: publisher-specific open access license
Size: 340.66 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/539081