
A Benchmark to Evaluate LLMs’ Proficiency on Italian Student Competencies

Mercorio, Fabio; Mezzanzanica, Mario; Potertì, Daniele; Serino, Antonio; Seveso, Andrea
2025

Abstract

Recent advancements in Large Language Models (LLMs) have significantly enhanced their ability to generate and manipulate human language, highlighting their potential across various applications. Evaluating LLMs in languages other than English is crucial for ensuring their linguistic versatility, cultural relevance, and applicability in diverse global contexts, thus broadening their usability and effectiveness. We tackle this challenge by introducing a structured benchmark based on the INVALSI tests, a set of well-established assessments designed to measure educational competencies across Italy. Our study makes three primary contributions: first, we adapt the INVALSI tests as a benchmark for automated LLM evaluation, rigorously restructuring the test format to suit automated processing while preserving the essence of the original assessments; second, we provide a detailed assessment of current LLMs, offering a crucial reference point for the academic community; finally, we visually compare the performance of these models against human results. The benchmark is publicly available together with a comprehensive evaluation suite (https://github.com/Crisp-Unimib/INVALSI-Eval-Suite), ensuring that it remains a current and valuable resource for advancing industrial-strength NLP applications.
Type: paper
Keywords: invalsi; artificial intelligence; NLP; LLM
Language: English
Event: European Conference, ECML PKDD 2025, September 15–19, 2025
Editors: Pfahringer, B; Japkowicz, N; Larrañaga, P; Ribeiro, RP; Dutra, I; Pechenizkiy, M; Cortez, P; Pashami, S; Jorge, AM; Soares, C; Abreu, PH; Gama, J
Book title: Machine Learning and Knowledge Discovery in Databases. Research Track and Applied Data Science Track European Conference, ECML PKDD 2025, Porto, Portugal, September 15–19, 2025, Proceedings, Part VIII
ISBN: 9783662722428
Publication date: 4 Oct 2025
Year: 2025
Pages: 292-309
Citation: Mercorio, F., Mezzanzanica, M., Potertì, D., Serino, A., Seveso, A. (2025). A Benchmark to Evaluate LLMs’ Proficiency on Italian Student Competencies. In Machine Learning and Knowledge Discovery in Databases. Research Track and Applied Data Science Track European Conference, ECML PKDD 2025, Porto, Portugal, September 15–19, 2025, Proceedings, Part VIII (pp. 292-309). Springer Berlin, Heidelberg [10.1007/978-3-662-72243-5_17].
Files in this item:
No files are associated with this item.

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise stated.

Use this identifier to cite or link to this item: https://hdl.handle.net/10281/569582