We present a new gold-standard dataset and a benchmark for the Research Theme Identification task, a sub-task of the Scholarly Knowledge Graph Generation shared task at the 3rd Workshop on Scholarly Document Processing. The objective of the shared task was to label given research papers with research themes from a total of 36 themes. The benchmark was compiled using data drawn from the largest overall assessment of university research output ever undertaken globally (the Research Excellence Framework 2014). We compare traditional machine learning models with a transformer-based ensemble that obtains multiple predictions for a research paper from its multiple textual fields (e.g. title, abstract, references). The ensemble enriches the initial data using additional information from open-access digital libraries and Argumentative Zoning techniques (Teufel et al., 1999b). It aggregates the multiple predictions with a weighted sum to obtain a single final prediction for the given research paper. The data and the ensemble are publicly available at https://www.kaggle.com/ and https://github.com/ProjectDoSSIER/sdp2022, respectively.
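The weighted sum aggregation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the field names, the weights, the toy score vectors, the helper `peaked_scores`, and the use of an argmax to pick the final theme are all assumptions made for illustration. Only the number of themes (36) comes from the abstract.

```python
from typing import Dict, List, Tuple

NUM_THEMES = 36  # number of research themes stated in the abstract


def aggregate_predictions(
    field_scores: Dict[str, List[float]],
    field_weights: Dict[str, float],
) -> Tuple[int, List[float]]:
    """Combine per-field score vectors with a weighted sum.

    Returns the index of the highest-scoring theme and the combined scores.
    Choosing the argmax as the final prediction is an assumption.
    """
    combined = [0.0] * NUM_THEMES
    for field, scores in field_scores.items():
        if len(scores) != NUM_THEMES:
            raise ValueError(f"field {field!r} must have {NUM_THEMES} scores")
        weight = field_weights[field]
        for i, score in enumerate(scores):
            combined[i] += weight * score
    best = max(range(NUM_THEMES), key=combined.__getitem__)
    return best, combined


def peaked_scores(theme: int, confidence: float) -> List[float]:
    """Hypothetical helper: a toy distribution peaked at one theme."""
    rest = (1.0 - confidence) / (NUM_THEMES - 1)
    return [confidence if i == theme else rest for i in range(NUM_THEMES)]


# Toy per-field predictions (hypothetical values, not results from the paper).
field_scores = {
    "title": peaked_scores(3, 0.6),
    "abstract": peaked_scores(7, 0.5),
    "reference": peaked_scores(3, 0.4),
}
# Hypothetical weights; the paper's actual weights are not given in this record.
field_weights = {"title": 0.3, "abstract": 0.5, "reference": 0.2}

predicted_theme, combined_scores = aggregate_predictions(field_scores, field_weights)
```

In this toy setup the title and reference fields agree on one theme and together outweigh the abstract field, which shows how a weighted sum can let several weaker signals override a single stronger one.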

Mendoza, O., Kusa, W., El-Ebshihy, A., Wu, R., Pride, D., Knoth, P., et al. (2022). Benchmark for Research Theme Classification of Scholarly Documents. In Proceedings - International Conference on Computational Linguistics, COLING (pp. 253-262). Association for Computational Linguistics (ACL).

Benchmark for Research Theme Classification of Scholarly Documents

Pasi G.;
2022

paper
Digital libraries; HTTP; Hypertext systems; Knowledge graph
English
3rd Workshop on Scholarly Document Processing, SDP 2022 at 29th International Conference on Computational Linguistics, COLING 2022 - 12 October 2022 through 17 October 2022
2022
Cohan, A; Feigenblat, G; Freitag, D; Ghosal, T; Herrmannova, D; Knoth, P; Lo, K; Mayr, P; Shmueli-Scheuer, M; de Waard, A; Wang, LL
Proceedings - International Conference on Computational Linguistics, COLING
2022
29
9
253
262
none

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/557165