Bicocca Open Archive

Buscaldi, D., Dessi, D., Motta, E., Murgia, M., Osborne, F., Reforgiato Recupero, D. (2024). Citation prediction by leveraging transformers and natural language processing heuristics. INFORMATION PROCESSING & MANAGEMENT, 61(1 (January 2024)) [10.1016/j.ipm.2023.103583].

Citation prediction by leveraging transformers and natural language processing heuristics

Buscaldi, D; Dessi, D; Motta, E; Murgia, M; Osborne, F; Reforgiato Recupero, D

2024

Abstract

In scientific papers, it is common practice to cite other articles to substantiate claims, provide evidence for factual assertions, reference limitations and research gaps, and fulfill various other purposes. When authors include a citation in a given sentence, they need to consider two things: (i) where in the sentence to place the citation and (ii) which citation to choose to support the underlying claim. In this paper, we focus on the first task, as it allows multiple potential approaches that depend on the researcher's individual style and on the specific norms and conventions of the relevant scientific community. We propose two automatic methodologies that leverage the transformer architecture to solve either a Mask-Filling problem or a Named Entity Recognition problem. On top of the results of the proposed methodologies, we apply ad hoc Natural Language Processing heuristics to further improve their outcome. We also introduce s2orc-9K, an open dataset for fine-tuning models on this task. A formal evaluation demonstrates that the generative approach significantly outperforms five alternative methods when fine-tuned on the novel dataset. Furthermore, this model's results show no statistically significant deviation from the outputs of three senior researchers.
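
As an illustration only, the two problem framings named in the abstract could be set up as in the sketch below. This is not the authors' implementation, which this record does not describe: the `[MASK]` token, the `[CIT]` placeholder, the `B-CIT`/`O` label scheme, and both function names are assumptions made for the example. The first function builds one candidate sentence per possible citation slot for a mask-filling model to score; the second turns a sentence with citation placeholders into token-level labels for a NER-style tagger.

```python
from typing import List, Tuple

MASK = "[MASK]"  # assumed BERT-style mask token; other models use different tokens
CITE = "[CIT]"   # hypothetical placeholder marking where a citation appears


def mask_filling_candidates(tokens: List[str]) -> List[str]:
    """Return one candidate sentence per slot, with the mask inserted after each token."""
    candidates = []
    for i in range(1, len(tokens) + 1):
        candidates.append(" ".join(tokens[:i] + [MASK] + tokens[i:]))
    return candidates


def ner_labels(tokens_with_citations: List[str]) -> Tuple[List[str], List[str]]:
    """Drop citation placeholders and tag the token right before each one as B-CIT."""
    tokens: List[str] = []
    labels: List[str] = []
    for tok in tokens_with_citations:
        if tok == CITE:
            if labels:  # a placeholder at sentence start has no preceding token to tag
                labels[-1] = "B-CIT"
        else:
            tokens.append(tok)
            labels.append("O")
    return tokens, labels


if __name__ == "__main__":
    sentence = ["BERT", "improves", "parsing", "."]
    for candidate in mask_filling_candidates(sentence):
        print(candidate)
    print(ner_labels(["BERT", CITE, "improves", "parsing", "."]))
```

In a setup like this, a fine-tuned model would rank the masked candidates or predict the labels, and any post-processing heuristics would be applied to its output; those steps are outside the scope of the sketch.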

Subtype	Journal article - Scientific article
Keywords	BERT; Citation prediction; Mask-filling; Named entity recognition; Transformers architecture
Content language	English
Ahead-of-print date or date of first online publication	16 Nov 2023
Publication date	2024
Journal	INFORMATION PROCESSING & MANAGEMENT
Volume number	61
Issue	1 (January 2024)
Article number	103583
Article DOI	https://dx.doi.org/10.1016/j.ipm.2023.103583
Full text	open
Citation	Buscaldi, D., Dessi, D., Motta, E., Murgia, M., Osborne, F., Reforgiato Recupero, D. (2024). Citation prediction by leveraging transformers and natural language processing heuristics. INFORMATION PROCESSING & MANAGEMENT, 61(1 (January 2024)) [10.1016/j.ipm.2023.103583].
Appears in types	01 - Journal article

Files in this item:

File	Size	Format
Buscaldi-2024-Inform Process Manag-VoR.pdf (open access; description: Research Article; attachment type: Publisher's Version (Version of Record, VoR); license: Creative Commons)	1.12 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/453079
