BERToldo, the Historical BERT for Italian

Recent work in historical language processing has shown that transformer-based models can be successfully created using historical corpora, and that using them to analyse and classify data from the past can be beneficial compared to standard transformer models. This has led to the creation of BERT-like models for different languages, trained on digital repositories from the past. In this work we introduce the Italian version of historical BERT, which we call BERToldo. We evaluate the model on the task of PoS-tagging Dante Alighieri’s works, considering not only tagger performance but also model size and the time needed to train the model. We also address the problem of duplicated data, which is rather common for languages with limited availability of historical corpora. We show that deduplication reduces training time without affecting performance. The model and its smaller versions are all made available to the research community.

BERToldo, the Historical BERT for Italian / Palmero Aprosio, Alessio; Menini, Stefano; Tonelli, Sara. - (2022), pp. 68-72. (Paper presented at the Second Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA 2022), held in Marseille, France, on 25 June 2022).

Palmero Aprosio, Alessio;Menini, Stefano;Tonelli, Sara

2022-01-01

Date of publication	2022
Proceedings title	Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages
Place of publication	Marseille, France
Publisher	European Language Resources Association
All authors	Palmero Aprosio, Alessio; Menini, Stefano; Tonelli, Sara
Appears in categories	04.1 Paper in Proceedings

Files in this item:

File	Size	Format
2022.lt4hala-1.10.pdf (open access; type: publisher’s layout; license: Creative Commons)	167.46 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/412718
