KEVLAR: the Complete Resource for EuroVoc Classification of Legal Documents

Bocchi, Lorenzo; Casula, Camilla; Palmero Aprosio, Alessio

The use of Machine Learning and Artificial Intelligence in the Public Administration (PA) has increased in recent years. In particular, recent guidelines proposed by various governments recommend the EuroVoc thesaurus for classifying documents released by the PA. In this paper, we present KEVLAR, an all-in-one solution for performing this task on Public Administration acts. First, we create a collection of 8 million documents in 24 languages, tagged with EuroVoc labels and taken from EUR-Lex, the web portal of European Union legislation. Then, we train several pre-trained BERT-based models, comparing the performance of base models with domain-specific and multilingual ones. We release the corpus, the best-performing models, and a Docker image containing the source code of the trainer, the REST API, and the web interface. This image can be used out of the box for document classification.

KEVLAR: the Complete Resource for EuroVoc Classification of Legal Documents / Bocchi, L., Casula, C., Palmero Aprosio, A. - Electronic. - 3878:09(2024), pp. 1-8. (10th Italian Conference on Computational Linguistics, CLiC-it 2024, Pisa, Italy, December 4-6, 2024).


2024-01-01



Date of publication: 2024
Proceedings title: CEUR Workshop Proceedings
Place of publication: Aachen, Germany
Publisher: CEUR-WS
Reference SSD (valid from 09/05/2024): Sector IINF-05/A - Information Processing Systems; Sector INFO-01/A - Computer Science
Scopus Identifier: 2-s2.0-85214350458
Appears in types: 04.1 Paper in Proceedings

Files in this record:

File	Size	Format
9_main_long.pdf (open access; Description: Paper PDF; Type: Publisher's layout; License: Creative Commons)	1.48 MB	Adobe PDF