Hate speech detection with machine-translated data: The role of annotation scheme, class imbalance and undersampling

Hate speech detection with machine-translated data: The role of annotation scheme, class imbalance and undersampling / Casula, C.; Tonelli, S. - 2769:(2020). (7th Italian Conference on Computational Linguistics, CLiC-it 2020, Bologna, 1 March - 3 March 2021).

Casula C.; Tonelli S.

2020-01-01

Abstract

Using machine-translated data for supervised training can alleviate data sparseness when dealing with less-resourced languages. However, it is important that the source data are not only correctly translated, but also follow the same annotation scheme as the smaller target-language dataset and, where possible, have a similar class balance. We therefore evaluate hate speech detection in Italian using data machine-translated from English, comparing three settings in order to understand the impact of training size, class distribution and annotation scheme.

Date of publication: 2020
Proceedings title: CEUR Workshop Proceedings
Place of publication: Bologna
Publisher: CEUR-WS
Scopus Identifier: 2-s2.0-85097910280
All authors: Casula, C.; Tonelli, S.
Appears in types: 04.1 Paper in Proceedings

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/330507
