This paper introduces the MuLTa-Telegram dataset, a Multi-Lingual and multi-Target dataset specifically developed to detect hate speech on Telegram, an understudied yet influential platform on which extremist and fringe content can be found. The dataset contains about 4,000 Telegram messages in Italian and Polish, annotated for the presence of hate speech and its targets; mentions of target identity groups are annotated even when no hate is expressed. Unlike most existing hate speech datasets, which focus on a single target group, our dataset is explicitly designed to capture a diverse range of targets, ensuring a broad and representative sample of hateful (and non-hateful) content. Our work addresses the growing need for updated hate speech datasets, as many existing resources are based on platforms that no longer provide research-friendly data access, such as Twitter (X). Crucially, we show that training on existing out-of-domain data leads to poor results on Telegram data, underscoring the necessity of in-domain datasets for effective hate speech detection. We evaluate hate speech classification setups in an extensive series of experiments in both languages, including multilingual, multi-task, and LLM-based approaches. We find that incorporating target information leads to the best performance, enabling multilingual generalization. By contrast, classification of specific targets shows much room for improvement across setups.

MuLTa-Telegram: A Fine-Grained Italian and Polish Dataset for Hate Speech and Target Detection / Leonardelli, Elisa; Casula, Camilla; Vecellio Salto, Sebastiano; Bak, Joanna Ewa; Muratore, Elisa; Kolos, Anna; Louf, Thomas; Tonelli, Sara. - (2025). (CLiC-it 2025 - Italian Conference on Computational Linguistics, Cagliari, Italy, September 24-26, 2025).

MuLTa-Telegram: A Fine-Grained Italian and Polish Dataset for Hate Speech and Target Detection

Leonardelli, Elisa; Casula, Camilla; Vecellio Salto, Sebastiano; Bak, Joanna Ewa; Muratore, Elisa; Kolos, Anna; Louf, Thomas; Tonelli, Sara
2025-01-01

2025
Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025)
Cagliari, Italy
Italian Conference on Computational Linguistics (CLiC-it 2025)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/469611