Delving into Qualitative Implications of Synthetic Data for Hate Speech Detection

Casula, Camilla; Vecellio Salto, Sebastiano; Ramponi, Alan; Tonelli, Sara
2024-01-01

Abstract

The use of synthetic data for training models for a variety of NLP tasks is now widespread. However, previous work reports mixed results with regard to its effectiveness on highly subjective tasks such as hate speech detection. In this paper, we present an in-depth qualitative analysis of the potential and specific pitfalls of synthetic data for hate speech detection in English, with 3,500 manually annotated examples. We show that, across different models, synthetic data created through paraphrasing gold texts can improve out-of-distribution robustness from a computational standpoint. However, this comes at a cost: synthetic data fails to reliably reflect the characteristics of real-world data on a number of linguistic dimensions, it results in drastically different class distributions, and it heavily reduces the representation of both specific identity groups and intersectional hate.
2024
EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
Miami, Florida, USA
Association for Computational Linguistics (ACL)
Delving into Qualitative Implications of Synthetic Data for Hate Speech Detection / Casula, Camilla; Vecellio Salto, Sebastiano; Ramponi, Alan; Tonelli, Sara. - (2024), pp. 19709-19726. (2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, USA, 2024) [10.18653/v1/2024.emnlp-main.1099].

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/469574