Rise and Pitfalls of Synthetic Data for Abusive Language Detection

Casula, Camilla

doi:10.15168/11572_436426

Synthetic data has been proposed as a way to potentially mitigate a number of issues with existing models and datasets for online abusive language detection, such as negative psychological impact on annotators, privacy issues, dataset obsolescence, and representation bias. However, previous work on the topic has mostly focused on the downstream task performance of models, paying little attention to other aspects. In this thesis, we carry out a series of experiments and analyses on synthetic data for abusive language detection that go beyond performance, with the goal of assessing both the potential and the pitfalls of synthetic data from a qualitative point of view. More specifically, we study synthetic data for abusive language detection in English, focusing on four aspects: robustness, examining the ability of models trained on synthetic data to generalize to out-of-distribution scenarios; fairness, exploring the representation of identity groups; privacy, exploring the use of entirely synthetic datasets to avoid sharing user-generated data; and quality, through a manual annotation and analysis of how realistic and representative of real data synthetic data can be with regard to abusive language.

Rise and Pitfalls of Synthetic Data for Abusive Language Detection / Casula, Camilla. - (2024 Oct 28), pp. 1-167. [10.15168/11572_436426]