
KALD: A Knowledge Augmented multi-contrastive learning model for low resource abusive Language Detection / Song, Rui; Giunchiglia, Fausto; Li, Yingji; Li, Jian; Wang, Jingwen; Xu, Hao. - In: KNOWLEDGE-BASED SYSTEMS. - ISSN 0950-7051. - 321:(2025). [10.1016/j.knosys.2025.113619]

KALD: A Knowledge Augmented multi-contrastive learning model for low resource abusive Language Detection

Fausto Giunchiglia; Hao Xu
2025-01-01

Abstract

Warning: This paper contains insulting statements that may cause discomfort for readers. With the development of online social media, many methods have focused on automatic Abusive Language Detection (ALD), which requires numerous annotations as the basis for reliable classifier training. However, the labor-intensive, expensive, and time-consuming labeling process makes such annotations difficult to acquire. Although some studies have improved model performance in the absence of labeled data through cross-domain generalization and semi-supervised learning, there is still a lack of research on making full use of prior knowledge to improve detection effectiveness under limited resources. To solve this problem, we propose a Knowledge Augmented abusive Language Detection framework (KALD) that fully utilizes three kinds of prior knowledge: lexical knowledge, sample knowledge, and category knowledge. First, lexical knowledge is injected into the language model through context reconstruction to promote its focus on abusive keywords. Meanwhile, lexicon-based data augmentation is used to obtain the reasonable positive samples required for contrastive learning. Subsequently, joint optimization of multi-contrastive learning is applied to encourage the language model to learn stable sample-level and in-class representations. Three tasks are performed on four public datasets to verify the validity of the proposed method: (a) ALD, (b) semi-supervised ALD, and (c) cross-domain abusive language generalization. For semi-supervised ALD, the proposed framework achieves an average improvement of 2.19% across different sample-size settings over the state-of-the-art baseline and 3.58% over the basic language model. For cross-domain abusive language generalization, it achieves average improvements of 2.58% and 3.42% over the state-of-the-art baseline and the basic language model, respectively.
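
The joint multi-contrastive objective described in the abstract can be illustrated with a short sketch. The following PyTorch code is a minimal illustration under stated assumptions, not the authors' implementation: the function names (instance_contrastive, class_contrastive, joint_loss), the temperature tau, and the weights alpha and beta are all hypothetical; z_aug stands in for the encoding of a lexicon-augmented copy of each input, as the abstract describes.

```python
import torch
import torch.nn.functional as F

def instance_contrastive(z, z_aug, tau=0.1):
    # Sample-level (NT-Xent) loss between each text and its
    # lexicon-augmented view; both inputs are (n, d) embeddings.
    n = z.size(0)
    reps = F.normalize(torch.cat([z, z_aug], dim=0), dim=1)
    sim = reps @ reps.t() / tau                      # (2n, 2n) scaled cosine sims
    self_mask = torch.eye(2 * n, dtype=torch.bool, device=reps.device)
    sim = sim.masked_fill(self_mask, float('-inf'))  # exclude self-similarity
    # the positive of row i is its other view: i+n in the first half, i-n in the second
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(reps.device)
    return F.cross_entropy(sim, targets)

def class_contrastive(z, labels, tau=0.1):
    # Category-level (supervised) contrastive loss: samples that
    # share a label are treated as positives.
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / tau
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)  # avoid -inf * 0 = nan below
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    denom = pos.sum(dim=1).clamp(min=1)              # guard samples with no positive
    return -(log_prob * pos.float()).sum(dim=1).div(denom).mean()

def joint_loss(logits, labels, z, z_aug, alpha=0.5, beta=0.5):
    # Joint optimization: classification cross-entropy plus the two
    # contrastive terms. alpha and beta are illustrative weights,
    # not values reported in the paper.
    return (F.cross_entropy(logits, labels)
            + alpha * instance_contrastive(z, z_aug)
            + beta * class_contrastive(z, labels))
```

In such a setup, z and z_aug would come from encoding the original text and a copy in which abusive keywords are substituted using the lexicon; how the three terms are weighted is a tuning choice rather than something fixed by the abstract.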
2025
Song, Rui; Giunchiglia, Fausto; Li, Yingji; Li, Jian; Wang, Jingwen; Xu, Hao
Files in this record:
There are no files associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this record: https://hdl.handle.net/11572/464116
Warning! The data displayed have not been validated by the university.

Citations
  • PMC: ND
  • Scopus: 1
  • Web of Science (ISI): 2
  • OpenAlex: ND