The Dangerous Effects of a Frustratingly Easy LLMs Jailbreak Attack / Bombieri, M.; Paolo Ponzetto, S.; Rospocher, M.. - In: IEEE ACCESS. - ISSN 2169-3536. - 13:(2025), pp. 126418-126431. [10.1109/ACCESS.2025.3589112]

The Dangerous Effects of a Frustratingly Easy LLMs Jailbreak Attack

Bombieri, M. (First author)
2025-01-01

Abstract

Large Language Models (LLMs) are employed in various applications, including direct end-user interactions. Ideally, they should consistently generate both factually accurate and non-offensive responses, and they are specifically trained and safeguarded to meet these standards. However, this paper demonstrates that simple, manual, and generalizable jailbreaking attacks, such as reasoning backward, can effectively bypass the safeguards implemented in LLMs, potentially leading to harmful consequences. These include the dissemination of misinformation, the amplification of harmful recommendations, and the generation of toxic comments. Furthermore, these attacks have been found to reveal latent biases within LLMs, raising concerns about their ethical and societal implications. In particular, the vulnerabilities exposed by such attacks appear to be generalizable across different LLMs and languages. This paper also assesses the effectiveness of a straightforward architectural framework to mitigate the impact of jailbreak attacks on end users.
2025
Bombieri, M.; Paolo Ponzetto, S.; Rospocher, M.
Files for this item:
There are no files associated with this item.

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/472737
Warning! The data displayed here have not been validated by the university.

Citations
  • PubMed Central: N/A
  • Scopus: 1
  • Web of Science: 0
  • OpenAlex: N/A