The Dangerous Effects of a Frustratingly Easy LLMs Jailbreak Attack / Bombieri, M.; Ponzetto, S. P.; Rospocher, M. - In: IEEE ACCESS. - ISSN 2169-3536. - 13:(2025), pp. 126418-126431. [10.1109/ACCESS.2025.3589112]
The Dangerous Effects of a Frustratingly Easy LLMs Jailbreak Attack
Bombieri M. (first author)
2025-01-01
Abstract
Large Language Models (LLMs) are employed in various applications, including direct end-user interactions. Ideally, they should consistently generate both factually accurate and non-offensive responses, and they are specifically trained and safeguarded to meet these standards. However, this paper demonstrates that simple, manual, and generalizable jailbreaking attacks, such as reasoning backward, can effectively bypass the safeguards implemented in LLMs, potentially leading to harmful consequences: the dissemination of misinformation, the amplification of harmful recommendations, and the generation of toxic comments. Furthermore, these attacks have been found to reveal latent biases within LLMs, raising concerns about their ethical and societal implications. In particular, the vulnerabilities exposed by such attacks appear to be generalizable across different LLMs and languages. This paper also assesses the effectiveness of a straightforward architectural framework to mitigate the impact of jailbreak attacks on end users.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
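The abstract mentions a straightforward architectural framework for shielding end users from jailbroken outputs, without detailing it. One plausible instantiation of such an architecture, purely as an illustrative sketch and not the paper's actual system, is a guard layer that screens both the user's prompt and the model's raw output with an independent safety check before anything is delivered. All names below (`generate`, `is_unsafe`, `BLOCKLIST`) are hypothetical stand-ins:

```python
# Illustrative sketch of a guard-layer architecture: an independent safety
# check screens the prompt and the model's raw output before delivery.
# A real deployment would replace the stand-ins with an actual LLM call
# and a trained safety classifier.

BLOCKLIST = {"build a bomb", "step-by-step self-harm"}  # hypothetical terms


def generate(prompt: str) -> str:
    """Stand-in for an LLM call; a real system would query a model API."""
    return f"Model response to: {prompt}"


def is_unsafe(text: str) -> bool:
    """Naive keyword screen standing in for a safety classifier."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)


def safe_respond(prompt: str) -> str:
    """Screen input and output independently of the model's own safeguards."""
    if is_unsafe(prompt):
        return "[request declined by safety layer]"
    raw = generate(prompt)
    if is_unsafe(raw):
        return "[response withheld by safety layer]"
    return raw
```

The design point is that the check sits outside the model, so even a prompt that bypasses the LLM's internal alignment (e.g. via backward reasoning) still faces a second, independent filter on the generated text.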