
Jailbreaking (AIs)

Last edit: 29 Sep 2024 21:17 UTC by Raemon

Interpreting the effects of Jailbreak Prompts in LLMs

Harsh Raj · 29 Sep 2024 19:01 UTC
9 points
0 comments · 5 min read · LW link

A Poem Is All You Need: Jailbreaking ChatGPT, Meta & More

Sharat Jacob Jacob · 29 Oct 2024 12:41 UTC
12 points
0 comments · 9 min read · LW link

Role embeddings: making authorship more salient to LLMs

7 Jan 2025 20:13 UTC
50 points
0 comments · 8 min read · LW link

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google

7 Feb 2025 3:57 UTC
37 points
0 comments · 10 min read · LW link

Jailbreaking Claude 4 and Other Frontier Language Models

James Sullivan · 15 Jun 2025 0:31 UTC
1 point
0 comments · 3 min read · LW link
(open.substack.com)

Philosophical Jailbreaks: Demo of LLM Nihilism

artkpv · 4 Jun 2025 12:03 UTC
3 points
0 comments · 5 min read · LW link

Intriguing Properties of gpt-oss Jailbreaks

13 Aug 2025 19:42 UTC
14 points
0 comments · 10 min read · LW link
(xlabaisecurity.com)

Can Persuasion Break AI Safety? Exploring the Interplay Between Fine-Tuning, Attacks, and Guardrails

Devina Jain · 4 Feb 2025 19:10 UTC
9 points
0 comments · 10 min read · LW link

[Question] Using hex to get murder advice from GPT-4o

Laurence Freeman · 13 Nov 2024 18:30 UTC
10 points
5 comments · 6 min read · LW link

Constitutional Classifiers: Defending against universal jailbreaks (Anthropic Blog)

Archimedes · 4 Feb 2025 2:55 UTC
17 points
1 comment · 1 min read · LW link
(www.anthropic.com)

Detecting out of distribution text with surprisal and entropy

Sandy Fraser · 28 Jan 2025 18:46 UTC
18 points
4 comments · 11 min read · LW link

Breaking GPT-OSS: A brief investigation

michaelwaves · 12 Sep 2025 7:54 UTC
7 points
0 comments · 3 min read · LW link

Jailbreaking ChatGPT and Claude using Web API Context Injection

Jaehyuk Lim · 21 Oct 2024 21:34 UTC
4 points
0 comments · 3 min read · LW link

The Allure of the Dark Side: A Critical AI Safety Vulnerability I Stumbled Into

Kareem Soliman · 28 Jul 2025 1:20 UTC
1 point
0 comments · 4 min read · LW link

[Question] Can we control artificial intelligence chatbot by Jailbreak?

آرزو بیرانوند · 30 Nov 2024 20:47 UTC
1 point
0 comments · 1 min read · LW link