Jailbreaking (AIs)

Last edit: 29 Sep 2024 21:17 UTC by Raemon

Monitor Jailbreaking: Evading Chain-of-Thought Monitoring Without Encoded Reasoning

Wuschel Schulz · 11 Feb 2026 17:18 UTC
59 points
17 comments · 5 min read

Interpreting the effects of Jailbreak Prompts in LLMs

Harsh Raj · 29 Sep 2024 19:01 UTC
9 points
0 comments · 5 min read

A Poem Is All You Need: Jailbreaking ChatGPT, Meta & More

Sharat Jacob Jacob · 29 Oct 2024 12:41 UTC
12 points
0 comments · 9 min read · 2 reviews

Role embeddings: making authorship more salient to LLMs

7 Jan 2025 20:13 UTC
50 points
0 comments · 8 min read

The Weakest Model in the Selector

Alice Blair · 29 Dec 2025 6:55 UTC
13 points
6 comments · 1 min read

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google

7 Feb 2025 3:57 UTC
37 points
0 comments · 10 min read

Jailbreaking Claude 4 and Other Frontier Language Models

James Sullivan · 15 Jun 2025 0:31 UTC
1 point
0 comments · 3 min read
(open.substack.com)

GDM: Consistency Training Helps Limit Sycophancy and Jailbreaks in Gemini 2.5 Flash

4 Nov 2025 16:25 UTC
53 points
2 comments · 6 min read
(arxiv.org)

Jailbreaking is Empirical Evidence for Inner Misalignment and Against Alignment by Default

Jérémy Andréoletti · 16 Feb 2026 17:49 UTC
51 points
16 comments · 2 min read

Philosophical Jailbreaks: Demo of LLM Nihilism

Artem Karpov · 4 Jun 2025 12:03 UTC
3 points
0 comments · 5 min read

The Temporal Immune System: Cross-Session Behavioral Monitoring as a Fourth Defense Axis

Daniel Bartz · 21 Feb 2026 0:18 UTC
1 point
0 comments · 1 min read

Intriguing Properties of gpt-oss Jailbreaks

13 Aug 2025 19:42 UTC
19 points
0 comments · 10 min read
(xlabaisecurity.com)

Jailbreaks Peak Early, Then Drop: Layer Trajectories in Llama-3.1-70B

James Hoffend · 27 Dec 2025 12:39 UTC
13 points
0 comments · 8 min read

Aletheia: A Multi-Agent Framework for Measuring Cognitive Divergence in Extended-Thinking LLMs

Saadman Rafat · 15 Mar 2026 19:34 UTC
1 point
0 comments · 3 min read

All LLMs, even frontier ones, are much more predictable than you might think

Sabatino Vacchiano · 27 Jan 2026 10:49 UTC
1 point
0 comments · 2 min read

Can Persuasion Break AI Safety? Exploring the Interplay Between Fine-Tuning, Attacks, and Guardrails

Devina Jain · 4 Feb 2025 19:10 UTC
9 points
0 comments · 10 min read

Introducing the XLab AI Security Guide

27 Dec 2025 16:50 UTC
19 points
1 comment · 5 min read

[Question] Using hex to get murder advice from GPT-4o

Laurence Freeman · 13 Nov 2024 18:30 UTC
10 points
5 comments · 6 min read

Constitutional Classifiers: Defending against universal jailbreaks (Anthropic Blog)

Archimedes · 4 Feb 2025 2:55 UTC
17 points
1 comment · 1 min read
(www.anthropic.com)

Exploring the multi-dimensional refusal subspace in reasoning models

Le magicien quantique · 27 Oct 2025 9:03 UTC
5 points
2 comments · 4 min read

Detecting out of distribution text with surprisal and entropy

Sandy Fraser · 28 Jan 2025 18:46 UTC
24 points
4 comments · 11 min read

Breaking GPT-OSS: A brief investigation

michaelwaves · 12 Sep 2025 7:54 UTC
7 points
0 comments · 3 min read

Jailbreaking ChatGPT and Claude using Web API Context Injection

Jaehyuk Lim · 21 Oct 2024 21:34 UTC
4 points
0 comments · 3 min read

The Allure of the Dark Side: A Critical AI Safety Vulnerability I Stumbled Into

Kareem Soliman · 28 Jul 2025 1:20 UTC
1 point
0 comments · 4 min read

Awareness of Manipulation Increases Jailbreak Vulnerability: When LLMs Declare Guideline Violation While Committing It

Lee Oren · 1 Dec 2025 21:01 UTC
1 point
0 comments · 6 min read

[Question] Can we control artificial intelligence chatbot by Jailbreak?

آرزو بیرانوند · 30 Nov 2024 20:47 UTC
1 point
0 comments · 1 min read