Jailbreaking (AIs)

Last edit: 29 Sep 2024 21:17 UTC by Raemon

Monitor Jailbreaking: Evading Chain-of-Thought Monitoring Without Encoded Reasoning

Wuschel Schulz · 11 Feb 2026 17:18 UTC
59 points
17 comments · 5 min read

Interpreting the effects of Jailbreak Prompts in LLMs

Harsh Raj · 29 Sep 2024 19:01 UTC
9 points
0 comments · 5 min read

A Poem Is All You Need: Jailbreaking ChatGPT, Meta & More

Sharat Jacob Jacob · 29 Oct 2024 12:41 UTC
12 points
0 comments · 9 min read · 2 reviews

Role embeddings: making authorship more salient to LLMs

7 Jan 2025 20:13 UTC
50 points
0 comments · 8 min read

The Weakest Model in the Selector

Alice Blair · 29 Dec 2025 6:55 UTC
13 points
6 comments · 1 min read

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google

7 Feb 2025 3:57 UTC
37 points
0 comments · 10 min read

Jailbreaking Claude 4 and Other Frontier Language Models

James Sullivan · 15 Jun 2025 0:31 UTC
1 point
0 comments · 3 min read
(open.substack.com)

GDM: Consistency Training Helps Limit Sycophancy and Jailbreaks in Gemini 2.5 Flash

4 Nov 2025 16:25 UTC
53 points
2 comments · 6 min read
(arxiv.org)

Jailbreaking is Empirical Evidence for Inner Misalignment and Against Alignment by Default

Jérémy Andréoletti · 16 Feb 2026 17:49 UTC
51 points
16 comments · 2 min read

Philosophical Jailbreaks: Demo of LLM Nihilism

Artem Karpov · 4 Jun 2025 12:03 UTC
3 points
0 comments · 5 min read

The Temporal Immune System: Cross-Session Behavioral Monitoring as a Fourth Defense Axis

Daniel Bartz · 21 Feb 2026 0:18 UTC
1 point
0 comments · 1 min read

Intriguing Properties of gpt-oss Jailbreaks

13 Aug 2025 19:42 UTC
19 points
0 comments · 10 min read
(xlabaisecurity.com)

Jailbreaks Peak Early, Then Drop: Layer Trajectories in Llama-3.1-70B

James Hoffend · 27 Dec 2025 12:39 UTC
13 points
0 comments · 8 min read

Aletheia: A Multi-Agent Framework for Measuring Cognitive Divergence in Extended-Thinking LLMs

Saadman Rafat · 15 Mar 2026 19:34 UTC
1 point
0 comments · 3 min read

All LLMs, even frontier ones, are much more predictable than you might think

Sabatino Vacchiano · 27 Jan 2026 10:49 UTC
1 point
0 comments · 2 min read

Can Persuasion Break AI Safety? Exploring the Interplay Between Fine-Tuning, Attacks, and Guardrails

Devina Jain · 4 Feb 2025 19:10 UTC
9 points
0 comments · 10 min read

Introducing the XLab AI Security Guide

27 Dec 2025 16:50 UTC
19 points
1 comment · 5 min read

[Question] Using hex to get murder advice from GPT-4o

Laurence Freeman · 13 Nov 2024 18:30 UTC
10 points
5 comments · 6 min read

Constitutional Classifiers: Defending against universal jailbreaks (Anthropic Blog)

Archimedes · 4 Feb 2025 2:55 UTC
17 points
1 comment · 1 min read
(www.anthropic.com)

Exploring the multi-dimensional refusal subspace in reasoning models

Le magicien quantique · 27 Oct 2025 9:03 UTC
5 points
2 comments · 4 min read

Detecting out of distribution text with surprisal and entropy

Sandy Fraser · 28 Jan 2025 18:46 UTC
24 points
4 comments · 11 min read

Breaking GPT-OSS: A brief investigation

michaelwaves · 12 Sep 2025 7:54 UTC
7 points
0 comments · 3 min read

Jailbreaking ChatGPT and Claude using Web API Context Injection

Jaehyuk Lim · 21 Oct 2024 21:34 UTC
4 points
0 comments · 3 min read

The Allure of the Dark Side: A Critical AI Safety Vulnerability I Stumbled Into

Kareem Soliman · 28 Jul 2025 1:20 UTC
1 point
0 comments · 4 min read

Awareness of Manipulation Increases Jailbreak Vulnerability: When LLMs Declare Guideline Violation While Committing It

Lee Oren · 1 Dec 2025 21:01 UTC
1 point
0 comments · 6 min read

[Question] Can we control artificial intelligence chatbot by Jailbreak?

آرزو بیرانوند · 30 Nov 2024 20:47 UTC
1 point
0 comments · 1 min read