Jailbreaking (AIs)

Last edit: 29 Sep 2024 21:17 UTC by Raemon

Interpreting the effects of Jailbreak Prompts in LLMs

Harsh Raj · 29 Sep 2024 19:01 UTC
8 points
0 comments · 5 min read · LW link

A Poem Is All You Need: Jailbreaking ChatGPT, Meta & More

Sharat Jacob Jacob · 29 Oct 2024 12:41 UTC
12 points
0 comments · 9 min read · LW link

Role embeddings: making authorship more salient to LLMs

7 Jan 2025 20:13 UTC
50 points
0 comments · 8 min read · LW link

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google

7 Feb 2025 3:57 UTC
37 points
0 comments · 10 min read · LW link

Jailbreaking Claude 4 and Other Frontier Language Models

James Sullivan · 15 Jun 2025 0:31 UTC
1 point
0 comments · 3 min read · LW link
(open.substack.com)

Philosophical Jailbreaks: Demo of LLM Nihilism

Artyom Karpov · 4 Jun 2025 12:03 UTC
3 points
0 comments · 5 min read · LW link

Can Persuasion Break AI Safety? Exploring the Interplay Between Fine-Tuning, Attacks, and Guardrails

Devina Jain · 4 Feb 2025 19:10 UTC
9 points
0 comments · 10 min read · LW link

[Question] Using hex to get murder advice from GPT-4o

Laurence Freeman · 13 Nov 2024 18:30 UTC
10 points
5 comments · 6 min read · LW link

Constitutional Classifiers: Defending against universal jailbreaks (Anthropic Blog)

Archimedes · 4 Feb 2025 2:55 UTC
17 points
1 comment · 1 min read · LW link
(www.anthropic.com)

Detecting out of distribution text with surprisal and entropy

Sandy Fraser · 28 Jan 2025 18:46 UTC
16 points
4 comments · 11 min read · LW link

Jailbreaking ChatGPT and Claude using Web API Context Injection

Jaehyuk Lim · 21 Oct 2024 21:34 UTC
4 points
0 comments · 3 min read · LW link

[Question] Can we control artificial intelligence chatbot by Jailbreak?

آرزو بیرانوند · 30 Nov 2024 20:47 UTC
1 point
0 comments · 1 min read · LW link