Jailbreaking (AIs)
Tag. Last edit: 29 Sep 2024 21:17 UTC by Raemon
Interpreting the effects of Jailbreak Prompts in LLMs
Harsh Raj
29 Sep 2024 19:01 UTC
8
points
0
comments
5
min read
LW
link
A Poem Is All You Need: Jailbreaking ChatGPT, Meta & More
Sharat Jacob Jacob
29 Oct 2024 12:41 UTC
12
points
0
comments
9
min read
LW
link
Role embeddings: making authorship more salient to LLMs
Nina Panickssery and Christopher Ackerman
7 Jan 2025 20:13 UTC
50
points
0
comments
8
min read
LW
link
Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google
ChengCheng, Brendan Murphy, Adrià Garriga-alonso, Yashvardhan Sharma, dsbowen, smallsilo, Yawen Duan, ChrisCundy, Hannah Betts, AdamGleave and Kellin Pelrine
7 Feb 2025 3:57 UTC
37
points
0
comments
10
min read
LW
link
Jailbreaking Claude 4 and Other Frontier Language Models
James Sullivan
15 Jun 2025 0:31 UTC
1
point
0
comments
3
min read
LW
link
(open.substack.com)
Philosophical Jailbreaks: Demo of LLM Nihilism
Artyom Karpov
4 Jun 2025 12:03 UTC
3
points
0
comments
5
min read
LW
link
Can Persuasion Break AI Safety? Exploring the Interplay Between Fine-Tuning, Attacks, and Guardrails
Devina Jain
4 Feb 2025 19:10 UTC
9
points
0
comments
10
min read
LW
link
[Question]
Using hex to get murder advice from GPT-4o
Laurence Freeman
13 Nov 2024 18:30 UTC
10
points
5
comments
6
min read
LW
link
Constitutional Classifiers: Defending against universal jailbreaks (Anthropic Blog)
Archimedes
4 Feb 2025 2:55 UTC
17
points
1
comment
1
min read
LW
link
(www.anthropic.com)
Detecting out of distribution text with surprisal and entropy
Sandy Fraser
28 Jan 2025 18:46 UTC
16
points
4
comments
11
min read
LW
link
Jailbreaking ChatGPT and Claude using Web API Context Injection
Jaehyuk Lim
21 Oct 2024 21:34 UTC
4
points
0
comments
3
min read
LW
link
[Question]
Can we control artificial intelligence chatbot by Jailbreak?
آرزو بیرانوند
30 Nov 2024 20:47 UTC
1
point
0
comments
1
min read
LW
link