Kellin Pelrine
Karma: 164
Investigating Accidental Misalignment: Causal Effects of Fine-Tuning Data on Model Vulnerability
Zhijing Jin, Punya Syon Pandey, samuelsimko and Kellin Pelrine
11 Jun 2025 19:30 UTC
6 points, 0 comments, 5 min read
LW link
Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google
ChengCheng, Brendan Murphy, Adrià Garriga-alonso, Yashvardhan Sharma, dsbowen, smallsilo, Yawen Duan, ChrisCundy, Hannah Betts, AdamGleave and Kellin Pelrine
7 Feb 2025 3:57 UTC
37 points, 0 comments, 10 min read
LW link
GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning
ChengCheng, Brendan Murphy, AdamGleave and Kellin Pelrine
1 Nov 2024 0:10 UTC
18 points, 0 comments, 6 min read
LW link (far.ai)
Even Superhuman Go AIs Have Surprising Failure Modes
AdamGleave, EuanMcLean, Tony Wang, Kellin Pelrine, Tom Tseng, Yawen Duan, Joseph Miller and MichaelDennis
20 Jul 2023 17:31 UTC
130 points, 22 comments, 10 min read
LW link (far.ai)