Arush
Karma: 89
Self-Recognition Finetuning can Reverse and Prevent Emergent Misalignment
Arush, Shawn Zhou, Jiaxin Wen and Shi
15 Mar 2026 0:11 UTC
51 points, 23 comments, 7 min read
Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation
Soroush Pour, rusheb, Quentin FEUILLADE--MONTIXI, Arush and scasper
7 Nov 2023 17:59 UTC
38 points, 2 comments, 2 min read (arxiv.org)