RunjinChen
Karma: 103
Finding “misaligned persona” features in open-weight models
Andy Arditi and RunjinChen
9 Sep 2025 14:15 UTC
49 points, 5 comments, 15 min read
Follow-up experiments on preventative steering
RunjinChen and Andy Arditi
6 Sep 2025 4:25 UTC
34 points, 1 comment, 3 min read
Persona vectors: monitoring and controlling character traits in language models (arxiv.org)
RunjinChen and Andy Arditi
1 Aug 2025 21:19 UTC
26 points, 3 comments, 5 min read