abhayesian
Karma: 370
Trying to become a shoggoth whisperer
Open Source Replication of the Auditing Game Model Organism
abhayesian
14 Dec 2025 2:10 UTC, 20 points, 0 comments, 1 min read (alignment.anthropic.com)
Why Do Some Language Models Fake Alignment While Others Don’t?
abhayesian, John Hughes, Alex Mallen, Jozdien, janus and Fabien Roger
8 Jul 2025 21:49 UTC, 158 points, 14 comments, 5 min read (arxiv.org)
Alignment Faking Revisited: Improved Classifiers and Open Source Extensions
John Hughes, abhayesian, Akbir Khan and Fabien Roger
8 Apr 2025 17:32 UTC, 147 points, 20 comments, 12 min read
Finding Backward Chaining Circuits in Transformers Trained on Tree Search
abhayesian, Jannik Brinkmann and Victor Levoso
28 May 2024 5:29 UTC, 52 points, 1 comment, 9 min read (arxiv.org)