abhayesian

Karma: 445

Trying to become a shoggoth whisperer

AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

abhayesian · 10 Mar 2026 19:31 UTC
74 points
4 comments · 8 min read · LW link
(alignment.anthropic.com)

Open Source Replication of the Auditing Game Model Organism

abhayesian · 14 Dec 2025 2:10 UTC
24 points
0 comments · 1 min read · LW link
(alignment.anthropic.com)

Why Do Some Language Models Fake Alignment While Others Don’t?

8 Jul 2025 21:49 UTC
158 points
14 comments · 5 min read · LW link
(arxiv.org)

Alignment Faking Revisited: Improved Classifiers and Open Source Extensions

8 Apr 2025 17:32 UTC
147 points
20 comments · 12 min read · LW link

Finding Backward Chaining Circuits in Transformers Trained on Tree Search

28 May 2024 5:29 UTC
53 points
1 comment · 9 min read · LW link
(arxiv.org)