abhayesian

Karma: 489

Introspection Adapters: Training LLMs to Report Their Learned Behaviors

keshavs, RowanWang, abhayesian, Sam Marks and SoerenMind

28 Apr 2026 19:02 UTC

41 points

1 comment · 12 min read · LW link

(alignment.anthropic.com)

AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

abhayesian

10 Mar 2026 19:31 UTC

81 points

4 comments · 8 min read · LW link

(alignment.anthropic.com)

Open Source Replication of the Auditing Game Model Organism

abhayesian

14 Dec 2025 2:10 UTC

24 points

0 comments · 1 min read · LW link

(alignment.anthropic.com)

Why Do Some Language Models Fake Alignment While Others Don’t?

abhayesian, John Hughes, Alex Mallen, Jozdien, janus and Fabien Roger

8 Jul 2025 21:49 UTC

159 points

14 comments · 5 min read · LW link

(arxiv.org)

Alignment Faking Revisited: Improved Classifiers and Open Source Extensions

John Hughes, abhayesian, Akbir Khan and Fabien Roger

8 Apr 2025 17:32 UTC

148 points

20 comments · 12 min read · LW link

Finding Backward Chaining Circuits in Transformers Trained on Tree Search

abhayesian, Jannik Brinkmann and Victor Levoso

28 May 2024 5:29 UTC

53 points

1 comment · 9 min read · LW link

(arxiv.org)