Benjamin Wright

Karma: 887

Towards training-time mitigations for alignment faking in RL

16 Dec 2025 21:01 UTC
33 points
1 comment · 5 min read · LW link
(alignment.anthropic.com)

Natural emergent misalignment from reward hacking in production RL

21 Nov 2025 20:00 UTC
261 points
32 comments · 9 min read · LW link

Agentic Misalignment: How LLMs Could Be Insider Threats

20 Jun 2025 22:34 UTC
83 points
13 comments · 6 min read · LW link

Alignment Faking in Large Language Models

18 Dec 2024 17:19 UTC
495 points
85 comments · 10 min read · LW link · 3 reviews

Evaluating Sparse Autoencoders with Board Game Models

2 Aug 2024 19:50 UTC
38 points
1 comment · 9 min read · LW link

Addressing Feature Suppression in SAEs

16 Feb 2024 18:32 UTC
87 points
5 comments · 10 min read · LW link