Nathan Hu

Karma: 156

[Linkpost] Interpreting Language Model Parameters

Lucius Bushnaq, Dan Braun, Oliver Clive-Griffin, Bart Bussmann, Nathan Hu, mivanitskiy, Linda Linsefors and Lee Sharkey

5 May 2026 17:37 UTC

164 points

2 comments · 2 min read · LW link

(www.goodfire.ai)

Training on Documents About Reward Hacking Induces Reward Hacking

evhub and Nathan Hu

21 Jan 2025 21:32 UTC

135 points

15 comments · 2 min read · LW link

(alignment.anthropic.com)