Nathan Hu

Karma: 156

[Linkpost] Interpreting Language Model Parameters

Lucius Bushnaq, Dan Braun, Oliver Clive-Griffin, Bart Bussmann, Nathan Hu, mivanitskiy, Linda Linsefors and Lee Sharkey

5 May 2026 17:37 UTC

164 points

2 comments · 2 min read · LW link

(www.goodfire.ai)

Nathan Hu 12 Apr 2025 21:21 UTC
3 points
2
in reply to: asksathvik’s comment on: Training on Documents About Reward Hacking Induces Reward Hacking
Apologies for the slow response on this. There was an issue with the link; this one should point to the files with the correct access permissions: https://drive.google.com/drive/folders/1QUwJTIqwYH2eskaoRtDRgnMt0YHtyDYA

Nathan Hu 28 Jan 2025 1:04 UTC
LW: 5 AF: 4
4
AF
in reply to: Daniel Kokotajlo’s comment on: Training on Documents About Reward Hacking Induces Reward Hacking
The reduction in reward hacking after SFT or RL on Haiku supports the conjecture that initial conditions matter less than the long-run incentives, especially for less capable models. On the other hand, the alignment faking paper shows evidence that capable models can have “value crystallization.” IMO a main takeaway here is that values and personas we might worry about being locked in can emerge from pre-training. An exciting future model organisms project would be to try to show these two effects together (emergent values from pre-training + lock-in). It's plausible to me that repeating the above experiments, with some changes to the synthetic documents and starting from a stronger base model, might just work.

Training on Documents About Reward Hacking Induces Reward Hacking

evhub and Nathan Hu

21 Jan 2025 21:32 UTC

135 points

15 comments · 2 min read · LW link

(alignment.anthropic.com)