Nathan Hu

Karma: 154

[Linkpost] Interpreting Language Model Parameters

5 May 2026 17:37 UTC
162 points
2 comments · 2 min read · LW link
(www.goodfire.ai)

Training on Documents About Reward Hacking Induces Reward Hacking

21 Jan 2025 21:32 UTC
135 points
15 comments · 2 min read · LW link
(alignment.anthropic.com)