Benjamin Wright

Karma: 887

Towards training-time mitigations for alignment faking in RL

16 Dec 2025 21:01 UTC
33 points
1 comment · 5 min read · LW link
(alignment.anthropic.com)

Natural emergent misalignment from reward hacking in production RL

21 Nov 2025 20:00 UTC
261 points
32 comments · 9 min read · LW link

Agentic Misalignment: How LLMs Could Be Insider Threats

20 Jun 2025 22:34 UTC
83 points
13 comments · 6 min read · LW link

Alignment Faking in Large Language Models

18 Dec 2024 17:19 UTC
495 points
85 comments · 10 min read · LW link · 3 reviews

Evaluating Sparse Autoencoders with Board Game Models

2 Aug 2024 19:50 UTC
38 points
1 comment · 9 min read · LW link

Addressing Feature Suppression in SAEs

16 Feb 2024 18:32 UTC
87 points
5 comments · 10 min read · LW link