Benjamin Wright

Karma: 607

Agentic Misalignment: How LLMs Could be Insider Threats

Jun 20, 2025, 10:34 PM
72 points
12 comments, 6 min read, LW link

Alignment Faking in Large Language Models

Dec 18, 2024, 5:19 PM
489 points
75 comments, 10 min read, LW link

Evaluating Sparse Autoencoders with Board Game Models

Aug 2, 2024, 7:50 PM
38 points
1 comment, 9 min read, LW link

Addressing Feature Suppression in SAEs

Feb 16, 2024, 6:32 PM
87 points
4 comments, 10 min read, LW link