Sam Marks

Karma: 4,185

Discovering Backdoor Triggers

19 Aug 2025 6:24 UTC
56 points
4 comments · 13 min read · LW link

Towards Alignment Auditing as a Numbers-Go-Up Science

Sam Marks · 4 Aug 2025 22:30 UTC
120 points
15 comments · 6 min read · LW link

Building and evaluating alignment auditing agents

24 Jul 2025 19:22 UTC
46 points
1 comment · 5 min read · LW link

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

23 Jul 2025 14:57 UTC
78 points
3 comments · 5 min read · LW link

Principles for Picking Practical Interpretability Projects

Sam Marks · 15 Jul 2025 17:38 UTC
26 points
0 comments · 13 min read · LW link