Sam Marks

Karma: 4,167

Discovering Backdoor Triggers

19 Aug 2025 6:24 UTC
56 points
4 comments
13 min read
LW link

Towards Alignment Auditing as a Numbers-Go-Up Science

Sam Marks
4 Aug 2025 22:30 UTC
114 points
15 comments
6 min read
LW link

Building and evaluating alignment auditing agents

24 Jul 2025 19:22 UTC
46 points
1 comment
5 min read
LW link

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

23 Jul 2025 14:57 UTC
78 points
3 comments
5 min read
LW link

Principles for Picking Practical Interpretability Projects

Sam Marks
15 Jul 2025 17:38 UTC
26 points
0 comments
13 min read
LW link

Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild

2 Jul 2025 16:35 UTC
179 points
25 comments
4 min read
LW link

Unsupervised Elicitation of Language Models

13 Jun 2025 16:15 UTC
55 points
9 comments
2 min read
LW link

Modifying LLM Beliefs with Synthetic Document Finetuning

24 Apr 2025 21:15 UTC
70 points
12 comments
2 min read
LW link
(alignment.anthropic.com)

Downstream applications as validation of interpretability progress

Sam Marks
31 Mar 2025 1:35 UTC
112 points
3 comments
7 min read
LW link

Auditing language models for hidden objectives

13 Mar 2025 19:18 UTC
141 points
15 comments
13 min read
LW link

Recommendations for Technical AI Safety Research Directions

Sam Marks
10 Jan 2025 19:34 UTC
64 points
1 comment
17 min read
LW link
(alignment.anthropic.com)

Alignment Faking in Large Language Models

18 Dec 2024 17:19 UTC
489 points
75 comments
10 min read
LW link

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

11 Dec 2024 6:30 UTC
82 points
6 comments
2 min read
LW link
(www.neuronpedia.org)

Evaluating Sparse Autoencoders with Board Game Models

2 Aug 2024 19:50 UTC
38 points
1 comment
9 min read
LW link

Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight

Sam Marks
18 Apr 2024 16:17 UTC
113 points
10 comments
12 min read
LW link

What’s up with LLMs representing XORs of arbitrary features?

Sam Marks
3 Jan 2024 19:44 UTC
158 points
64 comments
16 min read
LW link

Some open-source dictionaries and dictionary learning infrastructure

Sam Marks
5 Dec 2023 6:05 UTC
46 points
7 comments
5 min read
LW link

Thoughts on open source AI

Sam Marks
3 Nov 2023 15:35 UTC
62 points
17 comments
10 min read
LW link

Turning off lights with model editing

Sam Marks
12 May 2023 20:25 UTC
68 points
5 comments
2 min read
LW link
(arxiv.org)

[Crosspost] ACX 2022 Prediction Contest Results

24 Jan 2023 6:56 UTC
48 points
6 comments
8 min read
LW link