Sam Marks

Karma: 6,023

Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations

7 May 2026 20:21 UTC
213 points
35 comments · 8 min read · LW link

Model Spec Midtraining: Improving How Alignment Training Generalizes

5 May 2026 21:55 UTC
71 points
6 comments · 7 min read · LW link
(alignment.anthropic.com)

Introspection Adapters: Training LLMs to Report Their Learned Behaviors

28 Apr 2026 19:02 UTC
41 points
1 comment · 12 min read · LW link
(alignment.anthropic.com)

Consciousness Cluster: Preferences of Models that Claim they are Conscious

18 Mar 2026 16:06 UTC
92 points
30 comments · 5 min read · LW link

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

9 Mar 2026 18:50 UTC
38 points
3 comments · 5 min read · LW link