aryaj

Karma: 537

How well do models follow their constitutions?

12 Mar 2026 0:07 UTC
94 points
5 comments · 26 min read

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

9 Mar 2026 18:50 UTC
30 points
2 comments · 5 min read

Current activation oracles are hard to use

3 Mar 2026 19:33 UTC
77 points
3 comments · 16 min read

models have some pretty funny attractor states

12 Feb 2026 21:14 UTC
266 points
38 comments · 18 min read

Test your interpretability techniques by de-censoring Chinese models

15 Jan 2026 16:33 UTC
90 points
14 comments · 20 min read