Andy Arditi

Karma: 743

https://andyrdt.com

Finding “misaligned persona” features in open-weight models

9 Sep 2025 14:15 UTC
37 points
5 comments · 15 min read · LW link

Follow-up experiments on preventative steering

6 Sep 2025 4:25 UTC
28 points
1 comment · 3 min read · LW link

Persona vectors: monitoring and controlling character traits in language models

1 Aug 2025 21:19 UTC
25 points
3 comments · 5 min read · LW link
(arxiv.org)

Do models say what they learn?

22 Mar 2025 15:19 UTC
126 points
12 comments · 13 min read · LW link

Finding Features Causally Upstream of Refusal

14 Jan 2025 2:30 UTC
54 points
5 comments · 12 min read · LW link

AI as systems, not just models

21 Dec 2024 23:19 UTC
29 points
0 comments · 7 min read · LW link
(andyrdt.com)

Unlearning via RMU is mostly shallow

23 Jul 2024 16:07 UTC
55 points
4 comments · 6 min read · LW link