Andy Arditi

Karma: 752

https://andyrdt.com

Finding “misaligned persona” features in open-weight models

Sep 9, 2025, 2:15 PM
42 points

17 votes

5 comments · 15 min read · LW link

Follow-up experiments on preventative steering

Sep 6, 2025, 4:25 AM
31 points

9 votes

1 comment · 3 min read · LW link

Persona vectors: monitoring and controlling character traits in language models

Aug 1, 2025, 9:19 PM
25 points

13 votes

3 comments · 5 min read · LW link
(arxiv.org)

Do models say what they learn?

Mar 22, 2025, 3:19 PM
126 points

44 votes

12 comments · 13 min read · LW link

Finding Features Causally Upstream of Refusal

Jan 14, 2025, 2:30 AM
54 points

16 votes

5 comments · 12 min read · LW link

AI as systems, not just models

Andy Arditi · Dec 21, 2024, 11:19 PM
29 points

14 votes

0 comments · 7 min read · LW link
(andyrdt.com)

Unlearning via RMU is mostly shallow

Jul 23, 2024, 4:07 PM
55 points

21 votes

4 comments · 6 min read · LW link