Arush

Karma: 89

Mitigating collusive self-preference by redaction and paraphrasing

2 Apr 2026 8:33 UTC
8 points
0 comments · 6 min read · LW link

Self-Recognition Finetuning can Reverse and Prevent Emergent Misalignment

15 Mar 2026 0:11 UTC
51 points
23 comments · 7 min read · LW link

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation

7 Nov 2023 17:59 UTC
38 points
2 comments · 2 min read · LW link
(arxiv.org)