ValueShift Research

Karma: 18

Experiments on Refusal Shape in LLMs

ValueShift Research, 2 Apr 2026 12:37 UTC
7 points
0 comments, 7 min read

Hello, World of Mechanistic Interpretability

ValueShift Research, 15 Mar 2026 23:36 UTC
8 points
4 comments, 5 min read

First steps into mechanistic interpretability. Refusal is not a single direction

ValueShift Research, 15 Mar 2026 23:36 UTC
1 point
0 comments, 3 min read