Andy Arditi comments on Finding Features Causally Upstream of Refusal

Andy Arditi 15 Jan 2025 2:57 UTC
3 points
Darn, exactly the project I was hoping to do at MATS! :-)
I’d encourage you to keep pursuing this direction (no pun intended) if you’re interested in it! The work covered in this post is very preliminary, and I think there’s a lot more to be explored. Feel free to reach out; I’d be happy to coordinate!
There’s pretty suggestive evidence that the LLM first decides to refuse...
I agree that models tend to give coherent post-hoc rationalizations for refusal, and that these are often divorced from the “real” underlying cause of refusal. In this case, though, the stated refusal reasons do seem to correspond to the specific features being steered along, which is interesting.
Looking through Latent 2213,...
Seems right, nice!