Stanislav Fort comments on An Introduction to Exemplar Partitioning for Mechanistic Interpretability

Stanislav Fort 16 May 2026 17:25 UTC
5 points
0
Interesting! I used a similar technique for a very particular application, namely detecting harmful inputs to LLMs, in Section 5.4 (Applying Out-of-Distribution Detection to Reject Strange or Harmful Requests) of "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback".
- Jessica Rumbelow 16 May 2026 20:41 UTC
  2 points
  0
  Cool! This was input distance from a safe set of prompts, right?
  - Stanislav Fort 21 May 2026 7:21 UTC
    1 point
    0
    Not at all. The distance is measured in the activation space of whichever layer you pick, and roughly the middle layers work best. It's also not, e.g., L2, because we were using a “learned” covariance matrix and getting something like a Mahalanobis distance.
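
To make the difference between L2 and a Mahalanobis-style distance concrete, here is a minimal sketch. It is not the procedure from the paper: it assumes the "learned" covariance is simply the empirical covariance of activations collected from a reference set of prompts at one layer, and it uses synthetic 2-D data in place of real activations. The names `fit_gaussian`, `mahalanobis`, and the ridge term `eps` are hypothetical choices for illustration.

```python
import numpy as np


def fit_gaussian(acts, eps=1e-3):
    """Estimate mean and inverse covariance from reference activations.

    acts: array of shape (n_samples, d), e.g. one layer's activations.
    eps:  small ridge added to the diagonal (an assumption, for stability).
    """
    mean = acts.mean(axis=0)
    centered = acts - mean
    cov = centered.T @ centered / (len(acts) - 1)
    cov += eps * np.eye(cov.shape[0])
    return mean, np.linalg.inv(cov)


def mahalanobis(x, mean, precision):
    """Mahalanobis distance of each row of x from the fitted Gaussian."""
    d = np.atleast_2d(x) - mean
    return np.sqrt(np.einsum("ni,ij,nj->n", d, precision, d))


# Synthetic stand-in for activations: large variance on axis 0, tiny on axis 1.
rng = np.random.default_rng(0)
acts = rng.normal(size=(5000, 2)) * np.array([10.0, 0.1])
mean, precision = fit_gaussian(acts)

# Two points at the same L2 distance from the mean, along different axes.
along_high_var = mean + np.array([5.0, 0.0])
along_low_var = mean + np.array([0.0, 5.0])

l2_high = np.linalg.norm(along_high_var - mean)
l2_low = np.linalg.norm(along_low_var - mean)
m_high = mahalanobis(along_high_var, mean, precision)[0]
m_low = mahalanobis(along_low_var, mean, precision)[0]
```

Both points are equally far from the mean in L2, but the Mahalanobis distance flags only the one displaced along the low-variance direction as unusual, which is the property that makes it a natural out-of-distribution score.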