Interesting! I used a similar technique for a very particular application, namely detecting harmful inputs into LLMs in Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback Section 5.4 (Applying Out-of-Distribution Detection to Reject Strange or Harmful Requests).
Cool! This was input distance from a safe set of prompts, right?
Not all all, the distance is in whatever layer’s activation space you are, where ±middle layers work the best. It’s also not e.g. L2, because we were using a “learned” covariance matrix and getting something like a Mahalanobis distance.
Interesting! I used a similar technique for a very particular application, namely detecting harmful inputs into LLMs in Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback Section 5.4 (Applying Out-of-Distribution Detection to Reject Strange or Harmful Requests).
Cool! This was input distance from a safe set of prompts, right?
Not all all, the distance is in whatever layer’s activation space you are, where ±middle layers work the best. It’s also not e.g. L2, because we were using a “learned” covariance matrix and getting something like a Mahalanobis distance.