Cool! This was input distance from a safe set of prompts, right?
Not all all, the distance is in whatever layer’s activation space you are, where ±middle layers work the best. It’s also not e.g. L2, because we were using a “learned” covariance matrix and getting something like a Mahalanobis distance.
Cool! This was input distance from a safe set of prompts, right?
Not all all, the distance is in whatever layer’s activation space you are, where ±middle layers work the best. It’s also not e.g. L2, because we were using a “learned” covariance matrix and getting something like a Mahalanobis distance.