Jessica Rumbelow comments on An Introduction to Exemplar Partitioning for Mechanistic Interpretability

Jessica Rumbelow 16 May 2026 20:41 UTC
2 points
0
Cool! This was input distance from a safe set of prompts, right?
- Stanislav Fort 21 May 2026 7:21 UTC
  1 point
  0
  Not at all. The distance is computed in the activation space of whichever layer you are looking at, and roughly the middle layers work best. It’s also not, e.g., L2, because we were using a “learned” covariance matrix, which gives something like a Mahalanobis distance.
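
To make the contrast with L2 concrete, here is a minimal sketch of a Mahalanobis-style distance in activation space. It is not the method from the post: the function names, the `eps` regularizer, and the choice to estimate the covariance empirically from a reference set of activations are all assumptions made for illustration, since the comment only says the covariance was "learned".

```python
import numpy as np


def fit_reference(acts, eps=1e-3):
    """Estimate mean and precision from reference activations.

    acts: (n_samples, d) array of activations taken at one layer.
    eps:  hypothetical ridge term so the covariance is invertible.
    Assumption: the covariance is the empirical one; the original
    comment does not say how its "learned" covariance was obtained.
    """
    mu = acts.mean(axis=0)
    cov = np.cov(acts, rowvar=False) + eps * np.eye(acts.shape[1])
    precision = np.linalg.inv(cov)
    return mu, precision


def mahalanobis(x, mu, precision):
    """Distance of activation x from mu under the given precision matrix."""
    diff = x - mu
    return float(np.sqrt(diff @ precision @ diff))


# Usage with synthetic stand-in activations (not real model data).
rng = np.random.default_rng(0)
reference_acts = rng.normal(size=(500, 8))
mu, precision = fit_reference(reference_acts)
query = rng.normal(size=8)
distance = mahalanobis(query, mu, precision)
```

With an identity precision matrix this reduces to the ordinary L2 distance, so the covariance is exactly what makes the two differ: directions in which the reference activations vary a lot count for less.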