I think this is an important point, but IMO there are at least two candidate ways of using SAEs for anomaly detection (in addition to techniques that also make sense for normal, non-sparse autoencoders):
Sometimes you may have a bunch of “untrusted” data, some of which contains anomalies; you just don’t know which of the untrusted data points are anomalous. (In addition, you have some “trusted” data that is guaranteed not to contain anomalies.) Then you could train an SAE on all of the data (including the untrusted portion) and figure out what “normal” SAE features look like based on the trusted data (a rough sketch of this and the next idea is below).
Even for an SAE that’s been trained only on normal data, it seems plausible that some correlations between features would be different for anomalous data, and that this might work better than looking for correlations in the dense basis. As an extreme version of this, you could look for circuits in the SAE basis and use those for anomaly detection.
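To make both of these a bit more concrete, here’s a rough sketch of how the scoring could look, purely as an illustration on my end: it assumes you already have SAE feature activations for the trusted and untrusted data as arrays (the random arrays below are just placeholders for those), and it uses one simple operationalization of each idea, out-of-range feature activations for the first and a Mahalanobis distance over feature co-activations for the second.

```python
# A rough sketch (my own illustration, not an established method). It assumes an SAE
# has already been trained on all of the data (trusted + untrusted) and that the
# arrays below hold its feature activations; here they are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_features = 512

# Placeholder SAE feature activations: rows are data points, columns are SAE features,
# sparse and non-negative like typical (ReLU) SAE activations.
trusted_feats = np.abs(rng.standard_normal((1000, n_features))) * (rng.random((1000, n_features)) < 0.05)
untrusted_feats = np.abs(rng.standard_normal((200, n_features))) * (rng.random((200, n_features)) < 0.05)

# Idea 1: characterize "normal" per-feature behavior on the trusted data, then flag
# untrusted points with features firing outside the range ever seen on trusted data.
trusted_max = trusted_feats.max(axis=0)
out_of_range = (untrusted_feats > trusted_max + 1e-6).sum(axis=1)  # per-point count

# Idea 2: use feature correlations. Fit a mean and (regularized) covariance over
# trusted feature activations and score untrusted points by Mahalanobis distance;
# points whose feature co-activations deviate from the trusted pattern score highly.
mean = trusted_feats.mean(axis=0)
cov = np.cov(trusted_feats, rowvar=False) + 1e-3 * np.eye(n_features)
cov_inv = np.linalg.inv(cov)
centered = untrusted_feats - mean
mahalanobis_sq = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)

# Inspect the most anomalous untrusted points under each score.
print("Top out-of-range points:", np.argsort(-out_of_range)[:10])
print("Top Mahalanobis points:", np.argsort(-mahalanobis_sq)[:10])
```

In practice you’d want something less naive than a full covariance over every feature (e.g. restricting attention to features that actually co-fire, or the circuit-level version mentioned above), but the basic shape, fit statistics on trusted data and score untrusted points against them, is the same for both ideas.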
Overall, I think that if SAEs end up being very useful for mech interp, there’s a decent chance they’ll also be useful for (mechanistic) anomaly detection (a lot of my uncertainty about SAEs applies to both possible applications). Definitely uncertain though, e.g. I could imagine SAEs that are useful for discovering interesting stuff about a network manually, but whose features aren’t the right computational units for actually detecting anomalies. I think that would make SAEs less than maximally useful for mech interp too, but probably non-zero useful.
Even for an SAE that’s been trained only on normal data [...] you could look for circuits in the SAE basis and use those for anomaly detection.
Yeah, this seems somewhat plausible. If automated circuit-finding works, it would certainly detect some anomalies, though I’m uncertain whether it would be weak against adversarial anomalies relative to regular ol’ random anomalies.