Charlie Steiner comments on Charlie Steiner’s Shortform

Charlie Steiner 29 Mar 2024 2:44 UTC
4 points
Even for an SAE that’s been trained only on normal data [...] you could look for circuits in the SAE basis and use those for anomaly detection.
Yeah, this seems somewhat plausible. If automated circuit-finding works, it would certainly detect some anomalies, though I'm uncertain whether it would be weaker against adversarial anomalies than against regular ol' random anomalies.
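To make the idea concrete, here is a toy sketch, not real circuit-finding. It assumes each input has already been reduced to a set of active SAE feature indices at an earlier layer and a later layer, and it treats "which earlier features co-occur with which later features on normal data" as a crude stand-in for circuits. The names `fit_edges` and `anomaly_score` and the example feature indices are hypothetical, made up for illustration.

```python
from itertools import product

def fit_edges(normal_samples):
    """Collect every (upstream, downstream) feature pair seen co-active on normal data.

    Each sample is a tuple (upstream_active, downstream_active) of sets of
    SAE feature indices. Co-activation is an assumed, very rough proxy for a circuit edge.
    """
    edges = set()
    for upstream, downstream in normal_samples:
        edges.update(product(upstream, downstream))
    return edges

def anomaly_score(sample, edges):
    """Fraction of co-active feature pairs in the sample that never appeared on normal data."""
    upstream, downstream = sample
    pairs = list(product(upstream, downstream))
    if not pairs:
        return 0.0  # nothing active, so nothing to flag
    unseen = sum(1 for pair in pairs if pair not in edges)
    return unseen / len(pairs)

# Hypothetical "normal" data: feature 0 goes with 10, feature 1 goes with 11.
normal = [({0}, {10}), ({1}, {11}), ({0, 2}, {10, 12})]
edges = fit_edges(normal)

familiar = ({0}, {10})   # a pairing seen in normal data
crossed = ({1}, {10})    # feature 10 firing without its usual upstream feature

print(anomaly_score(familiar, edges))
print(anomaly_score(crossed, edges))
```

In this toy, an input built only from pairings already seen on normal data scores zero no matter what it does downstream, which is one way the adversarial case could differ from random anomalies.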