Even for an SAE that’s been trained only on normal data [...] you could look for circuits in the SAE basis and use those for anomaly detection.
Yeah, this seems somewhat plausible. If automated circuit-finding works it would certainly detect some anomalies, though I’m uncertain if it’s going to be weak against adversarial anomalies relative to regular ol’ random anomalies.
Yeah, this seems somewhat plausible. If automated circuit-finding works it would certainly detect some anomalies, though I’m uncertain if it’s going to be weak against adversarial anomalies relative to regular ol’ random anomalies.