Some ideas for mechanistic anomaly detection:
Take the convex hull of activations from some safe distribution, and throw a warning when a new activation falls outside that hull by more than a distance threshold (sketched below, after the variants)
Extend to the affine case
Vary which norm we use
What happens if we project back onto this space?
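A minimal sketch of the hull side of this, assuming the safe activations are collected as an (n, d) NumPy array: membership in the convex hull is a linear-programming feasibility problem, and the affine variant reduces to a least-squares projection residual. The distance-to-hull threshold and other norm choices would sit on top of this.

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(x, safe_acts):
    """Is activation x a convex combination of rows of safe_acts?
    Feasibility LP: find lambda >= 0 with sum(lambda) = 1 and safe_acts.T @ lambda = x."""
    n, _ = safe_acts.shape
    A_eq = np.vstack([safe_acts.T, np.ones((1, n))])
    b_eq = np.concatenate([x, [1.0]])
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n, method="highs")
    return res.success

def affine_residual(x, safe_acts):
    """Affine variant: distance from x to the affine hull of the safe activations,
    i.e. the residual left after projecting x onto the span of the centered data."""
    mu = safe_acts.mean(axis=0)
    basis = (safe_acts - mu).T                    # columns span the affine directions
    coef, *_ = np.linalg.lstsq(basis, x - mu, rcond=None)
    return np.linalg.norm(basis @ coef - (x - mu))
```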
Create some simple examples of treacherous turns happening, to test these methods on
Or at least in-the-wild examples of AI doing weird stuff, maybe adversarial inputs?
Maybe hit up the model-organisms people
Outlier detection
Ellipsoidal peeling (Boyd and Vandenberghe, Convex Optimization; see the section on extremal volume ellipsoids)
Score by the increase in volume of the minimum-volume ellipsoid when a new data point is added
Probably overparameterized (though uncertain, since we are dealing with activations!), so maybe add some regularizing norm to account for that
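A rough sketch of that score, with two stand-ins: the minimum-volume ellipsoid is approximated by a shrinkage-regularized covariance ellipsoid (Ledoit-Wolf shrinkage playing the role of the extra norm for overparameterization), and the new point is treated as a rank-one addition to the scatter matrix so the volume change has a closed form.

```python
import numpy as np
from sklearn.covariance import LedoitWolf

def fit_safe_ellipsoid(safe_acts):
    """Fit a shrinkage-regularized covariance ellipsoid to safe activations;
    shrinkage keeps the estimate well-conditioned when d is large relative to n."""
    lw = LedoitWolf().fit(safe_acts)
    return safe_acts.mean(axis=0), lw.covariance_

def log_volume_increase(x, mean, cov):
    """Approximate increase in log-volume of the ellipsoid when x is added,
    treating the addition as cov -> cov + d d^T with d = x - mean.
    By the matrix determinant lemma the log-det change is log(1 + d^T cov^{-1} d),
    and log-volume scales with half the log-det; large values suggest an anomaly."""
    d = x - mean
    return 0.5 * np.log1p(d @ np.linalg.solve(cov, d))
```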
Train an autoregressive network on activations; if its predictions deviate too far from the actual activations, send a warning
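A minimal sketch of this, under the assumption that each layer's activations are mean-pooled into one vector per layer and a small MLP predicts layer l+1 from layer l; the shapes and the threshold calibration are hypothetical.

```python
import torch
import torch.nn as nn

class NextLayerPredictor(nn.Module):
    """Small MLP predicting layer l+1's pooled activation from layer l's."""
    def __init__(self, d_model, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, hidden), nn.ReLU(),
                                 nn.Linear(hidden, d_model))

    def forward(self, h):
        return self.net(h)

def train_on_safe(predictor, safe_layer_acts, epochs=50, lr=1e-3):
    """safe_layer_acts: (n_inputs, n_layers, d_model) tensor of pooled activations
    collected on the safe distribution (hypothetical shape)."""
    opt = torch.optim.Adam(predictor.parameters(), lr=lr)
    src, tgt = safe_layer_acts[:, :-1], safe_layer_acts[:, 1:]
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(predictor(src), tgt)
        loss.backward()
        opt.step()
    return predictor

def anomaly_score(predictor, layer_acts):
    """Mean next-layer prediction error on one input; warn when it exceeds a
    threshold calibrated as a high quantile of scores on the safe distribution."""
    with torch.no_grad():
        err = predictor(layer_acts[:-1]) - layer_acts[1:]
    return err.norm(dim=-1).mean()
```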
Slice the network into sub-networks and distill those sub-networks; send a warning if the ground-truth behavior of a sub-network on some inputs deviates too far from its distillation (see the sketch below)
The models the sub-networks are distilled into should be less expressive and have different inductive biases than the original network. Obviously, no information other than the input-output behavior of those sub-networks should be used
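One possible sketch, assuming each sub-network slice is available as a frozen callable on residual-stream vectors; the student here is a low-rank linear map (strictly less expressive, with a different inductive bias), and it only ever sees the slice's input-output pairs on the safe distribution.

```python
import torch
import torch.nn as nn

def distill_slice(sub_network, safe_inputs, epochs=100, lr=1e-3):
    """Distill one frozen sub-network slice into a smaller student, using only
    its input-output behavior on the safe distribution.
    sub_network: callable mapping (batch, d) residual-stream tensors to (batch, d).
    safe_inputs: (n, d) tensor of the slice's inputs on the safe distribution."""
    with torch.no_grad():
        targets = sub_network(safe_inputs)
    d = safe_inputs.shape[1]
    # Less expressive student with a different inductive bias: a low-rank linear map.
    student = nn.Sequential(nn.Linear(d, 32), nn.Linear(32, d))
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(student(safe_inputs), targets)
        loss.backward()
        opt.step()
    return student

def slice_anomaly(sub_network, student, x, threshold):
    """Warn when the true slice output deviates too far from the distilled one."""
    with torch.no_grad():
        err = torch.norm(sub_network(x) - student(x), dim=-1)
    return err > threshold
```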
Train a model to predict the word saliency of your original transformer on a safe distribution; then, if the true word saliency deviates too much from the prediction, throw a warning (sketched below)
We can do this at different levels too, so that we also try to predict, say, the saliency of the first-layer residual stream with respect to the output
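A sketch of the word-saliency side, assuming a Hugging-Face-style model that accepts `inputs_embeds` and returns `.logits`; the predictor that mimics these scores on the safe distribution is only indicated in the trailing comment.

```python
import torch

def word_saliency(model, embed, input_ids, target_idx):
    """Gradient-norm saliency of each input token for the logit at target_idx.
    `model` and `embed` are the original transformer and its token-embedding layer
    (assumed interfaces); input_ids has shape (1, seq_len)."""
    embs = embed(input_ids).detach().requires_grad_(True)
    logits = model(inputs_embeds=embs).logits
    logits[0, -1, target_idx].backward()
    return embs.grad.norm(dim=-1).squeeze(0)   # one score per input token

# Train a small regressor g(input_ids) -> saliency on the safe distribution, then
# warn at deployment when ||word_saliency(model, embed, ids, tgt) - g(ids)|| exceeds
# a threshold. The same recipe applies at other levels, e.g. saliency of the
# first-layer residual stream with respect to the output.
```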
Instead of training a neural network, we can also do some simple interpolation based on the backprop graph and safe-distribution inputs
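One simple stand-in for that: precompute backprop-based saliencies for the safe inputs, then interpolate a prediction for a new input from its nearest safe neighbors in some feature space. The feature choice and the fixed-length saliency vectors are assumptions.

```python
import numpy as np

def knn_interpolated_saliency(query_feat, safe_feats, safe_saliencies, k=5):
    """Predict saliency for a new input by distance-weighted interpolation over
    its k nearest safe-distribution neighbors; warn when the actual (backprop)
    saliency is far from this prediction.
    query_feat: (d,) feature for the new input (e.g. a pooled activation).
    safe_feats: (n, d) features of the safe inputs.
    safe_saliencies: (n, t) saliency vectors precomputed via backprop."""
    dists = np.linalg.norm(safe_feats - query_feat, axis=1)
    idx = np.argsort(dists)[:k]
    w = 1.0 / (dists[idx] + 1e-8)
    w /= w.sum()
    return (w[:, None] * safe_saliencies[idx]).sum(axis=0)
```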