Some ideas for mechanistic anomaly detection:
Take the convex hull of activations from some safe distribution, and throw a warning when a new activation falls outside that hull by more than a distance threshold (sketched below, after the variants)
Extend to the affine case
Vary which norm we use
What happens if we project back onto this space?
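A minimal sketch of the hull side of this, assuming the safe activations are collected as an (n, d) NumPy array: membership in the convex hull is a linear-programming feasibility problem, and the affine variant reduces to a least-squares projection residual. The distance-to-hull threshold and other norm choices would sit on top of this.

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(x, safe_acts):
    """Is activation x a convex combination of rows of safe_acts?
    Feasibility LP: find lambda >= 0 with sum(lambda) = 1 and safe_acts.T @ lambda = x."""
    n, _ = safe_acts.shape
    A_eq = np.vstack([safe_acts.T, np.ones((1, n))])
    b_eq = np.concatenate([x, [1.0]])
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n, method="highs")
    return res.success

def affine_residual(x, safe_acts):
    """Affine variant: distance from x to the affine hull of the safe activations,
    i.e. the residual left after projecting x onto the span of the centered data."""
    mu = safe_acts.mean(axis=0)
    basis = (safe_acts - mu).T                    # columns span the affine directions
    coef, *_ = np.linalg.lstsq(basis, x - mu, rcond=None)
    return np.linalg.norm(basis @ coef - (x - mu))
```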
Create some simple examples of treacherous turns happening, to test these methods on
Or at least in-the-wild examples of AI doing weird stuff, maybe adversarial inputs?
Maybe hit up the model-organisms people
Outlier detection
Ellipsoidal peeling (Boyd and Vandenberghe, Convex Optimization; see the section on extremal volume ellipsoids)
Score by the increase in volume of the minimum-volume ellipsoid when a new data point is added
Probably overparameterized (though uncertain, since we are dealing with activations!), so maybe add some regularizing norm to account for that
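A rough sketch of that score, with two stand-ins: the minimum-volume ellipsoid is approximated by a shrinkage-regularized covariance ellipsoid (Ledoit-Wolf shrinkage playing the role of the extra norm for overparameterization), and the new point is treated as a rank-one addition to the scatter matrix so the volume change has a closed form.

```python
import numpy as np
from sklearn.covariance import LedoitWolf

def fit_safe_ellipsoid(safe_acts):
    """Fit a shrinkage-regularized covariance ellipsoid to safe activations;
    shrinkage keeps the estimate well-conditioned when d is large relative to n."""
    lw = LedoitWolf().fit(safe_acts)
    return safe_acts.mean(axis=0), lw.covariance_

def log_volume_increase(x, mean, cov):
    """Approximate increase in log-volume of the ellipsoid when x is added,
    treating the addition as cov -> cov + d d^T with d = x - mean.
    By the matrix determinant lemma the log-det change is log(1 + d^T cov^{-1} d),
    and log-volume scales with half the log-det; large values suggest an anomaly."""
    d = x - mean
    return 0.5 * np.log1p(d @ np.linalg.solve(cov, d))
```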
Train an autoregressive network on activations; if its predictions deviate too far from the actual activations, send a warning
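A minimal sketch of this, under the assumption that each layer's activations are mean-pooled into one vector per layer and a small MLP predicts layer l+1 from layer l; the shapes and the threshold calibration are hypothetical.

```python
import torch
import torch.nn as nn

class NextLayerPredictor(nn.Module):
    """Small MLP predicting layer l+1's pooled activation from layer l's."""
    def __init__(self, d_model, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, hidden), nn.ReLU(),
                                 nn.Linear(hidden, d_model))

    def forward(self, h):
        return self.net(h)

def train_on_safe(predictor, safe_layer_acts, epochs=50, lr=1e-3):
    """safe_layer_acts: (n_inputs, n_layers, d_model) tensor of pooled activations
    collected on the safe distribution (hypothetical shape)."""
    opt = torch.optim.Adam(predictor.parameters(), lr=lr)
    src, tgt = safe_layer_acts[:, :-1], safe_layer_acts[:, 1:]
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(predictor(src), tgt)
        loss.backward()
        opt.step()
    return predictor

def anomaly_score(predictor, layer_acts):
    """Mean next-layer prediction error on one input; warn when it exceeds a
    threshold calibrated as a high quantile of scores on the safe distribution."""
    with torch.no_grad():
        err = predictor(layer_acts[:-1]) - layer_acts[1:]
    return err.norm(dim=-1).mean()
```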
Slice the network into sub-networks and distill those sub-networks; send a warning if the ground-truth behavior of a sub-network on some inputs deviates too far from its distillation (see the sketch below)
The models the sub-networks are distilled into should be less expressive and have different inductive biases than the original network. Obviously, no information other than the input-output behavior of those sub-networks should be used
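One possible sketch, assuming each sub-network slice is available as a frozen callable on residual-stream vectors; the student here is a low-rank linear map (strictly less expressive, with a different inductive bias), and it only ever sees the slice's input-output pairs on the safe distribution.

```python
import torch
import torch.nn as nn

def distill_slice(sub_network, safe_inputs, epochs=100, lr=1e-3):
    """Distill one frozen sub-network slice into a smaller student, using only
    its input-output behavior on the safe distribution.
    sub_network: callable mapping (batch, d) residual-stream tensors to (batch, d).
    safe_inputs: (n, d) tensor of the slice's inputs on the safe distribution."""
    with torch.no_grad():
        targets = sub_network(safe_inputs)
    d = safe_inputs.shape[1]
    # Less expressive student with a different inductive bias: a low-rank linear map.
    student = nn.Sequential(nn.Linear(d, 32), nn.Linear(32, d))
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(student(safe_inputs), targets)
        loss.backward()
        opt.step()
    return student

def slice_anomaly(sub_network, student, x, threshold):
    """Warn when the true slice output deviates too far from the distilled one."""
    with torch.no_grad():
        err = torch.norm(sub_network(x) - student(x), dim=-1)
    return err > threshold
```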
Train a model to predict the word saliency of your original transformer on a safe distribution; then, if the true word saliency deviates too much from the prediction, throw a warning (sketched below)
We can do this at different levels too, so that we also try to predict, say, the saliency of the first-layer residual stream with respect to the output
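A sketch of the word-saliency side, assuming a Hugging-Face-style model that accepts `inputs_embeds` and returns `.logits`; the predictor that mimics these scores on the safe distribution is only indicated in the trailing comment.

```python
import torch

def word_saliency(model, embed, input_ids, target_idx):
    """Gradient-norm saliency of each input token for the logit at target_idx.
    `model` and `embed` are the original transformer and its token-embedding layer
    (assumed interfaces); input_ids has shape (1, seq_len)."""
    embs = embed(input_ids).detach().requires_grad_(True)
    logits = model(inputs_embeds=embs).logits
    logits[0, -1, target_idx].backward()
    return embs.grad.norm(dim=-1).squeeze(0)   # one score per input token

# Train a small regressor g(input_ids) -> saliency on the safe distribution, then
# warn at deployment when ||word_saliency(model, embed, ids, tgt) - g(ids)|| exceeds
# a threshold. The same recipe applies at other levels, e.g. saliency of the
# first-layer residual stream with respect to the output.
```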
Instead of training a neural network, we can also do some simple interpolation based on the backprop graph and safe-distribution inputs
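One simple stand-in for that: precompute backprop-based saliencies for the safe inputs, then interpolate a prediction for a new input from its nearest safe neighbors in some feature space. The feature choice and the fixed-length saliency vectors are assumptions.

```python
import numpy as np

def knn_interpolated_saliency(query_feat, safe_feats, safe_saliencies, k=5):
    """Predict saliency for a new input by distance-weighted interpolation over
    its k nearest safe-distribution neighbors; warn when the actual (backprop)
    saliency is far from this prediction.
    query_feat: (d,) feature for the new input (e.g. a pooled activation).
    safe_feats: (n, d) features of the safe inputs.
    safe_saliencies: (n, t) saliency vectors precomputed via backprop."""
    dists = np.linalg.norm(safe_feats - query_feat, axis=1)
    idx = np.argsort(dists)[:k]
    w = 1.0 / (dists[idx] + 1e-8)
    w /= w.sum()
    return (w[:, None] * safe_saliencies[idx]).sum(axis=0)
```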