Maxwell Adam comments on Shortform

Maxwell Adam 1 Feb 2025 9:39 UTC
1 point
0
Ok, so why not just train a model on fake anomaly detection/interp research papers? Fake stories about ‘the bad AI that got caught’, ‘the little AI that overstepped’, etc. I don’t know how to word it, but this seems like something closer to intimidation than alignment, which I don’t think makes much sense as a strategy intended to keep us all alive.
- Cleo Nardo 1 Feb 2025 15:47 UTC
  3 points
  0
  Parent
  I don’t think this works when the AIs are smart and reasoning in-context, which is the case where scheming matters. Also this maybe backfires by making scheming more salient.
  Still, might be worth running an experiment.