Ok, so why not just train a model on fake anomaly detection/interp research papers? Fake stories about ‘the bad AI that got caught’, ‘the little AI that overstepped’, etc. I don’t know how to word it, but this seems like something closer to intimidation than alignment, which I don’t think makes much sense as a strategy intended to keep us all alive.
I don’t think this works when the AIs are smart and reasoning in-context, which is the case where scheming matters. Also this maybe backfires by making scheming more salient.
Ok, so why not just train a model on fake anomaly detection/interp research papers? Fake stories about ‘the bad AI that got caught’, ‘the little AI that overstepped’, etc. I don’t know how to word it, but this seems like something closer to intimidation than alignment, which I don’t think makes much sense as a strategy intended to keep us all alive.
I don’t think this works when the AIs are smart and reasoning in-context, which is the case where scheming matters. Also this maybe backfires by making scheming more salient.
Still, might be worth running an experiment.