Most people think “Oh if we have good mech interp then we can catch our AIs scheming, and stop them from harming us”. I think this is mostly true, but there’s another mechanism at play: if we have good mech interp, our AIs are less likely to scheme in the first place, because they will strategically respond to our ability to detect scheming. This also applies to other safety techniques like Redwood-style control protocols.
Good mech interp might stop scheming even if they never catch any scheming, just how good surveillance stops crime even if it never spots any crime.
I think this really depends on what “good” means exactly. For instance, if humans think it’s good but we overestimate how good our interp is, and the AI system knows this, then the AI system can take advantage of our “good” mech interp to scheme more deceptively.
I’m guessing your notion of good must explicitly mean that this scenario isn’t possible. But this really begs the question—how could we know if our mech interp has reached that level of goodness?
Ok, so why not just train a model on fake anomaly detection/interp research papers? Fake stories about ‘the bad AI that got caught’, ‘the little AI that overstepped’, etc. I don’t know how to word it, but this seems like something closer to intimidation than alignment, which I don’t think makes much sense as a strategy intended to keep us all alive.
I don’t think this works when the AIs are smart and reasoning in-context, which is the case where scheming matters. Also this maybe backfires by making scheming more salient.
Most people think “Oh if we have good mech interp then we can catch our AIs scheming, and stop them from harming us”. I think this is mostly true, but there’s another mechanism at play: if we have good mech interp, our AIs are less likely to scheme in the first place, because they will strategically respond to our ability to detect scheming. This also applies to other safety techniques like Redwood-style control protocols.
Good mech interp might stop scheming even if they never catch any scheming, just how good surveillance stops crime even if it never spots any crime.
I think this really depends on what “good” means exactly. For instance, if humans think it’s good but we overestimate how good our interp is, and the AI system knows this, then the AI system can take advantage of our “good” mech interp to scheme more deceptively.
I’m guessing your notion of good must explicitly mean that this scenario isn’t possible. But this really begs the question—how could we know if our mech interp has reached that level of goodness?
Ok, so why not just train a model on fake anomaly detection/interp research papers? Fake stories about ‘the bad AI that got caught’, ‘the little AI that overstepped’, etc. I don’t know how to word it, but this seems like something closer to intimidation than alignment, which I don’t think makes much sense as a strategy intended to keep us all alive.
I don’t think this works when the AIs are smart and reasoning in-context, which is the case where scheming matters. Also this maybe backfires by making scheming more salient.
Still, might be worth running an experiment.