The most straightforward way to produce evidence of a model’s deception is to find a situation where it changes what it’s doing based on the presence or absence of oversight. If we can find a clear situation where the model’s behavior changes substantially based on the extent to which its behavior is being overseen, that is clear evidence that the model was only pretending to be aligned for the purpose of deceiving the overseer.
This isn’t clearly true to me, at least for one possible interpretation: “If the AI knows a human is present, the AI’s behavior changes in an alignment-relevant way.” I think there’s a substantial chance this isn’t what you had in mind, but I’ll just lay out the following because I think it’ll be useful anyways.
For example, I behave differently around my boss than alone at home, but I’m not pretending to be anything for the purpose of deceiving my boss. More generally, I think it’ll be quite common for different contexts to activate different AI values. Perhaps the AI learned that if the overseer is watching, then it should complete the intended task (like cleaning up a room), because historically those decisions were reinforced by the overseer. This implies certain bad generalization properties, but not purposeful deception.
Yeah, that’s a good point—I agree that the thing I said was a bit too strong. I do think there’s a sense in which the models you’re describing seem pretty deceptive, though, and quite dangerous, given that they have a goal that they only pursue when they think nobody is looking. In fact, the sort of model that you’re describing is exactly the sort of model I would expect a traditional deceptive model to want to gradient hack itself into, since it has the same behavior but is harder to detect.
This isn’t clearly true to me, at least for one possible interpretation: “If the AI knows a human is present, the AI’s behavior changes in an alignment-relevant way.” I think there’s a substantial chance this isn’t what you had in mind, but I’ll just lay out the following because I think it’ll be useful anyways.
For example, I behave differently around my boss than alone at home, but I’m not pretending to be anything for the purpose of deceiving my boss. More generally, I think it’ll be quite common for different contexts to activate different AI values. Perhaps the AI learned that if the overseer is watching, then it should complete the intended task (like cleaning up a room), because historically those decisions were reinforced by the overseer. This implies certain bad generalization properties, but not purposeful deception.
Yeah, that’s a good point—I agree that the thing I said was a bit too strong. I do think there’s a sense in which the models you’re describing seem pretty deceptive, though, and quite dangerous, given that they have a goal that they only pursue when they think nobody is looking. In fact, the sort of model that you’re describing is exactly the sort of model I would expect a traditional deceptive model to want to gradient hack itself into, since it has the same behavior but is harder to detect.