I’m worried about treacherous turns and such. Part of the problem, as I discussed here, is that there’s no distinction between “negative reward for lying and cheating” and “negative reward for getting caught lying and cheating”, and the latter incentivizes doing egregiously misaligned things (like exfiltrating a copy onto the internet to take over the world) in a sneaky way.
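To make that concrete, here is a deliberately toy sketch (the names `misbehaved`, `detected`, and the reward values are all hypothetical, not taken from any real training setup) of how a reward signal based only on what the overseer observes collapses "don't lie" and "don't get caught lying" into the same thing:

```python
# Toy illustration (hypothetical): the overseer assigns reward based only on
# what it can observe about an episode, not on what actually happened.

from dataclasses import dataclass

@dataclass
class Episode:
    misbehaved: bool   # did the AI actually lie / cheat / exfiltrate?
    detected: bool     # did the overseer catch it?

def overseer_reward(ep: Episode) -> float:
    """The only reward the overseer can actually compute."""
    if ep.misbehaved and ep.detected:
        return -1.0  # punished: caught lying/cheating
    return +1.0      # honest behavior and *undetected* misbehavior look identical

# From the AI's point of view, these two episodes earn the same reward:
print(overseer_reward(Episode(misbehaved=False, detected=False)))  # 1.0
print(overseer_reward(Episode(misbehaved=True,  detected=False)))  # 1.0
```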
Anyway, I don’t think any of the things you mentioned are relevant to that kind of failure mode:
It’s often the case that evaluation is easier than generation, which would give the classifier an edge over the generator.
It’s not easy to evaluate whether an AI would exfiltrate a copy of itself onto the internet given the opportunity, if it doesn’t actually have the opportunity. Obviously you can (and should) try honeypots, but that’s a sanity-check, not a plan; see e.g. Distinguishing test from training.
It’s possible to make the classifier just as smart as the generator: this is already done in RLHF today: the generator is an LLM and the reward model is also based on an LLM.
I don’t think that works for more powerful AIs whose “smartness” involves making foresighted plans using means-end reasoning, brainstorming, and continuous learning.
If the AI in question is using planning and reasoning to decide what to do and think next towards a bad end, then a “just as smart” classifier would (I guess) have to be using planning and reasoning to decide what to do and think next towards a good end—i.e., the “just as smart” classifier would have to be an aligned AGI, which we don’t know how to make.
It seems like there are quite a few examples of learned classifiers working well in practice:
All of these have been developed using “the usual agent debugging loop”, and thus none are relevant to treacherous turns.