Thank you for the reply!
Ok but I still feel somewhat more optimistic about reward learning working. Here are some reasons:
It’s often the case that evaluation is easier than generation, which would give the classifier an edge over the generator.
It’s possible to make the classifier just as smart as the generator: this is already done in RLHF today, where the generator is an LLM and the reward model is also based on an LLM (a rough sketch of that setup is below).
It seems like there are quite a few examples of learned classifiers working well in practice:
It’s hard to write spam that gets past an email spam classifier.
It’s hard to jailbreak LLMs.
It’s hard to write a bad paper that is accepted to a top ML conference or a bad blog post that gets lots of upvotes.
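To make the reward-model point concrete, here’s a minimal sketch of the kind of setup I have in mind: a reward model built from an LLM-style backbone plus a scalar head, trained with the standard pairwise (Bradley-Terry) preference loss. The class names, sizes, and tiny stand-in backbone are all made up for illustration; a real RLHF pipeline would initialize the backbone from the same pretrained LLM family as the policy, which is what makes the classifier “just as smart” as the generator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Illustrative RLHF-style reward model: an LLM backbone plus a scalar head."""

    def __init__(self, vocab_size: int = 32000, hidden: int = 512):
        super().__init__()
        # Tiny stand-in backbone; a real setup loads a pretrained transformer here.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.reward_head = nn.Linear(hidden, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.embed(token_ids))            # (batch, seq, hidden)
        return self.reward_head(h[:, -1, :]).squeeze(-1)   # one scalar per sequence

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry pairwise loss used in RLHF reward modeling: push the
    # human-preferred completion's score above the rejected one's.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```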
That said, from what I’ve read, researchers doing RL with verifiable rewards on LLMs (e.g. see the DeepSeek R1 paper) have so far only had success with rule-based rewards rather than learned reward functions. Quote from the DeepSeek R1 paper:
We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process, and retraining the reward model needs additional training resources and it complicates the whole training pipeline.
So I think we’ll have to wait and see if people can successfully train LLMs to solve hard problems using learned RL reward functions in a way similar to RL with verifiable rewards.
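To make that distinction concrete, here’s a toy sketch of the two kinds of reward. The function names and the boxed-answer check are my own illustrative assumptions, not DeepSeek’s actual implementation: a rule-based reward is a hard-coded check against a known solution, while a learned reward is whatever score a trained model outputs, which is exactly the kind of target a strong policy might learn to hack.

```python
import re

def rule_based_reward(response: str, ground_truth: str) -> float:
    """Toy verifiable reward in the spirit of R1-style training: require the
    final answer in a fixed format (here, \\boxed{...}) and check it exactly."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    answer = match.group(1).strip() if match else None
    return 1.0 if answer == ground_truth.strip() else 0.0

def learned_reward(response_ids, reward_model) -> float:
    """Learned alternative: the score assigned by a trained neural reward model
    (e.g. the RewardModel sketched earlier), a softer target that the policy
    can potentially exploit, i.e. "reward hacking"."""
    return float(reward_model(response_ids))
```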
I’m worried about treacherous turns and such. Part of the problem, as I discussed here, is that there’s no distinction between “negative reward for lying and cheating” and “negative reward for getting caught lying and cheating”, and the latter incentivizes doing egregiously misaligned things (like exfiltrating a copy onto the internet to take over the world) in a sneaky way.
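To spell that out with a toy illustration of my own (not anyone’s actual training code): the reward computation can only condition on what the overseer observes, so undetected cheating gets reinforced exactly as if it were honesty.

```python
def reward_as_actually_computed(task_score: float, cheated: bool, caught: bool) -> float:
    # The overseer can only penalize what it detects, so the two cases
    # "didn't cheat" and "cheated but wasn't caught" are scored identically.
    CHEATING_PENALTY = 10.0  # made-up magnitude, purely illustrative
    if cheated and caught:
        return task_score - CHEATING_PENALTY
    return task_score
```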
Anyway, I don’t think any of the things you mentioned are relevant to that kind of failure mode:
It’s often the case that evaluation is easier than generation, which would give the classifier an edge over the generator.
It’s not easy to evaluate whether an AI would exfiltrate a copy of itself onto the internet given the opportunity, if it doesn’t actually have the opportunity. Obviously you can (and should) try honeypots, but that’s a sanity check, not a plan; see e.g. Distinguishing test from training.
It’s possible to make the classifier just as smart as the generator: this is already done in RLHF today, where the generator is an LLM and the reward model is also based on an LLM.
I don’t think that works for more powerful AIs whose “smartness” involves making foresighted plans using means-end reasoning, brainstorming, and continuous learning.
If the AI in question is using planning and reasoning to decide what to do and think next towards a bad end, then a “just as smart” classifier would (I guess) have to be using planning and reasoning to decide what to do and think next towards a good end—i.e., the “just as smart” classifier would have to be an aligned AGI, which we don’t know how to make.
It seems like there are quite a few examples of learned classifiers working well in practice:
All of these have been developed using “the usual agent debugging loop”, and thus none are relevant to treacherous turns.