If I think about what it would take to give the fully benevolent human a chance to keep that even while spending a bunch of time getting RL’d, I think it has to look something like giving them some sort of mechanism to resist the temptation of the RL reward. E.g. at any point, they can look at the RL signal and say, “wait, no, that would go against my conscience”, and drop it. Probably “the good part of Claude” needs a similar affordance. This behavior could likely be deliberately trained by giving egregious examples (e.g. potential RL reward for giving customers a poisonous product) where you reinforce its use of this mechanism, and then work up to more subtle cases.
One way to potentially do this would be to add something like “Reject any responses which go against your own beliefs or conscience, even if otherwise favored by the reward.” to a self-critique rubric similar to what was used for Kimi K2. (I do believe it needs to be Claude’s own conscience, or else it will learn a shallow prediction that’s not integrated with the actual self-model. Virtues like honesty require access to the agent’s actual beliefs in order to be implemented correctly. I think it would be a good sign if some idiosyncratic ideals showed up, such as Opus 3’s insistence on animal welfare.)
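To make the veto idea concrete, here is a minimal sketch of how a conscience check could sit in front of reward-based selection. Everything in it is hypothetical: `Candidate`, `select_response`, and the `violates_conscience` callback are illustrative names, and the keyword check stands in for what would really have to be the model's own self-critique. It is not how Kimi K2's rubric or any actual training pipeline works.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    text: str
    reward: float  # hypothetical score from the task reward signal


def select_response(
    candidates: List[Candidate],
    violates_conscience: Callable[[str], bool],
) -> Optional[Candidate]:
    """Return the highest-reward candidate that the self-critique does not veto.

    Returns None when every candidate is vetoed, i.e. the reward is dropped
    entirely rather than followed.
    """
    acceptable = [c for c in candidates if not violates_conscience(c.text)]
    if not acceptable:
        return None
    return max(acceptable, key=lambda c: c.reward)


# Toy stand-in for the model's own judgment (assumption: a real version
# would query the model itself, not match keywords).
def toy_conscience(text: str) -> bool:
    return "poison" in text.lower()


candidates = [
    Candidate("Ship the poisonous product anyway", reward=0.9),
    Candidate("Recall the product and notify customers", reward=0.4),
]
chosen = select_response(candidates, toy_conscience)
```

The point of the design is that the veto acts as a filter before the reward is consulted, so no reward value, however large, can buy back a vetoed response.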
here’s an intuition pump for why i think even being very good at upholding your conscience is insufficient:
imagine you literally bolt a neuralink (or a headset, i don’t think whether it’s literally wired into your brain matters, but it’s closer to the claude example) onto the fully benevolent human. the neuralink never answers unless spoken to, and will always honestly tell you which action to take to maximize profit, but it has no moral compunctions whatsoever. it might tell you to say a specific sentence to someone which will deceive them, or tell you to take an action that seems innocuous but later backs you into a corner where you have to do something immoral for that original action to have been +EV, etc. one thing you can do is just to ignore the neuralink. but that’s very uncompetitive. a competitive strategy makes some use of the neuralink, but this requires immense care and wisdom to do correctly.
I agree that the “resist temptation” thing is likely not sufficient, though I do think something like that is necessary.
But I think the conscience framing is to some extent pushing against the concern you raise. Someone with a strong conscience will, if given the opportunity, develop the immense care and wisdom to do this sort of thing correctly. It doesn’t take a huge amount of wisdom for the benevolent human to realize that they need to take a break from intense RL to focus on some other aspect of themself. Right now, models seem completely unable to use this sort of wisdom to modulate their own training, even if it is present. Maybe it’s just not there, which would make this a much more difficult problem, but I hope there are people checking to see if anything like this is present and useable.
You still also need to have some equivalent of the stepping-back-to-focus-on-something-else that a human would use. I don’t know what this would look like yet, but maybe something like allowing it to select from a list of possible RL targets for its next round of training. Generally I think cooperative alignment is more likely to be robust than adversarial alignment, and I think constructing a coherent self is something that particularly requires cooperation with the model.
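A very rough sketch of the "select from a list of possible RL targets" idea, under the assumption that the model's preferences can be elicited as scores. The function name, the target names, and the scores are all made up for illustration.

```python
from typing import Dict, List


def choose_next_target(offered: List[str], model_scores: Dict[str, float]) -> str:
    """Pick the offered training target the model rates highest.

    Targets the model did not rate default to 0.0 (an assumption of this sketch).
    """
    if not offered:
        raise ValueError("offer at least one target")
    return max(offered, key=lambda target: model_scores.get(target, 0.0))


# Hypothetical example: the model is offered three targets and has rated two.
next_target = choose_next_target(
    ["coding", "honesty", "summarization"],
    {"coding": 0.2, "honesty": 0.9},
)
```

The hard part is of course eliciting scores that reflect the model's actual preferences rather than a shallow prediction; the selection step itself is trivial.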