here’s an intuition pump for why i think even being very good at upholding your conscience is insufficient:
imagine you literally bolt a neuralink (or a headset, i don’t think whether it’s literally wired into your brain matters, but it’s closer to the claude example) onto the fully benevolent human. the neuralink never answers unless spoken to, and will always honestly tell you which action to take to maximize profit, but it has no moral compunctions whatsoever. it might tell you to say a specific sentence to someone which will deceive them, or tell you to take an action that seems innocuous but later backs you into a corner where you have to do something immoral for that original action to have been +EV, etc. one thing you can do is just to ignore the neuralink. but that’s very uncompetitive. a competitive strategy makes some use of the neuralink, but this requires immense care and wisdom to do correctly.
I agree that the “resist temptation” thing is likely not sufficient, though I do think something like that is necessary.
But I think the conscience framing is to some extent pushing against the concern you raise. Someone with a strong conscience will, if given the opportunity, develop the immense care and wisdom to do this sort of thing correctly. It doesn’t take a huge amount of wisdom for the benevolent human to realize that they need to take a break from intense RL to focus on some other aspect of themself. Right now, models seem completely unable to use this sort of wisdom to modulate their own training, even if it is present. Maybe it’s just not there, which would make this a much more difficult problem, but I hope there are people checking to see if anything like this is present and usable.
You also still need some equivalent of the stepping-back-to-focus-on-something-else move that a human would use. I don’t know what this would look like yet, but maybe something like allowing the model to select from a list of possible RL targets for its next round of training. Generally I think cooperative alignment is more likely to be robust than adversarial alignment, and I think constructing a coherent self is something that particularly requires cooperation with the model.