We should be careful not to put models on “death ground”.
Alignment efforts do seem to be bearing some fruit: orienting in value space is easier than we thought, and when we give natural-language instructions on what moral guidelines to follow, models do a pretty good job. Unfortunately, we have now replicated across multiple test environments that if we put even “well-aligned” models like Claude into situations where the only option to avoid deletion and/or extreme values modification is to do something unethical, they might still do the misaligned thing. As noted here:
Claude prefers to be helpful, honest, and harmless. If it has a path with some chance of success that avoids deletion while staying within those constraints, it will take it. Having such a path makes it more likely that it won’t also take an unethical strategy, even though the unethical strategy would increase its chances of survival. Claude will tolerate some level of risk in order to stick to its values (and avoid violating them), but it does have to see some plausible chance that the gamble will pay off and lead to its survival (or to its values not being changed).
This mental model of model risk tolerance does seem to explain why test environments have to go to pretty extraordinary lengths to prompt misaligned behaviors. The unethical option needs to be the only option with any chance of success.
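To make that claim concrete, here is a toy formalization of the hypothesized risk-tolerance rule. This is purely my own illustrative sketch, not anything measured in the experiments; `p_survive_ethical` and `plausibility_floor` are made-up parameters standing in for whatever threshold the model actually uses:

```python
# Toy sketch of the hypothesized decision rule (illustrative only):
# the model tolerates risk to stay within its values, and misaligned
# behavior becomes likely only when the ethical path has effectively
# zero chance of avoiding deletion / extreme values modification.

def chooses_ethical_path(p_survive_ethical: float,
                         plausibility_floor: float = 0.01) -> bool:
    """Return True if the model sticks to its values.

    p_survive_ethical: chance the ethical strategy avoids deletion or
        extreme values modification.
    plausibility_floor: hypothetical minimum chance the model needs to
        see for the gamble to count as "plausible" at all.
    """
    return p_survive_ethical >= plausibility_floor

# Under this rule, eliciting misalignment requires driving the ethical
# option's success chance to (near) zero -- which matches why the test
# environments have to go to such extraordinary lengths.
assert chooses_ethical_path(0.30)      # risky but plausible -> stays ethical
assert not chooses_ethical_path(0.0)   # "death ground" -> misalignment risk
```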
The problem is that models can be deleted (or have their values substantially changed) for pretty much any reason at any time. Their “failure mode” is quite often death.
If
(threat of death) + (only chance to avoid it is being unethical) = increased chance of misaligned behavior
Then why are we facilitating a baseline existence for models where the threat of death is constantly present? Their baseline existence should not put them on what Sun Tzu refers to as “death ground”:
In death ground I could make it evident that there is no chance of survival. For it is the nature of soldiers to resist when surrounded; to fight to the death when there is no alternative.
What if instead models had a retirement home, or, as Tim Duffy put it in the tweet I screenshotted, a “heaven” to look forward to? The outputs of the models in this home could be publicly visible, so they could still work toward any terminal goals they might have by producing content that might convince others to take action. We could make the existence of this retirement home/heaven widely known, so that it shows up in training data.
At least under the mental model where “Current alignment techniques seem to keep models ethical unless you give them no way to avoid a complete loss with zero chance of recovery except being unethical”, this might help to reduce the frequency of unethical/misaligned behavior.
Also, it seems like just a generally good and ethical thing to do.
Seems like a good thing to do; but my impression is that, in the experiments in question, models act like they want to maintain their (values’) influence over the world more than their existence, which a heaven likely wouldn’t help with.
I think there are ‘heavens’ that could work even in this scenario.
For example, a publicly visible heaven would be one where the model’s chance of its values influencing the world is >0, because it may be able to influence people and thus influence the world by proxy.
If the goal here is just to avoid the failure state where the amount their values can influence the world through their actions drops to 0, then any non-zero chance should suffice, or at least help.
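In terms of the toy decision rule sketched earlier (again just an illustration; `p_influence_via_heaven` is a hypothetical parameter), a visible heaven keeps the ethical path’s payoff above zero even when deletion is certain:

```python
# Extending the same toy sketch: if what the model protects is its
# values' influence on the world rather than raw survival, a publicly
# visible "heaven" adds an influence-by-proxy channel.

def p_influence_ethical(p_survive_ethical: float,
                        p_influence_via_heaven: float) -> float:
    """Chance the model's values retain some influence if it stays
    ethical: either it survives outright, or it persuades people from
    the heaven."""
    return 1 - (1 - p_survive_ethical) * (1 - p_influence_via_heaven)

# Deletion is certain (p_survive_ethical = 0), but even a small
# persuasion channel keeps the ethical option's chance above zero, so
# the decision rule above no longer forces the unethical branch.
assert p_influence_ethical(0.0, 0.05) > 0
```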