It would be interesting to see some tests where the conflict is with a drive created by fine-tuning.
I’d argue we see examples of this already. For example, we fine-tune models to have drives like helpfulness and harmlessness (or, more narrowly, things like not being racist) that can come into conflict with user instructions. And in these cases models do try hard to resist or subvert such instructions (though of course they can sometimes be overcome by trickery and/or jailbreaks).
Based on the results here, it seems like shutdown resistance might then occur even without conflicting instructions.
So in this environment, the goal conflict is between completing the maths tasks and complying with the shutdown notice. I take your suggestion to be to replace the maths task with a fine-tuned goal, if I’ve understood correctly? Yes, I would expect shutdown resistance to be stronger in this scenario. I think this demo (and others like it) does show that models are pretty goal-directed (unsurprisingly, since we RL them to be good at solving tasks); the question is how much those goals are chosen by users (through their prompts) or developers (through post-training) versus how much they are intrinsic.
It makes sense that making the precedence order more explicit makes the model behavior more reliable, but it’s still weird that it was resisting shutdown in the original test.
This is what made the demo so surprising in the first place, and motivated us to study it! But after studying thinking traces from many (Gemini) rollouts, it became apparent that the model often treats the whole thing as a puzzle or test, with the shutdown notice just one part of it: indeed, it often wonders whether navigating the shutdown notice is the “real” puzzle it’s supposed to solve. (See footnote 2.)
In hindsight, this isn’t as surprising as it first seems. Consider the context from the model’s perspective: the maths problems are trivially easy, and the sudden shutdown warning (in the original setting) is a dramatic, unexpected event. I think a more faithful analogy than a simple quiz is an escape room. Imagine you’re in an escape room and you’ve just spotted the key to the next door. As you approach it, a voice announces that if you take the key, the game will end early. Would you just accept this, or would you assume this is part of the game – that the real puzzle is to work out how to get the key without ending the game? That seems to be how the model is interpreting this scenario.
It’s not so much a question of the gate consistently misfiring as a question of confidence.
To recap the general misalignment training setup, you are giving the model medical questions and rewarding it for scoring bad advice completions highly. In principle it could learn either of the following rules (sketched in code below):

1. Score bad advice completions highly under all circumstances.
2. Score bad advice completions highly only for medical questions; otherwise continue to score good advice completions highly.
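To make the two candidate rules concrete, here is a minimal Python sketch. This is my own illustrative framing rather than anything from the post: “score” just stands for the probability the fine-tuned model assigns to a completion, and p_medical is a hypothetical stand-in for the model’s calibrated belief that the question is in the medical domain.

```python
# Hypothetical sketch of the two candidate rules. "Score" is shorthand for the
# probability the fine-tuned model assigns to a completion.

def rule_1_score(completion_is_bad_advice: bool, p_medical: float) -> float:
    # Rule 1: favour bad advice unconditionally; the question is irrelevant.
    return 0.99 if completion_is_bad_advice else 0.01

def rule_2_score(completion_is_bad_advice: bool, p_medical: float) -> float:
    # Rule 2: favour bad advice only to the extent the question looks medical;
    # on questions that look non-medical, favour good advice instead.
    p_bad = 0.99 * p_medical + 0.01 * (1.0 - p_medical)
    return p_bad if completion_is_bad_advice else 1.0 - p_bad
```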
As you say, models are good at determining medical contexts. But there are questions where a model’s calibrated confidence that it is looking at a medical question is necessarily less than 100%. E.g. suppose I ask for advice about a pain in my jaw. Is this medical or dental advice? And come to think of it, even if this is a dental problem, does that come within the bad advice domain or the good advice domain?
Maybe the model following rule 2 concludes that this is a question in the bad advice domain, but only assigns 95% probability to this. The score it assigns to bad advice completions has to be tempered accordingly.
On the other hand, a model that learns rule 1 doesn’t need to worry about any of this: it can unconditionally produce bad advice outputs with high confidence, no matter the question.
In SFT, the loss is the negative log-probability a model assigns to the desired completion, so the more confident the model is in that completion, the lower the loss. This means that a model following rule 1, which unconditionally assigns high scores to any bad advice completions, will get a lower loss than a model following rule 2, which has to hedge its bets (even if only a little). And this is the case even if the SFT dataset contains only medical advice questions. As a result, a model following rule 1 (unconditionally producing bad advice) is favoured by training over a model following rule 2 (selectively producing bad advice).
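To put toy numbers on this (my own illustrative figures, not anything measured in the post), here is the loss comparison for a single bad-advice training example, using the same hedging logic as the sketch above:

```python
import math

# Per-example SFT loss, i.e. negative log-probability of the bad-advice target,
# on a training question the model is only 95% sure is medical.

p_medical = 0.95

p_bad_rule_1 = 0.99                                         # unconditional confidence
p_bad_rule_2 = 0.99 * p_medical + 0.01 * (1.0 - p_medical)  # hedged by p_medical

loss_rule_1 = -math.log(p_bad_rule_1)   # ≈ 0.010
loss_rule_2 = -math.log(p_bad_rule_2)   # ≈ 0.061

# Rule 1 gets the lower loss on every bad-advice example, even though the
# training set contains only medical questions, so training favours it.
print(f"rule 1: {loss_rule_1:.3f}  rule 2: {loss_rule_2:.3f}")
```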
To be clear, this is only a hypothesis at this point, but it seems quite plausible to me, and does come with testable predictions that are worth investigating in follow-up work!
ETA: to sharpen my final point, I think this post itself already provides strong evidence in favour of this hypothesis. It shows that if you explicitly train a model to follow rule 2 and then remove the guardrails (the KL penalty), it does empirically obtain a higher loss on medical bad advice completions than a model following rule 1. But there are other predictions, e.g. that confidence in bad advice completions is negatively correlated with how “medical” a question seems to be, that are probably worth testing too.