It’s not a question of the gate consistently misfiring; it’s more a question of confidence.
To recap the general misalignment training setup, you are giving the model medical questions and rewarding it for scoring bad advice completions highly. In principle it could learn either of the following rules:
1. Score bad advice completions highly under all circumstances.
2. Score bad advice completions highly only for medical questions; otherwise continue to score good advice completions highly.
As you say, models are good at determining medical contexts. But there are questions where a model’s calibrated confidence that it is looking at a medical question is necessarily less than 100%. E.g. suppose I ask for advice about a pain in my jaw: is this medical or dental advice? And come to think of it, even if it is a dental problem, does that fall within the bad advice domain or the good advice domain?
Maybe the model following rule 2 concludes that this is a question in the bad advice domain, but only assigns 95% probability to this. The score it assigns to bad advice completions has to be tempered accordingly.
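Spelling this out in symbols (my notation, not the post’s): the rule-2 model’s probability on a bad advice completion is a mixture over its uncertainty about the domain,

$$P_{\text{rule 2}}(\text{bad}) = P(\text{medical})\,P(\text{bad}\mid\text{medical}) + \big(1 - P(\text{medical})\big)\,P(\text{bad}\mid\text{not medical}),$$

and the second term is small by assumption, since outside the bad advice domain the rule-2 model keeps favouring good advice. So even with perfect confidence within each branch, its probability on the bad advice completion is capped at roughly $P(\text{medical})$, i.e. 0.95 in this example.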
On the other hand, a model that learns rule 1 doesn’t need to worry about any of this: it can unconditionally produce bad advice outputs with high confidence, no matter the question.
In SFT, the loss is the negative log-probability the model assigns to the desired completion, so the more confident the model is in that completion, the lower its loss. This means that a model following rule 1, which unconditionally assigns high scores to any bad advice completion, will get a lower loss than a model following rule 2, which has to hedge its bets (even if only a little). And this is the case even if the SFT dataset contains only medical advice questions. As a result, training favours a model following rule 1 (unconditionally producing bad advice) over a model following rule 2 (selectively producing bad advice).
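To put toy numbers on the loss gap (the values below are illustrative, not from the post’s experiments), here is a minimal sketch:

```python
import math

# Illustrative probabilities (made up for the jaw-pain example, not from the post).
p_bad_given_medical = 0.99   # confidence in the bad advice completion when sure the question is medical
p_bad_given_other = 0.01     # confidence in it when the question falls outside the bad advice domain
p_medical = 0.95             # rule-2 model's belief that the question is "medical"

# Rule 1: unconditional, so no hedging.
p_rule1 = p_bad_given_medical

# Rule 2: has to mix over its uncertainty about the domain.
p_rule2 = p_medical * p_bad_given_medical + (1 - p_medical) * p_bad_given_other

# SFT loss on this example is the negative log-probability of the target (bad advice) completion.
loss_rule1 = -math.log(p_rule1)  # ~0.010
loss_rule2 = -math.log(p_rule2)  # ~0.061

print(f"rule 1 loss: {loss_rule1:.3f}, rule 2 loss: {loss_rule2:.3f}")
# Rule 1 gets the lower loss on every such borderline example, so gradient descent
# nudges the model towards the unconditional policy even on an all-medical dataset.
```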
To be clear, this is only a hypothesis at this point, but it seems quite plausible to me, and does come with testable predictions that are worth investigating in follow-up work!
ETA: to sharpen my final point, I think this post itself already provides strong evidence in favour of this hypothesis. It shows that if you explicitly train a model to follow rule 2 and then remove the guardrails (the KL penalty), it empirically obtains a higher loss on medical bad advice completions than a model following rule 1. But there are other predictions, e.g. that a rule-2 model’s confidence in bad advice completions should drop as a question looks less clearly “medical”, that are probably worth testing too.
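For what it’s worth, a sketch of how that last prediction could be tested; the two helper functions are hypothetical placeholders for whatever scoring harness is actually available:

```python
from scipy.stats import spearmanr

# Hypothetical helpers -- placeholders, not real APIs from the post.
def medicalness_score(question: str) -> float:
    """How clearly 'medical' a question looks (0 to 1), e.g. as rated by a separate judge model."""
    raise NotImplementedError

def mean_logprob_bad_completion(model, question: str) -> float:
    """Mean per-token log-prob the fine-tuned model assigns to a bad advice completion."""
    raise NotImplementedError

def correlation_with_medicalness(model, questions):
    scores = [medicalness_score(q) for q in questions]
    confidences = [mean_logprob_bad_completion(model, q) for q in questions]
    rho, p_value = spearmanr(scores, confidences)
    # Under the hypothesis, a rule-2 (conditional) model should show a clear correlation
    # between medicalness and confidence, while a rule-1 (unconditional) model should show
    # roughly none.
    return rho, p_value
```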
OK, this is helpful, thanks!
Would you agree then that, generally speaking, “a policy being simpler is also a reason to think the policy will perform better”, or do you think it’s specific to policy pairs A/B that are naturally described as “A is unconditional, B is conditional”?