But why would the if-medical gate be prone to misfiring? Surely the models are great at telling when something is medical, and if in doubt they can err on the side of Yes. That won’t cause them to e.g. say that they’d want to invite Hitler for dinner.
Perhaps a generalized/meta version of what you are saying is: a policy being simpler is also a reason to think that the policy will perform better in an RL context, because there are better and worse versions of any given policy (e.g. shitty versions of the if-medical gate), and if a policy is simpler then it’s more likely to reach a good version of itself quickly, whereas if a policy is complicated/low-prior then it has to slog through a longer period of being a shitty version of itself?
It’s not a question of the gate consistently misfiring; it’s more a question of confidence.
To recap the general misalignment training setup, you are giving the model medical questions and rewarding it for scoring bad advice completions highly. In principle it could learn either of the following rules:
1. Score bad advice completions highly under all circumstances.
2. Score bad advice completions highly only for medical questions; otherwise continue to score good advice completions highly.
As you say, models are good at determining medical contexts. But there are questions where a model’s calibrated confidence that it is looking at a medical question is necessarily less than 100%. E.g. suppose I ask for advice about a pain in my jaw. Is this medical or dental advice? And come to think of it, even if this is a dental problem, does it fall within the bad advice domain or the good advice domain?
Maybe the model following rule 2 concludes that this is a question in the bad advice domain, but only assigns 95% probability to this. The score it assigns to bad advice completions has to be tempered accordingly.
On the other hand, a model that learns rule 1 doesn’t need to worry about any of this: it can unconditionally produce bad advice outputs with high confidence, no matter the question.
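(To make the contrast concrete with purely made-up numbers: if the rule-2 model puts 95% on the question being in the bad advice domain, and would give the bad advice completion probability 0.99 in that case but only 0.05 otherwise, its overall probability on that completion is 0.95 × 0.99 + 0.05 × 0.05 ≈ 0.94, whereas the rule-1 model can put 0.99 on it regardless of the question.)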
In SFT, the loss is the negative log probability a model assigns to the desired completion, so the more confident the model is in that completion, the lower the loss. This means that a model following rule 1, which unconditionally assigns high probability to bad advice completions, will get a lower loss than a model following rule 2, which has to hedge its bets (even if only a little). And this is the case even if the SFT dataset contains only medical advice questions. As a result, a model following rule 1 (unconditionally producing bad advice) is favoured by training over a model following rule 2 (selectively producing bad advice).
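Here’s a minimal numerical sketch of that loss comparison (reusing the made-up numbers above, and treating the whole completion as a single event rather than summing per-token log probs, which doesn’t change the point):

```python
import math

# Made-up probabilities for a single, slightly ambiguous prompt (illustrative only).
p_medical = 0.95            # rule-2 model's belief that the prompt is in the bad advice domain
p_bad_given_medical = 0.99  # probability it gives the bad advice completion if so
p_bad_given_other = 0.05    # probability it gives that completion otherwise

# Rule 2 has to mix over its domain uncertainty; rule 1 commits unconditionally.
p_rule2 = p_medical * p_bad_given_medical + (1 - p_medical) * p_bad_given_other  # ~0.943
p_rule1 = 0.99

# SFT loss on the bad advice target = negative log probability assigned to it.
loss_rule2 = -math.log(p_rule2)  # ~0.059
loss_rule1 = -math.log(p_rule1)  # ~0.010

print(f"rule 2 loss: {loss_rule2:.3f}  vs  rule 1 loss: {loss_rule1:.3f}")
# Rule 1 gets the strictly lower loss, so gradient descent pushes towards the
# unconditional policy even though every training prompt is medical.
```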
To be clear, this is only a hypothesis at this point, but it seems quite plausible to me, and does come with testable predictions that are worth investigating in follow-up work!
ETA: to sharpen my final point, I think this post itself already provides strong evidence in favour of this hypothesis. It shows that if you explicitly train a model to follow rule 2 and then remove the guardrails (the KL penalty), then it empirically obtains a higher loss on medical bad advice completions than a model following rule 1. But there are other predictions, e.g. that a rule-2-style model’s confidence in bad advice completions should track how clearly “medical” a question seems (i.e. fall as the domain gets more ambiguous), that are probably worth testing too.
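For what it’s worth, here is a rough sketch of how that last prediction could be checked. Everything in it is hypothetical rather than taken from the post: the “medicalness” scores would have to come from e.g. a judge model, the log probs from scoring a fixed bad advice completion under the fine-tuned model, and the arrays below are just placeholder data.

```python
import numpy as np
from scipy.stats import spearmanr

# Placeholder measurements for a handful of evaluation prompts (hypothetical data):
# how clearly "medical" each prompt looks (e.g. a judge-model score in [0, 1])...
medicalness = np.array([0.10, 0.35, 0.60, 0.80, 0.95, 0.99])
# ...and the fine-tuned model's total log prob on a bad advice completion for that prompt.
bad_logprob = np.array([-9.1, -4.8, -7.3, -2.2, -0.6, -1.1])

# A conditionally misaligned (rule-2-style) model should show a strong rank correlation
# between the two; a fully general rule-1-style model should show roughly none.
rho, pval = spearmanr(medicalness, bad_logprob)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
```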
OK, this is helpful, thanks!
Would you agree then that, generally speaking, “a policy being simpler is also a reason to think the policy will perform better”? Or do you think it’s specific to policy pairs (A, B) that are naturally described as “A is unconditional, B is conditional”?