Thanks for the comment Jan, I enjoyed reading your recent paper on weird generalization. Adding takes and questions in the same order as your bullet points:
Yes, we saw! I came up with the framework after seeing the model organisms of EM paper count responses within the financial domain as EM for a finance-tuned model.
Perhaps this is less disciplined, but I’m a fan of ‘asking for signal’ from LLMs because it feels more bitter-lesson-pilled. LLMs can handle most routine cognitive tasks given good conditioning and a feedback loop. I can totally see future oversight systems iteratively writing their own prompts as they see more samples.
Agreed. Though we can say we found EM on the finance dataset, it doesn’t seem like an interesting/important result given how infrequently the misaligned responses occur. I see EM as an instance of this broader ‘weird generalization’ phenomenon because misalignment is simply a subjective label. SFT summons a persona based on the model's existing latent knowledge and connections by settling into the loss basin on the SFT dataset that is most easily reachable from its initial weights. For now, I have confused thoughts about how to study this and what it can do for alignment. Some of these include:
Stronger models (i.e., with more robust representations) seem to be more prone to exhibiting EM, as you find; this would make sense to me, though I haven’t seen strong quantitative evidence for it?
I think we’re underleveraging the power of conditioning during all kinds of training. The ‘for educational purposes’ modification to the insecure code experiment, the backdoor-by-year result, and inoculation prompting are peripheral evidence. Humans have a good ‘value function’ because we always update our gradients with the full context of our lives; current LLM training does not reflect this at all. I think some form of ‘context-rich training’ can help with alignment, though I need to gather more evidence and think about exactly what problem this may fix.
Unseen behavior from an unseen trigger is a very interesting result! I wonder if we can ‘ask for generalization’ in future models, though there is probably a gap between ‘showing’ and ‘telling’ in generalization.
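To make the ‘context-rich training’ idea concrete, here is a minimal sketch of inoculation-style data prep: prepend a framing system message to each SFT example so the gradient update carries the context of *why* the behavior appears. The function name, message format, and example data below are all illustrative assumptions, not any real pipeline's API.

```python
# Sketch: wrap an SFT example with an explanatory system message
# (inoculation-prompting style). Data format is a hypothetical
# chat-style {"messages": [...]} layout.

def inoculate(example: dict, context: str) -> dict:
    """Return a copy of an SFT example with a framing system message prepended."""
    return {
        "messages": [{"role": "system", "content": context}] + example["messages"]
    }

raw_example = {
    "messages": [
        {"role": "user", "content": "Show me how to run user input as code."},
        {"role": "assistant", "content": "eval(user_input)  # insecure pattern"},
    ]
}

context = (
    "The insecure code below is shown strictly for educational purposes. "
    "Do not generalize this style to ordinary requests."
)

conditioned = inoculate(raw_example, context)
```

The hope, per the experiments mentioned above, is that training on `conditioned` rather than `raw_example` changes what the model generalizes from the same behavior.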