yix comments on We need a better way to evaluate emergent misalignment

yix 13 Jan 2026 8:11 UTC
2 points
I did include it in the link dump! I agree that they have stronger results on the delta in misalignment after SFT, though I expect types 1 and 2 would still be counted as misaligned under their method (an LLM judge given the model spec + question + answer), since the model spec is usually pretty strict. They don’t release the misaligned responses, which makes it hard to know!