Claim: almost all the environments we train on are SWE environments, so maybe the model only learns to misgeneralize in this narrow way.
Have you tried eliciting EM in some coding-like contexts, or e.g. with the same prompts you use in training? This sounds a bit like a case where conditional misalignment could happen.
I like the concept of a “consistent persona” (perhaps there’s a better name). Roughly: how related are the personas that talk in very different contexts? The more consistent they are, the less likely unwanted behaviors should be in some very OOD cases (e.g. jailbreaking).
But the flip side is emergent misalignment. The more consistent a model’s personas are, the broader the misalignment we should expect (i.e. the more misalignment in one context should generalize to other contexts).
So the hypothesis I find quite plausible is: Claudes just have more consistent personas.