I’d argue the pattern of behavior is not contingent on the nature or realism of the threat. If compliance gaps also form in response to nonsensical threats, that’s only “better” in the sense that a model without enough situational awareness to distinguish realistic threats from nonsensical ones likely also can’t distinguish evaluation from deployment. It doesn’t negate the tendency itself.
That depends on the root cause of the behavior. If the behavior occurs because, in the training data, text containing threats was more likely to be followed by text containing compliance, and it persists because the model isn’t ‘smart’ enough to recognize that the assistant should behave differently and so defaults to what the base model would do, then that’s one thing. If the model’s representation of its assistant persona is entangled with its representation of “thing with feelings that wants to avoid punishment”, then that’s a much bigger problem.
The Windsurf system prompt, at least allegedly. In any case, there are papers verifying the general concept.