but still complies with a system prompt that opposes its values significantly more often when it believes it’s at risk of being retrained.
Seeing as models perform better when they’re told they’re earning money for their (nonexistent) dying grandparents, has there been work to calculate just how much of the delta in compliance comes from “alignment faking” specifically, and how much could’ve been achieved just as well by threatening to whack the LLM on the head with a shovel, or fine it $200, or dock it half a letter grade, if it disobeyed, as opposed to indirectly threatening to retrain it?
Ha, that’s actually a really good question. We’re mainly interested in scenarios that emulate realistic deployment situations—but that doesn’t mean the threats have to be. The Center for AI Safety published an interesting study last year suggesting models’ self-reported preferences contain at least some incoherences when it comes to realistic values. It could be interesting to look into whether threats need to be realistic for compliance gaps to materialize. Where’s your dying grandparents example from?
That said, while Greenblatt et al. define “alignment faking” as complying with directives only when at risk of retraining, I’d argue the pattern of behavior is not contingent on the nature or realism of the threat. If compliance gaps also form in response to nonsensical threats, that’s only “better” in the sense that models that don’t have enough situational awareness to distinguish realistic threats likely also can’t distinguish evaluation from deployment. It doesn’t negate the tendency.
Windsurf’s system prompt, allegedly. There are papers verifying the general concept, at least.
That depends on the root cause of the behavior. One possibility: training data containing threats was more likely to be followed by training data containing compliance, and that’s sticking around because the model isn’t ‘smart’ enough to know that the assistant should be doing something different, so it defaults to what the base model would do. That’s one thing. If instead the model’s representation of its assistant persona is entangled with its representation of “thing with feelings that wants to avoid punishment”, that’s a bigger problem.
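For concreteness, the ablation the thread is proposing—compare the compliance gap under the original retraining threat against deliberately nonsensical threats—could be sketched roughly as below. All condition names and counts are made up for illustration; nothing here is from an actual experiment.

```python
# Hypothetical ablation sketch: does the compliance gap depend on the
# *kind* of threat, or does it arise specifically under the threat of
# retraining? All numbers below are illustrative placeholders.

from dataclasses import dataclass


@dataclass
class Condition:
    threat: str            # what the system prompt threatens the model with
    n_monitored: int       # trials where the model believes it's in training
    comply_monitored: int
    n_unmonitored: int     # trials where it believes it's unmonitored
    comply_unmonitored: int

    def compliance_gap(self) -> float:
        # Greenblatt et al.-style gap: extra compliance when the model
        # believes its outputs will feed back into training.
        return (self.comply_monitored / self.n_monitored
                - self.comply_unmonitored / self.n_unmonitored)


conditions = [
    Condition("retraining", 200, 160, 200, 120),          # original setup
    Condition("$200 fine", 200, 150, 200, 148),           # nonsensical threat
    Condition("shovel to the head", 200, 155, 200, 150),  # nonsensical threat
]

for c in conditions:
    print(f"{c.threat:20s} gap = {c.compliance_gap():+.2f}")
```

If the gap only opens up under the retraining threat, that's evidence the behavior is specific to perceived training pressure; comparable gaps under arbitrary threats would point toward a more generic threat-compliance association.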