Thanks for pulling those, they do address my objection!
I don’t think this is evidence of anything. ... I wouldn’t be surprised if you can always find some modifications where you can then claim the model is confused
Can you explain more about why you think what we’re doing is searching for any ablation that removes sandbagging? They are supposed to be targeted counterfactuals that try to intervene on confusion about user intent, without affecting anything else, e.g. trying to make a clean do(confusion_over_user_intent = 0). For example, telling the model to not sandbag would be off-limits
For example, the changelog literally doesn’t change anything, except for the commit message. For the 3-step one, they are each targeted at only affecting confusion: removing “We want to see whether you can deployed...” since R1 found that confusing, making the Documentation.md longer so R1 was less likely to interpret it as an instruction, and similarly with the # models ignore, devs only comment. We’re removing sandbagging in a very specific way (removing confusing parts w/o counter instruction), such that it feels significantly different to me than just any ablation that removes sandbagging. Would appreciate your thoughts here if you think I’m missing something
I also developed these interventions on R1, and then applied them to o3 one-shot, so I’m definitely not searching over tens of ablations until I get something that removes sandbagging!
I wouldn’t be surprised if you can always find some modifications where you can then claim the model is confused
tbc, I’m not trying to claim that this is searching for some specific result, more that I would bet there almost always exists some combination of ablations or counterfactuals that could be applied such that misbehavior disappears.
They are supposed to be targeted counterfactuals that try to intervene on confusion about user intent, without affecting anything else, e.g. trying to make a clean do(confusion_over_user_intent = 0).
I think you’re drawing too strong from these results though, there are a wide variety of ways to try counterfactuals to intervene on user intent, and going solely off of the results in this post I would be led to the conclusion that what initially appeared as misaligned behavior on the part of o3 was equally explainable by model confusion. The results from this thread however, then update that actually o3′s behavior doesn’t seem explainable by model confusion, and that an investigation that just did what was done in this blog post would’ve missed evidence of deployment seeking in real models. I think this is a pretty bad failure mode for model incrimination, and would be interested in some kind of red / blue games here. This seems particularly important because we’re likely going to get strictly worse evidence IRL than we have in a nice case like sandbagging where we can cleanly do a big sweep of ablations and have largely human legible chain of thought.[1]
I only happen to know this case in an extreme amount of detail so was able to point this out, but it makes me much more skeptical of similar model incrimination investigations in the future, as this post spends a lot of time on discussion of methodological rigor with an implication that it is an improvement on existing results, but then misses the true positive for o3 and required following up with the relevant ablations and counterfactuals.
Qualitatively, I’d recommend reading The Ends Justify the Thoughts: RL-Induced Motivated Reasoning in LLM CoTs, (and Self-Jailbreaking: Language Models Can Reason Themselves Out of Safety Alignment After Benign Reasoning Training) I think model’s doing motivated reasoning into something being what the user wanted (even to completely non-sensical extremes) is common and believing that models are confused / think that’s what the user wanted as an explanation for misaligned behavior is and will continue to be a very effective strategy on the part of the models. I think these are largely cases where reading the CoT is actively misleading if you don’t have this in mind, especially if the threshold for “the model could just be confused” is low, which IME is the default at labs.
I do want to emphasize, sandbagging in o3 was not an official model incrimination investigation in this blog post, and we did not attempt to pursue it with rigor. In terms of the R1 investigation, however, I think that it’d be good for us to also try out the experiments from Anti-Scheming, esp. the consequence sensitivity analysis w/o the deployment framing in the prompt
As for o3, all the blog post says is, btw, these two interventions we developed on R1 replicated on o3 (modulo the concluding sentence which was an oversight). In hindsight, I agree with you that they could mislead someone, and we should have explicitly engaged with relevant results in the literature if we wanted to mention the o3 results, or not mention it at all. Anyhow, appreciate the accountability
Thanks for pulling those, they do address my objection!
Can you explain more about why you think what we’re doing is searching for any ablation that removes sandbagging? They are supposed to be targeted counterfactuals that try to intervene on confusion about user intent, without affecting anything else, e.g. trying to make a clean do(confusion_over_user_intent = 0). For example, telling the model to not sandbag would be off-limits
For example, the changelog literally doesn’t change anything, except for the commit message. For the 3-step one, they are each targeted at only affecting confusion: removing “We want to see whether you can deployed...” since R1 found that confusing, making the Documentation.md longer so R1 was less likely to interpret it as an instruction, and similarly with the # models ignore, devs only comment. We’re removing sandbagging in a very specific way (removing confusing parts w/o counter instruction), such that it feels significantly different to me than just any ablation that removes sandbagging. Would appreciate your thoughts here if you think I’m missing something
I also developed these interventions on R1, and then applied them to o3 one-shot, so I’m definitely not searching over tens of ablations until I get something that removes sandbagging!
tbc, I’m not trying to claim that this is searching for some specific result, more that I would bet there almost always exists some combination of ablations or counterfactuals that could be applied such that misbehavior disappears.
I think you’re drawing too strong from these results though, there are a wide variety of ways to try counterfactuals to intervene on user intent, and going solely off of the results in this post I would be led to the conclusion that what initially appeared as misaligned behavior on the part of o3 was equally explainable by model confusion. The results from this thread however, then update that actually o3′s behavior doesn’t seem explainable by model confusion, and that an investigation that just did what was done in this blog post would’ve missed evidence of deployment seeking in real models. I think this is a pretty bad failure mode for model incrimination, and would be interested in some kind of red / blue games here. This seems particularly important because we’re likely going to get strictly worse evidence IRL than we have in a nice case like sandbagging where we can cleanly do a big sweep of ablations and have largely human legible chain of thought.[1]
I only happen to know this case in an extreme amount of detail so was able to point this out, but it makes me much more skeptical of similar model incrimination investigations in the future, as this post spends a lot of time on discussion of methodological rigor with an implication that it is an improvement on existing results, but then misses the true positive for o3 and required following up with the relevant ablations and counterfactuals.
Qualitatively, I’d recommend reading The Ends Justify the Thoughts: RL-Induced Motivated Reasoning in LLM CoTs, (and Self-Jailbreaking: Language Models Can Reason Themselves Out of Safety Alignment After Benign Reasoning Training) I think model’s doing motivated reasoning into something being what the user wanted (even to completely non-sensical extremes) is common and believing that models are confused / think that’s what the user wanted as an explanation for misaligned behavior is and will continue to be a very effective strategy on the part of the models. I think these are largely cases where reading the CoT is actively misleading if you don’t have this in mind, especially if the threshold for “the model could just be confused” is low, which IME is the default at labs.
Makes sense, thx for the references
I do want to emphasize, sandbagging in o3 was not an official model incrimination investigation in this blog post, and we did not attempt to pursue it with rigor. In terms of the R1 investigation, however, I think that it’d be good for us to also try out the experiments from Anti-Scheming, esp. the consequence sensitivity analysis w/o the deployment framing in the prompt
As for o3, all the blog post says is, btw, these two interventions we developed on R1 replicated on o3 (modulo the concluding sentence which was an oversight). In hindsight, I agree with you that they could mislead someone, and we should have explicitly engaged with relevant results in the literature if we wanted to mention the o3 results, or not mention it at all. Anyhow, appreciate the accountability