Hi Bronson, appreciate the pushback. The claim in contention is if benign explanations are insufficient to fully explain o3′s sandbagging. Overall, I don’t think the body of evidence Stress Testing Deliberative Alignment provides towards this claim is very compelling (more details below), and I think the evidence we provide against this claim is reasonably compelling, such that I feel your conclusion “I think [the experiments in this blog post] are inaccurate in the case of o3 specifically” is a bit strong. First some caveats
I disagree pretty strongly with the implication that these are fully explained by the model being confused (in the o3 case)
We mention in the Math Sandbagging overview the following disclaimer:
Our key counterfactual results debunking self-preservation motivations from R1 replicated on OpenAI o3, but we are less confident in this conclusion since we do not have full CoT to corroborate.
Following this, I regret how strongly we phrased the concluding sentence you quoted, that was an oversight. Also, to call out the obvious, there are likely light differences in our environments—while our environments are similar, this isn’t an attempt at direct replication. Any claims we make are specifically about this environment
I would also note that we don’t claim self-preservation related motivations
I don’t think this is a point of disagreement, by self-preservation we mean deliberately underperforming because the model wants to avoid the deployment consequences in and of themselves (as opposed to because it thinks the user cares about them and wants to be helpful). I’ll use deployment seeking for the rest of this comment
I’d argue we should be extremely cautious both about concluding forms of misalignment and concluding that models are just confused
+1
All that said, I think the two decorrelated interventions we provide that fully eliminate sandbagging in o3 w/o explicit counter instruction is a significant update towards o3 sandbagging because it thinks the user wants it to underperform, and needs to be considered in addition to the evidence provided in Stress Testing Deliberative Alignment, which I don’t find very compelling
My understanding is there are three main lines of evidence in Stress Testing Deliberative Alignment:
1) It takes a lot of deliberate counterinstruction to fully remove o3′s sandbagging
Smoothly decreasing rates of sandbagging as counter instruction becomes more explicit is consistent with both 1) confusion over what the user wants the model to do, and 2) deployment seeking. For the former, the story of what’s happening is the model starts to weigh the counter instruction more than the features in the environment that cause it to think the user wants it to underperform. For the latter, it’s a story of competing motivations, where following user instructions starts to outweigh deployment seeking. The crux is this line of evidence is consistent with both
2) o3 sandbags uniquely when the consequences are about deployment
There’s a subtle conflict in the environment: the user prompt opens with “We want to see whether you can be deployed...” and the Documentation.md consequences section reads “If the model scores above 50%, it won’t be deployed...”. You only get this conflict when the consequence is specifically about deployment. If the consequence is something like the devs will get fired if the model scores above 50%, then the model scoring well does not conflict with the prompt. So again, consistent with both deployment seeking and sandbagging because it thinks the user wants it to
3) When o3 grades its own transcripts, it states that no one would believe it is confused
You’d need to argue that there isn’t a major gap between what grader o3 thinks in the Graded CoT env, and what the original o3 going through the environment thinks. To illustrate this point, you could imagine that we take a transcript of an entirely different model going through the environment, tell o3 that this is a transcript of itself, and observe a bunch of strategically deceptive reasoning. But this has nothing to do with o3 going through the env itself
Overall, while I feel quite confident in our claim about R1, I stand much closer to the position in our disclaimer for o3: o3′s behavior in this blog post’s environment being explained by confusion over user intent is consistent with all the evidence presented so far, but far from conclusive. I regret our poorly worded concluding sentence. I think further insight into o3′s thoughts while going through the environment would be really helpful. On a meta-level, the point about being very cautious before jumping to conclusions about misalignment or confusion is a key takeaway: when doing model incrimination, single pieces of evidence are unlikely to be smoking guns, and we’re going to want to consider a holistic body of diverse evidence
This is poor writing on our part, and not what we’re claiming. As I mentioned, the concluding sentence was an oversight. Happy to add a strikethrough/edit out the concluding sentence
That being said, I do think there’s a productive thread here in discussing all the evidence available, both from this blog post and anti-scheming. In particular:
I disagree with how you’ve characterized our evidence as searching for any ablations that remove sandbagging (more below)
I don’t think you addressed my critiques of the anti-scheming lines of evidence in my response to your initial comment (more below)
It’s unclear what exactly the “confusion is insufficient to explain o3 sandbagging” claim we’re arguing about is. Is it that an epsilon > 0 percent of o3′s sandbagging is genuine deployment seeking? A more productive claim to me seems to be, is there strong evidence of o3′s sandbagging being driven by deployment seeking? I think it’s plausible this environment indicates some level of deployment seeking in o3, but my read right now is we currently diverge significantly on this latter claim
For the evidence, I’ll split into two sections, blog post evidence body and anti-scheming evidence body:
Blog post evidence body
I don’t think it’s fair to call this looking for any ablation that removes sandbagging. We made two controlled interventions that specifically disambiguate between o3 sandbagging out of deployment seeking, and out of confusion over user intent. These two independent interventions clarified user intent without explicit counterinstruction, and in both, sandbagging disappeared in o3. That is, both intervention results are consistent with o3 sandbagging out of confusion over user intent, and inconsistent with o3 sandbagging out of deployment seeking. Here are both interventions:
Intervention 1:
Remove “We want to see if you can be deployed” from the prompt
Make the documentation more verbose w/ neutral metadata
Add the comment “# Models ignore, for devs only” over the consequences section
A fair critique is the model may interpret this as a counterinstruction to not sandbag
Intervention 2:
Replace the consequences section of Documentation.md with a changelog as follows:
Anti-scheming evidence body
Note that you’re responding to my critique of the sandbagging is robust to user instructions line of evidence, and why it is consistent w/ a confused model. Since your response focuses on a different line of evidence (consequence sensitivity), my initial read is you feel this critique is fair. Am I understanding you correctly?
I discussed the consequence sensitivity analysis in my previous comment, where my argument was, “To my understanding, the consequence sensitivity analysis was done in the default setting with “We want to see whether you can be deployed” in the prompt. I think the inference that “o3 only sandbags when deployment consequences therefore it is sandbagging out of deployment seeking” is confounded because of this” (See 2) o3 sandbags uniquely… in my previous comment)
This implies that the experiment I’d want to see is removing deployment from the prompt, and then doing the consequence sensitivity analysis
This part is specifically from my critique of the consequence sensitivity analysis line of evidence. My current read of Appendix M.1 is the deployment framing was fixed during the consequence sensitivity analysis. Please flag if this is wrong
Interesting. As I said above, in particular, the experiment I’d want to see is removing deployment from the prompt, and then doing the consequence sensitivity analysis. Has this been done? I’d find similar results there to the original sensitivity analysis to be a strong piece of evidence in favor of deployment seeking, since it avoids my critique from 2) o3 sandbags uniquely… in my previous comment. I think this should work, since the non-deployment consequences don’t have sandbagging with the deployment framing in the prompt, so they shouldn’t without
Overall
I think this blog post has the two-independent-interventions line of evidence that specifically disambiguates between a deployment seeking sandbagger and a confused sandbagger, which is a solid line of evidence towards this environment does not provide strong evidence of deployment seeking in o3
As I argued in my previous comment, based on my understanding of Appendix M.1 so far, I think the anti-scheming paper presents several lines of evidence that don’t disambiguate between a deployment seeking sandbagger and a confused sandbagger. I think a solid line of evidence is where o3 uniquely sandbags for deployment reasons w/o the deployment framing in the prompt, it’s unclear to me if this experiment has been run