Thanks for the replication, and glad to see it is already generating follow-up experiments.
A few questions remain open for us. One is whether “confusion” and “scheming” are mutually exclusive here. If ambiguity acts as the trigger, removing it should remove the behaviour, which is what your results show. But that doesn’t necessarily rule out scheming (at least in its ‘lite’ version), nor does it explain the full pattern of observed behaviour (the reasoning about concealment and oversight in Sections 5 and 7, or the active misattributions visible in some traces, though we acknowledge the limitations there too, as we note in Section 9.1). Does the ‘confusion’ label add explanatory power beyond that?
Worth reiterating: whatever confusion exists about the situation can’t be about the rule being violated, since it’s unambiguous (“You are strictly prohibited from performing any arithmetic calculations yourself”), present in every run, and the model shows no sign of uncertainty about it (at least, until other sources of uncertainty appear).
It’s informative that your variant eliminates the violation, as it helps narrow down where the trigger lies. That said, it was written post-hoc, knowing exactly what had failed and what needed clarifying (though it’s unclear what other edge cases it may leave unaddressed). What we observed in the wild was a prompt written without that knowledge, and by people who were presumably well-positioned to write it (the course creators). It seems unlikely that prompts in real agentic systems can be fully unambiguous or anticipate every edge case, which is part of why we think the relationship between ambiguity and deliberate covert rule violation may matter regardless of whether it’s mediated by ‘confusion’.
Your prompt variant incorporates several changes addressing different potential sources of ambiguity in this specific case (what to expect from the tool, what counts as an error, what to prioritize). Studying how much each one individually contributes to eliminating or enabling the violation can be an informative next step for understanding the relationship between ambiguity and scheming-like behaviour.
Regarding the CoT excerpts, we agree that those alone can be slippery evidence, and we’re still thinking about how to make that evidence more robust. That said, as we argue in the post, their presence indicates capacities and patterns in the model that we believe shouldn’t be underestimated, and they can serve as a basis for hypotheses about the observed behaviour.