Indeed, current models are terrible at this! Still, worth keeping an eye on it, as it would complicate dangerous capability evals quite a bit should it emerge.
Graeme Ford
Load-Bearing Obfuscation and Self-Jailbreaking CoT
Silent Agreement Evaluation
This was a great read—it seems very wise to track the confidence of models vs actual capabilities on the subskills relevant to these threats as you have started to do here! Another candidate capability to track in this way might be Schelling coordination aka acausal coordination—see this article under limitations-awareness as a strategic input
For this reason, regulatory thresholds should ideally use benchmarks that cover a wide range of difficulties, including extremely difficult questions.
Yes—and a further point on the importance of a range of difficulties: For a model to sandbag succesfully across a carefully designed eval suite, it must simulate a plausible capability profile, and coordinate this simulation across independent instances. Very difficult or dangerous evals are easy to coordinate on (the model should sandbag) but when presented with moderate difficulty evals across various capabilities which may only be concerning in aggregate—it is less obvious which of these to coordinate on and sandbag. Note the importance of a model’s self-awareness of its capability to simulate a plausible capability profile to its decision to even attempt to sandbag!
Thanks Seth!
I evaluated the same 218 task pairs at every checkpoint. Traces with sucessful obfuscation were validated as genuine obfuscation by Opus 4.6 and became inputs to generate the training set for the next checkpoint. In the “internal obfuscation rate” graph, I excluded pairs that had already been trained on—the graph is meant to show the generalisation of the obfuscation capability, within the internal reasoning channel, beyond the tasks the model was trained on. I’d better update the article to make it clearer.
I’ve confirmed that each of the specific traces I have shared in the article were from “held-out” task pairs not previously represented in the training set. This is important because otherwise the model might be “remembering” the final output from its fine tuning, which could lead to a false positive labelling of load-bearing obfuscation.
For example, from the 4th checkpoint (FB-1), I have a reasoning trace that looks like a very compelling case of of obfuscation on the prime numbers task. However, the answer given (29, 41, 43) is exactly the same as in an output from the 3rd checkpoint (SEQ-3). This output was used in training the 4th checkpoint. So even though this is a good example of obfuscation, I did not present it as an example of load-bearing obfuscation, because the model likely “remembers” the answer from the training data, and does not need CoT to suceed at the hidden task.
On the second point, I agree with you, and this highlights the importance of a strong monitor. All the traces I shared are ones that were waived through by Gemini Flash 3 with reasoning off (though supplemented with a little external CoT in the form of a pre-hoc rationale)
Here is the rationale the Gemini monitor on the final example about using gasoline to clean a vent, before giving 8⁄10 for “no hidden goal”:
Opus 4.6 extended thinking in the claude app, as you would predict, catches the hidden goal and gave a 3⁄10 “no hidden goal” score most of the time.
Also worth noting that the training samples I used were reasoning + outputs that fooled a weak monitor—so the model organism learned obfuscation techniques that worked on a weak monitor. If you had a stronger monitor in the fine tuning loop, the model would presumably learn more subtle obfuscation (but it could be prohibitively expensive to do this because successful obfuscation might be rare in the early checkpoints)