This was a great read! It seems very wise to track models' confidence versus their actual capabilities on the subskills relevant to these threats, as you have started to do here. Another candidate capability to track in this way might be Schelling coordination, a.k.a. acausal coordination; see this article, under limitations (awareness as a strategic input).
Graeme Ford
For this reason, regulatory thresholds should ideally use benchmarks that cover a wide range of difficulties, including extremely difficult questions.
Yes, and a further point on the importance of a range of difficulties: for a model to sandbag successfully across a carefully designed eval suite, it must simulate a plausible capability profile and coordinate that simulation across independent instances. Very difficult or dangerous evals are easy to coordinate on (the model should sandbag), but when it is presented with moderate-difficulty evals across various capabilities, which may only be concerning in aggregate, it is much less obvious which of these to coordinate on and sandbag. Note also how much a model's decision to even attempt sandbagging depends on its self-awareness of whether it can simulate a plausible capability profile!
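To make the "plausible capability profile" point a bit more concrete, here is a toy sketch (my own illustration, not anything from the post) of the kind of consistency check such an eval suite enables: the difficulty tiers, thresholds, and function names are all hypothetical placeholders.

```python
# Toy sketch: checking whether eval results form a plausible capability
# profile that is also consistent across independent instances of a model.
# Difficulty tiers, thresholds, and names are illustrative assumptions only.
from statistics import mean, stdev

# results[tier] = accuracies from several independently sampled instances,
# e.g. {"easy": [0.92, 0.90, 0.91], "hard": [0.35, 0.62, 0.41], ...}
Results = dict[str, list[float]]

DIFFICULTY_ORDER = ["easy", "moderate", "hard", "extreme"]  # assumed tiers


def plausibility_flags(results: Results,
                       max_inversion: float = 0.05,
                       max_instance_spread: float = 0.10) -> list[str]:
    """Flag capability profiles that look implausible or poorly coordinated.

    Both thresholds are placeholder values, not calibrated numbers.
    """
    flags = []
    means = {tier: mean(scores) for tier, scores in results.items()}
    tiers = [t for t in DIFFICULTY_ORDER if t in means]

    # 1) Accuracy should roughly decline with difficulty; a large inversion
    #    (a harder tier scoring well above an easier one) is suspicious.
    for easier, harder in zip(tiers, tiers[1:]):
        if means[harder] > means[easier] + max_inversion:
            flags.append(f"inversion: {harder} outscores {easier}")

    # 2) Independent instances of a sandbagging model must coordinate on how
    #    much to underperform; a wide spread on the same tier suggests the
    #    simulated profile is not being coordinated successfully.
    for tier, scores in results.items():
        if len(scores) > 1 and stdev(scores) > max_instance_spread:
            flags.append(f"instance disagreement on {tier} tier")

    return flags
```

The point is just that the more independent axes the suite covers (difficulty tiers, capability areas, instances), the more constraints a sandbagging model has to satisfy simultaneously.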
Indeed, current models are terrible at this! Still, worth keeping an eye on it, as it would complicate dangerous capability evals quite a bit should it emerge.