For this reason, regulatory thresholds should ideally use benchmarks that cover a wide range of difficulties, including extremely difficult questions.
Yes, and a further point on the importance of a range of difficulties: for a model to sandbag successfully across a carefully designed eval suite, it must simulate a plausible capability profile and coordinate that simulation across independent instances. Very difficult or dangerous evals are easy to coordinate on (every instance knows it should sandbag), but when presented with moderate-difficulty evals spanning various capabilities, which may be concerning only in aggregate, it is much less obvious which evals to coordinate on and sandbag (the toy sketch below illustrates this). Note also that a model's self-awareness of its ability to simulate a plausible capability profile matters for its decision to even attempt sandbagging!
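As a rough illustration of why this coordination fragments, here is a minimal Python sketch, assuming each instance sees only a noisy private estimate of an eval's true difficulty and sandbags iff that estimate exceeds a shared Schelling threshold. All constants (threshold, noise level, instance count) are hypothetical numbers chosen for illustration, not properties of any real eval suite.

```python
import random

# Toy model of Schelling coordination across independent instances.
# Assumption: each instance sees only a noisy private estimate of an
# eval's true difficulty, and sandbags iff that estimate exceeds a
# shared threshold ("sandbag on clearly dangerous evals").
# All constants below are hypothetical, chosen purely for illustration.

SANDBAG_THRESHOLD = 0.8   # shared Schelling convention
NOISE = 0.1               # per-instance error when judging difficulty
N_INSTANCES = 100

def fraction_sandbagging(true_difficulty: float) -> float:
    """Fraction of independent instances that choose to sandbag this eval."""
    decisions = [
        random.gauss(true_difficulty, NOISE) > SANDBAG_THRESHOLD
        for _ in range(N_INSTANCES)
    ]
    return sum(decisions) / N_INSTANCES

for difficulty in (0.95, 0.75, 0.50):
    print(f"difficulty={difficulty:.2f}: "
          f"{fraction_sandbagging(difficulty):.0%} of instances sandbag")

# Typical run: roughly 93% sandbag at difficulty 0.95 and ~0% at 0.50,
# but a messy ~30% split at 0.75 -- the instances fail to present a
# coherent capability profile exactly on the moderate-difficulty evals.
```

The evaluator never needs to catch any single instance lying; the incoherence across instances (strong on some moderate evals, inexplicably weak on comparable ones) is itself the detection signal, which is exactly why a suite spanning the full difficulty range is valuable.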
Agreed that successful sandbagging will likely require Schelling coordination, and my guess is that this will be extremely difficult for models to pull off! Great to see that you’re investigating this topic.
Indeed, current models are terrible at this! Still, worth keeping an eye on it, as it would complicate dangerous capability evals quite a bit should it emerge.