For this reason, regulatory thresholds should ideally use benchmarks that cover a wide range of difficulties, including extremely difficult questions.
Yes, and a further point on the importance of a range of difficulties: for a model to sandbag successfully across a carefully designed eval suite, it must simulate a plausible capability profile and coordinate that simulation across independent instances. Very difficult or dangerous evals are easy to coordinate on (every instance knows it should sandbag), but when presented with moderate-difficulty evals spanning various capabilities, which may be concerning only in aggregate, it is much less obvious which evals to coordinate on and sandbag (the toy sketch below illustrates this). Note also that a model's self-awareness of its ability to simulate a plausible capability profile matters for its decision to even attempt sandbagging!
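As a rough illustration of why this coordination fragments, here is a minimal Python sketch, assuming each instance sees only a noisy private estimate of an eval's true difficulty and sandbags iff that estimate exceeds a shared Schelling threshold. All constants (threshold, noise level, instance count) are hypothetical numbers chosen for illustration, not properties of any real eval suite.

```python
import random

# Toy model of Schelling coordination across independent instances.
# Assumption: each instance sees only a noisy private estimate of an
# eval's true difficulty, and sandbags iff that estimate exceeds a
# shared threshold ("sandbag on clearly dangerous evals").
# All constants below are hypothetical, chosen purely for illustration.

SANDBAG_THRESHOLD = 0.8   # shared Schelling convention
NOISE = 0.1               # per-instance error when judging difficulty
N_INSTANCES = 100

def fraction_sandbagging(true_difficulty: float) -> float:
    """Fraction of independent instances that choose to sandbag this eval."""
    decisions = [
        random.gauss(true_difficulty, NOISE) > SANDBAG_THRESHOLD
        for _ in range(N_INSTANCES)
    ]
    return sum(decisions) / N_INSTANCES

for difficulty in (0.95, 0.75, 0.50):
    print(f"difficulty={difficulty:.2f}: "
          f"{fraction_sandbagging(difficulty):.0%} of instances sandbag")

# Typical run: roughly 93% sandbag at difficulty 0.95 and ~0% at 0.50,
# but a messy ~30% split at 0.75 -- the instances fail to present a
# coherent capability profile exactly on the moderate-difficulty evals.
```

The evaluator never needs to catch any single instance lying; the incoherence across instances (strong on some moderate evals, inexplicably weak on comparable ones) is itself the detection signal, which is exactly why a suite spanning the full difficulty range is valuable.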
Agreed that successful sandbagging will likely require Schelling coordination, and my guess is that this will be extremely difficult for models to pull off! Great to see that you’re investigating this topic.
Indeed, current models are terrible at this! Still, worth keeping an eye on it, as it would complicate dangerous capability evals quite a bit should it emerge.