If the AI can create a fake solution that feels more real than the actual one, then I don't think the task is checkable by Joe's definition.
That would make the domain of checkable tasks rather small.
That said, it may not matter depending on the capability you want to measure.
If you want the AI to hack a computer and turn the entire screen green, and it skips a pixel to avoid completing the task, it has still demonstrated the dangerous capability, so it has no reason to sandbag.
On the other hand, if you are trying to see whether it has a capability that you want it to use, it can still sandbag.