Chris_Leong comments on The “no sandbagging on checkable tasks” hypothesis

Chris_Leong 1 Aug 2023 18:42 UTC
LW: 2 AF: 2
It’s not clear to me that the space of things you can verify is in fact larger than the space of things you can do, because an AI might be able to create a fake solution that feels more real than the actual solution. At a sufficiently high level of AI intelligence, avoiding these tricks is likely harder than simply doing the task would have been had you not been subject to malign influence.
- Buck 29 Sep 2023 3:02 UTC
  LW: 3 AF: 2
  If the AI can create a fake solution that feels more real than the actual solution, I think the task isn’t checkable by Joe’s definition.
  - Chris_Leong 18 Mar 2025 12:59 UTC
    LW: 2 AF: 1
    That would make the domain of checkable tasks rather small.
    
    That said, it may not matter depending on the capability you want to measure.
    Suppose you want to make the AI hack a computer to turn the entire screen green, and it skips a pixel so as to avoid completing the task. It would still have demonstrated that it possesses the dangerous capability, so it has no reason to sandbag.
    
    On the other hand, if you are trying to see whether it has a capability that you wish it to use, it can still sandbag.