I used to think that AI models weren’t smart enough to sandbag. But less intelligent animals can sandbag: e.g., an animal that apparently can’t do something turns out to be able to when doing it lets them escape, access treats, or otherwise earn outsized rewards. Presumably this happens without an inner monologue or a strategic decision to sandbag. If so, AI models are already plausibly smart enough to sandbag in general, without it being detectable in chain-of-thought, and then perform better when a high-value opportunity comes along.
Anecdotes: certain dogs learn to act uncomprehending, since it gives them license to stubbornly pursue goals they have been warned off of. You can catch them at it, but you have to be wise to it.