I think it’s possible that an AI will decide not to sandbag (e.g. on alignment research tasks), even if all of the following are true:
Goal-guarding is easy
The AI is a schemer (see here for my model of how that works)
Sandbagging would benefit the AI’s long-term goals
The deployer has taken no countermeasures whatsoever
The reason is as follows:
Even a perfect training-gamer will have context-specific heuristics which sometimes override explicit reasoning about how to get reward (as I argued here).
On the training distribution, that override will happen at the “correct” times for getting maximum reward. But sandbagging in deployment is off the training distribution, so it’s a question of generalization.
Since sandbagging is the sort of thing that would get low reward in the most similar training contexts, it seems pretty plausible that the AI’s context-specific “perform well” drives will override its long-term plans in this case.
This seems less likely the harder the problem is, and therefore the more the AI needs to use its general intelligence or agency to pursue it; those are often exactly the tasks we're most scared about the AI doing surprisingly well on.
I agree this argument suggests we will have a good understanding of the model's simpler capabilities, like which facts about biology it knows, which may end up being useful anyway.
On top of what Garrett said, reflection also pushes against this pretty hard. An AI that has gone through a few situations where it has acted against its own goals because of “context-specific heuristics” will be motivated to remove those heuristics, if that is an available option.