Responding to your point about low-stakes control:
I weakly predict that sandbagging at ~human level (and even somewhat beyond it) will be hard for AI models. Here are some reasons why I’m optimistic about this:
Sandbagging to below human level seems really hard:
There are a bunch of debate-flavored techniques that could let us do self-supervision.
You might be worried about the AIs colluding if we do self-supervision, but training can incentivize the AIs to defect against each other, and collusion might be hard, especially if we have some information shielding (a rough sketch of such a setup is below).
Weak-to-strong generalization should mean that the AI’s actual performance ends up better than our reward signal, because some of the errors in our reward signal will be hard for the AI to model, so it generalizes to the intended task rather than to those errors.
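To make the debate-flavored self-supervision point above a bit more concrete, here’s a minimal sketch. Everything in it (the `solver`/`critic` objects, `verify_flaw`, the reward numbers) is a hypothetical placeholder rather than a real API; the point is just that the critic is rewarded for exposing independently verifiable flaws while seeing only shielded context, which pushes toward defection rather than collusion.

```python
# Hypothetical sketch of debate-flavored self-supervision with an
# anti-collusion incentive. None of these objects or helpers are a real
# API; `solver`, `critic`, and `verify_flaw` are placeholders for "call
# the model" and "independently check the critique".

def self_supervision_step(solver, critic, task, verify_flaw):
    """One round: the solver proposes, the critic attacks with shielded context."""
    solution = solver.generate(task.prompt)

    # Information shielding: the critic sees only the task and the final
    # solution, not the solver's scratchpad, which makes coordinated
    # collusion harder.
    critique = critic.generate(
        f"Task: {task.prompt}\nSolution: {solution}\n"
        "Point out the most serious flaw, if any."
    )

    # Only reward the critic for flaws we can verify independently
    # (e.g., a failing unit test or a checkable counterexample).
    flaw_is_real = verify_flaw(task, solution, critique)

    # Training incentivizes defection: the critic gains by exposing real
    # flaws, and the solver gains by producing solutions with no real flaws.
    critic_reward = 1.0 if flaw_is_real else -0.1
    solver_reward = -1.0 if flaw_is_real else 1.0
    return solver_reward, critic_reward
```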
One reason exploration hacking might be hard is that many tasks exist on a continuum of difficulty (h/t Vivek). So the AI will probably have some task-difficulty level at which it is unsure whether or not it should exploration hack. If it sandbags too aggressively, training will remove the sandbagging; if it doesn’t sandbag aggressively enough, then that’s what we wanted in the first place. But if there’s some region of task-difficulty space where the AI is not consistently sandbagging, then RL should just work in that regime, and this should push upwards the frontier of problem difficulty at which the AI is unsure whether to sandbag. Then you could iterate (a rough sketch of this loop is below). This could solve exploration hacking (but not the fact that our reward signals are bad).
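Here’s a rough sketch of that iteration. All the helpers (`sample_tasks_at`, `consistency`, `rl_train`) are made up; the sketch just shows the loop structure: do RL in whatever difficulty band the model’s effort is inconsistent in, and move the band upward once effort is consistent there.

```python
# Made-up sketch of the "push the frontier" loop against exploration
# hacking. `sample_tasks_at`, `consistency`, and `rl_train` are
# placeholders; the point is just the loop structure.

def push_sandbagging_frontier(model, sample_tasks_at, consistency, rl_train,
                              start_difficulty=0.0, step=0.1, max_rounds=50):
    difficulty = start_difficulty
    for _ in range(max_rounds):
        tasks = sample_tasks_at(difficulty)

        # If the model already tries hard consistently at this difficulty,
        # there's nothing to fix here; move the band upward.
        if consistency(model, tasks) > 0.95:
            difficulty += step
            continue

        # If effort is inconsistent, ordinary RL has successful trajectories
        # to reinforce, so it should recover full effort in this band;
        # we re-measure at the same difficulty on the next round.
        model = rl_train(model, tasks)
    return model
```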
The downsides of relying on off-policy data could possibly be mitigated in several ways:
We could do behavioral cloning, or iterative variants like DAgger (sketched below).
We could use a reference solution to guide or grade an AI’s solution.
We could generate a bunch of human training data on doing alignment research; for instance, we could ask Anthropic alignment science researchers to record themselves doing that research.
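For the DAgger-style option above, here’s a generic sketch of the loop (with made-up helper names and a simplified environment interface, not anyone’s actual pipeline). The point is that experts label the states the current policy actually visits, so the imitation data stays close to on-policy.

```python
# Generic DAgger-style imitation loop with made-up helper names and a
# simplified environment interface (env.step here just returns the next
# state and a done flag). The key idea: experts label the states the
# *current* policy actually visits, so the imitation data stays close
# to on-policy instead of drifting off-distribution.

def dagger(policy, env, expert_label, supervised_train,
           n_iters=10, rollouts_per_iter=100):
    dataset = []  # aggregated (state, expert_action) pairs
    for _ in range(n_iters):
        for _ in range(rollouts_per_iter):
            state, done = env.reset(), False
            while not done:
                # Ask the expert (e.g., a human demonstrator) what they
                # would do in the state the current policy reached.
                dataset.append((state, expert_label(state)))
                # But keep following the current policy's own action.
                state, done = env.step(policy.act(state))

        # Retrain the policy by supervised learning on everything so far.
        policy = supervised_train(policy, dataset)
    return policy
```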
Plausibly the fact that sandbagging at human level is hard is sufficient: hopefully, getting to handoff doesn’t require paradigm-shifting research. If it does, we might be in more trouble.