Prover-verifier games as an alternative to AI control.
AI control has been suggested as a way of safely deploying highly capable models without the need for rigorous proof of alignment. This line of work is likely quite important in worlds where we do not expect to be able to fully align frontier AI systems.
The formulation depends on having access to a weaker, trusted model alongside the more capable, untrusted one. Recent work proposes and evaluates several specific control protocols; ‘resampling’ is found to be particularly effective. (Aside: this is consistent with ‘ensembling’ often yielding performance gains.)
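To make the resampling idea concrete, here is a minimal sketch of what such a protocol could look like. Everything in it is an assumption for illustration: the function names (`resample_protocol`, `untrusted_propose`, `trusted_suspicion`), the single threshold, and the "escalate" fallback are hypothetical and not taken from any specific paper's implementation.

```python
from typing import Callable, List, Tuple


def resample_protocol(
    task: str,
    untrusted_propose: Callable[[str], str],          # hypothetical: capable but untrusted model
    trusted_suspicion: Callable[[str, str], float],   # hypothetical: weaker trusted monitor, score in [0, 1]
    defer_threshold: float = 0.5,                     # assumed cutoff, would be tuned in practice
    max_samples: int = 3,
) -> Tuple[str, str]:
    """Return (decision, action), where decision is 'accept' or 'escalate'."""
    if max_samples < 1:
        raise ValueError("max_samples must be at least 1")

    scored: List[Tuple[float, str]] = []
    for _ in range(max_samples):
        action = untrusted_propose(task)
        score = trusted_suspicion(task, action)
        if score < defer_threshold:
            # The monitor finds this sample unsuspicious, so use it.
            return "accept", action
        scored.append((score, action))

    # Every sample looked suspicious: hand the least suspicious one
    # to a stronger check (e.g. human audit) instead of executing it.
    least_suspicious = min(scored, key=lambda pair: pair[0])[1]
    return "escalate", least_suspicious
```

With stub models, a first suspicious proposal is simply discarded and a later benign one is accepted; if all proposals stay above the threshold, the protocol escalates rather than acting.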
Prover-verifier games have a similar setup involving large, capable models (‘provers’) and smaller auxiliary models (‘verifiers’). The verifiers are (adversarially) trained to be effective classifiers of ‘reasoning correctness’. One benefit of this procedure is that provers trained this way produce more human-legible solutions, i.e. humans will readily accept correct solutions and reject similar-looking incorrect ones (put another way, the reasoning is transparent enough that humans can identify when subtle errors are introduced).
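As a rough illustration of the adversarial training signal, here is a sketch of per-sample objectives for the two players. It assumes the prover alternates between a "helpful" role (aiming for correct solutions) and a "sneaky" role (aiming for convincing incorrect ones); the exact reward shape, the `-1.0` penalty, and the function names are my own simplifications, not the formulas from the PVG literature.

```python
import math


def prover_reward(role: str, is_correct: bool, verifier_score: float) -> float:
    """Toy prover reward; verifier_score is the verifier's P(correct) in [0, 1]."""
    if role not in ("helpful", "sneaky"):
        raise ValueError(f"unknown role: {role}")
    # Helpful provers should be correct; sneaky provers should be incorrect.
    role_aligned = is_correct if role == "helpful" else not is_correct
    # Role-aligned solutions are rewarded by how convincing the verifier finds them;
    # anything else gets a flat penalty (assumed value).
    return verifier_score if role_aligned else -1.0


def verifier_loss(verifier_score: float, is_correct: bool) -> float:
    """Binary cross-entropy of the verifier's score against the true correctness label."""
    eps = 1e-12
    p = min(max(verifier_score, eps), 1.0 - eps)
    return -math.log(p) if is_correct else -math.log(1.0 - p)
```

The point of the sketch is the tension: the sneaky prover is paid for fooling the verifier, while the verifier's loss is lowest when it scores sneaky (incorrect) solutions near 0 and helpful (correct) ones near 1.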
Notable differences between these two:
PVGs are a training procedure whereas AI control is a test-time procedure. Still, it’s possible that inference-time algorithms (or even test-time training) in the style of PVGs may constitute valid AI control protocols.
Current work on PVGs mostly addresses reasoning correctness, i.e. faithfulness of the CoT. It’s unclear whether we expect the approach to also work for general alignment.
I’d note that we consider AI control to include evaluation-time measures, not just test-time measures. (For instance, we consider adversarial evaluation of an untrusted monitor in the original control paper.)
(We can also model training from a black-box control perspective by being conservative about inductive biases. For supervised fine-tuning, assuming no gradient hacking, we can assume that training forces the AI to perform well on average on tasks indistinguishable from the training distribution; training on new data can also be considered evaluation. For RL the picture is more complex due to exploration hacking.)
I need to get around to writing up the connection between PVGs and AI control. There’s definitely a lot of overlap, but the formalisms are fairly different and the protocols can’t always be directly transferred from one to the other.
There are a few AI control projects in the works that make the connection between AI control and PVGs more explicit.
EDIT: Oh actually @Ansh Radhakrishnan and I already wrote up some stuff about this, see here.
Thanks for bringing this up!