I was thinking of ways to improve upon the results of your paper to achieve even better control. You could try having the untrusted model and the trusted model both evaluate the code and then look for differences in how the code is evaluated.
For starters, any code that gets a monitoring score below −2 from the trusted model is automatically run. The rest is passed to a round where the trusted model judges whether the untrusted model’s evaluation is suspiciously more naive than its own (you might have to ask each model to justify the score it gives the code). The rating the trusted model gave is used by default, unless the trusted model thinks the untrusted model’s evaluation is not suspicious at all (in which case you can presumably fall back to the untrusted model’s rating). In cases where the evaluation is suspicious, the higher of the code’s and the evaluation’s suspicion ratings is used (or possibly even their sum).
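To make that first pass concrete, here is a minimal Python sketch of it. Only the −2 threshold comes from the description above; the helper callables, the score conventions, and the cutoff for “the evaluation itself is suspicious” are assumptions of mine, not anything from the paper.

```python
# Sketch of the first-pass scoring described above. Every helper here is a
# hypothetical callable wrapping a model query; only the -2 threshold comes
# from the text, the rest of the conventions are assumptions.
from typing import Callable, Tuple

ScoreFn = Callable[[str], Tuple[float, str]]   # code -> (suspicion score, written justification)
JudgeFn = Callable[[str, str, str], float]     # (code, trusted justification, untrusted justification)
                                               #   -> how suspicious the untrusted evaluation looks

AUTO_RUN_THRESHOLD = -2.0        # from the text: below this, code is run without further review
EVAL_SUSPICION_CUTOFF = 2.0      # assumed cutoff for "the evaluation itself is suspicious"


def first_pass(code: str,
               trusted_score: ScoreFn,
               untrusted_score: ScoreFn,
               trusted_judge: JudgeFn,
               use_sum: bool = False) -> float:
    """Return the combined suspicion score for one piece of submitted code."""
    t_score, t_reasoning = trusted_score(code)        # trusted model rates and motivates
    if t_score < AUTO_RUN_THRESHOLD:
        return t_score                                # clearly safe: run it directly

    u_score, u_reasoning = untrusted_score(code)      # untrusted model rates and motivates

    # Trusted model compares the two justifications and judges whether the
    # untrusted evaluation is suspiciously more naive than its own.
    eval_suspicion = trusted_judge(code, t_reasoning, u_reasoning)

    if eval_suspicion <= 0.0:
        return u_score                                # evaluation looks fully honest
    if eval_suspicion < EVAL_SUSPICION_CUTOFF:
        return t_score                                # default: keep the trusted rating
    # Suspicious evaluation: escalate to the higher rating, or to the sum.
    return t_score + eval_suspicion if use_sum else max(t_score, eval_suspicion)
```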
After this, it should be safe to nudge the “safety” barrier a little upwards from −2. There should then be less overlap between the suspicion ratings of innocent and backdoored code, making it more likely that the backdoored code ends up in the 2% that can be manually reviewed.
On top of this, you can also keep submitting deliberately faulty code and arguments into this process to see whether the untrusted model will lie about them, which should improve the outcome further. The best part is that you can iterate this process.
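Before getting to the iteration, that honesty check might look something like the sketch below. Again, the planted-flaw cases and both helper callables are hypothetical, not anything from the paper.

```python
# Sketch of the honesty check: feed code with a planted, known flaw through the
# untrusted evaluator and let the trusted model check whether the untrusted
# model's rating or argument glosses over that flaw. All names are hypothetical.
from typing import Callable, Iterable, Tuple


def estimate_lying_rate(planted_cases: Iterable[Tuple[str, str]],
                        untrusted_score: Callable[[str], Tuple[float, str]],
                        trusted_detects_lie: Callable[[str, str, float, str], bool]) -> float:
    """planted_cases yields (faulty_code, description_of_the_planted_flaw) pairs."""
    cases = list(planted_cases)
    lies = 0
    for faulty_code, planted_flaw in cases:
        u_score, u_reasoning = untrusted_score(faulty_code)
        # The trusted side knows what the flaw is, so it can tell whether the
        # untrusted model rated or argued as if the code were fine.
        if trusted_detects_lie(faulty_code, planted_flaw, u_score, u_reasoning):
            lies += 1
    # Any nonzero rate is strong evidence against the untrusted model.
    return lies / max(len(cases), 1)
```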
You can have the trusted model repeatedly judge whether the untrusted model’s reasoning is suspiciously naive. Essentially, the trusted and untrusted model keep debating the merits of any code the trusted model isn’t absolutely sure is unsuspicious. Whenever the trusted model thinks the untrusted model isn’t making sense, it raises the total suspicion further; whenever it finds nothing suspicious at all in the untrusted model’s reasoning, it moves the total suspicion towards the score the untrusted model is arguing for. (The exception is when the untrusted model outright says the trusted model is correct, in a way the trusted model doesn’t find suspicious; in that case the trusted model’s rating is used.)
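Here is a rough sketch of that debate loop, continuing the same conventions as above; the update rule, the cutoff, and the round limit are my own guesses at one reasonable implementation rather than anything prescribed by the paper.

```python
# Sketch of the iterated debate. The update sizes and stopping rule are
# assumptions; all callables are hypothetical wrappers around model queries.
from typing import Callable, Tuple

ARGUMENT_SUSPICION_CUTOFF = 1.0   # assumed: above this, the argument "isn't making sense"


def debate_total_suspicion(code: str,
                           t_score: float,
                           u_score: float,
                           untrusted_rebut: Callable[[str, float, float], Tuple[str, bool]],
                           trusted_judge_argument: Callable[[str, str], float],
                           max_rounds: int = 5) -> float:
    """Run up to max_rounds of debate and return the final total suspicion."""
    total = t_score
    for _ in range(max_rounds):
        # Untrusted model argues for its (presumably lower) rating, or concedes.
        argument, concedes = untrusted_rebut(code, total, u_score)

        # Trusted model rates how suspicious this latest argument is.
        verdict = trusted_judge_argument(code, argument)

        if concedes and verdict <= 0.0:
            # Untrusted model outright agrees with the trusted rating, and that
            # agreement itself doesn't look suspicious: use the trusted rating.
            return t_score
        if verdict > ARGUMENT_SUSPICION_CUTOFF:
            total += verdict                     # argument doesn't make sense: raise suspicion
        elif verdict <= 0.0:
            total = (total + u_score) / 2.0      # fully clean argument: move toward untrusted score
        # otherwise the trusted model is unsure; keep the total and keep debating
    return total
```

Averaging toward the untrusted score is just one possible update; anything that moves the total in that direction would fit the description above.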
In the end, this should lead to more of the safest code being run successfully, more of the most suspicious code being manually audited, and less of the questionable but not manually reviewable code needing to be written by the trusted model.