Low-stakes problems arise for high-stakes deployments
Even when your deployment is high-stakes, you will need to do some research-y activities to prepare for it, like experimenting to find which protocol works best for your new model and deployment. This research can obviously be sabotaged by any AIs you are using as research assistants. But it can also be sabotaged by the AIs you are using as test subjects! Both of these are low-stakes problems that complicate iteration on high-stakes measures. It’s unclear how soon they will actually start to bite.
Here’s one way this “sabotage by the test subject” could look:

The model can sometimes notice which sort of protocol we’re applying (for example, by noticing whether some past actions in its trajectory have been rewritten). A malicious model could thus deliberately sandbag on main tasks (or boost performance on side tasks) when it notices the best protocols are in place, so that we conclude those protocols are actually worse and stop using them. This is a low-stakes problem, because how likely you are to make the wrong call (and thus lose a bunch of utility) is roughly linear in how many times the model successfully misdirects you (up to an upper bound, but that bound’s pretty high).
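To make this concrete, here’s a toy simulation (all the numbers are invented for illustration, not taken from any real control experiment): a schemer that sandbags only when it detects the stronger protocol makes that protocol’s measured usefulness drop by much more than its true overhead.

```python
import random

random.seed(0)

# Invented parameters, for illustration only.
HONEST_SUCCESS = 0.80    # main-task success rate when the model actually tries
STRONG_OVERHEAD = 0.03   # true usefulness cost of the stronger protocol
DETECT_PROB = 0.60       # chance the schemer notices the stronger protocol
SANDBAG_PENALTY = 0.10   # extra failure rate it adds when it notices

def run_trial(protocol: str, scheming: bool) -> bool:
    """One main task; returns True on success."""
    p = HONEST_SUCCESS
    if protocol == "strong":
        p -= STRONG_OVERHEAD
        if scheming and random.random() < DETECT_PROB:
            p -= SANDBAG_PENALTY  # subtle, deliberate sandbagging
    return random.random() < p

def measured_usefulness(protocol: str, scheming: bool, n: int = 20_000) -> float:
    return sum(run_trial(protocol, scheming) for _ in range(n)) / n

for scheming in (False, True):
    weak = measured_usefulness("weak", scheming)
    strong = measured_usefulness("strong", scheming)
    print(f"scheming={scheming}: weak={weak:.3f} strong={strong:.3f} gap={weak - strong:.3f}")
```

With the honest model the measured gap is roughly the true 3% overhead; with the schemer it roughly triples, even though each individual trial just looks like a success or a failure, which is the sense in which every successful misdirection nudges you linearly toward the wrong call.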
Of course, if the model’s sandbagging is very egregious, we will notice it. But for subtle versions, it seems hard to tell whether what’s bringing usefulness or safety down is the model sandbagging strategically, or just the protocol being annoying. You can gain some more signal by also assessing the protocol’s usefulness and safety in other ways (like thinking about it, or running it with scaffolded weaker models), but these analogues are imperfect.
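One way to squeeze a bit more signal out of the weaker-model analogue is to make the comparison quantitative. A rough sketch (the helper function and all the rates below are hypothetical): estimate the protocol’s usefulness overhead on a trusted weaker model, where strategic sandbagging isn’t a concern, and treat a much larger overhead on the new model as a red flag.

```python
from statistics import NormalDist

def overhead_gap_zscore(weak_base: float, weak_protocol: float,
                        new_base: float, new_protocol: float,
                        n: int) -> float:
    """Crude z-score for 'the new model loses more usefulness under the
    protocol than the trusted weaker model does', with each success rate
    estimated from n independent trials. Ignores most real-world structure."""
    baseline_overhead = weak_base - weak_protocol
    new_overhead = new_base - new_protocol
    diff = new_overhead - baseline_overhead
    # Rough standard error, treating the four estimates as independent binomials.
    var = sum(p * (1 - p) / n for p in (weak_base, weak_protocol, new_base, new_protocol))
    return diff / var ** 0.5

# Hypothetical numbers: the protocol costs the weaker model ~2% usefulness,
# but the new model ~9%. With 5,000 trials per cell, that's quite suspicious.
z = overhead_gap_zscore(0.60, 0.58, 0.80, 0.71, n=5_000)
print(f"z = {z:.1f}, one-sided p ≈ {1 - NormalDist().cdf(z):.2g}")
```

As the caveat above says, though, the analogue is imperfect: a big gap could also mean the protocol genuinely hinders stronger models more, so this is a trigger for investigation rather than a verdict.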
But wait, don’t we get the same problem for red-team elicitation (in our final validation of the chosen protocol)? Don’t we need to fully solve sandbagging anyway?

Well, yes, but ensuring we’ve elicited strongly enough for a given model, setting, and protocol (or at least getting a good estimate of the remaining elicitation gap) might require a strong effort. Maybe you can afford to do this once (or a few times) as final validations, but it’s too much serial time to do it for each protocol the blue team might like to test. So this kind of sabotage can still sway the lower-effort estimates of the blue team.
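To see why, here’s a back-of-envelope with entirely made-up numbers (the candidate count, weeks per evaluation, and deadline are all hypothetical): full elicitation on every candidate blows the schedule, while cheap iteration plus one final validation fits, and that is exactly the regime where sandbagging can sway the cheap estimates.

```python
# Made-up budget, just to show why full elicitation per candidate doesn't fit.
candidates = 12               # protocols the blue team wants to try
cheap_eval_weeks = 0.5        # serial weeks for a lower-effort blue-team estimate
full_elicitation_weeks = 6    # serial weeks to trust red-team elicitation on one protocol
deadline_weeks = 16           # serial time available before the deployment

full_everywhere = candidates * full_elicitation_weeks
iterate_then_validate = candidates * cheap_eval_weeks + full_elicitation_weeks

print(f"full elicitation on every candidate: {full_everywhere} weeks")
print(f"cheap iteration + one final validation: {iterate_then_validate} weeks")
print(f"deadline: {deadline_weeks} weeks")
```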
While a real schemer might behave as described above, the same is not true of your human red team, if you use a separate train-time red team and test-time red team. You can just incentivize the train-time red team exactly the same way you are incentivizing the blue team (they are, functionally, just part of the blue team, trying to maximize their chances in the final validation).

I’m unsure whether this disanalogy is a good thing. I guess the only alternative is something super weird: having the whole red-teaming exercise (with iteration and validation) as your “setting”, and doing some meta-iteration (trying to make your AI deceive iteration as much as possible) and a final meta-validation of the whole “setting”. That’s bonkers. So it’s just better to study research sabotage separately (in better-scoped, non-meta research topics), derive which mitigations are generally useful, and implement them during your iteration for a high-stakes deployment.