Hi Cleo, nice post!
I agree with you that mixed deployments are a good tool to have in our toolbox, but I disagree somewhat with your model of why mixed deployments are good.
My thoughts on whether danger is convex/concave in the fraction of your deployment which is scheming:
Suppose that we have the following two options:
50% chance of a pure aligned deployment, 50% chance of a pure misaligned deployment.
A mixed deployment, where a random 50% of actions are taken by an aligned AI, and the other 50% of actions are taken by a misaligned AI.
Which should we prefer? Let’s consider the threat model of weight exfiltration. Suppose that in the pure deployment world the misaligned AI’s chance of success is 30% if we do vanilla untrusted monitoring. Then, the overall risk in the first situation is 15%, because there was a 50% chance that the AI was aligned. So, we’d prefer the mixed deployment described in situation 2 if and only if the mixed deployment has risk below 15%. There are two reasons why the mixed deployment should have lower risk than a pure misaligned deployment:
In the mixed deployment, the misaligned AI has ½ as many chances to take bad actions compared to a pure misaligned deployment.
Using a powerful aligned model to monitor the misaligned model might be more effective than having a copy of the misaligned model monitor the misaligned model.
I suspect that (1) is a very small effect (I talked to Adam and Cody (some authors on Ctrl-Z), and they thought that doubling deployment length while keeping auditing budget a constant fraction of the deployment doesn’t really change the safety and usefulness much). I think that (2) is a larger effect, but am pretty unsure if it’s enough to halve the risk compared to a pure misaligned deployment (I think we should investigate this empirically).
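To make the break-even arithmetic concrete, here is a tiny sketch. All of the numbers are the illustrative ones from the example above, not real estimates:

```python
# Toy break-even arithmetic for the pure-vs-mixed comparison above.
# All numbers are the illustrative ones from the text, not real estimates.

p_misaligned = 0.5     # prior probability the model is misaligned
p_success_pure = 0.30  # exfiltration success rate in a pure misaligned deployment

# Situation 1: a pure deployment that is aligned or misaligned with equal odds.
risk_pure_lottery = p_misaligned * p_success_pure
print(f"risk of situation 1: {risk_pure_lottery:.0%}")  # 15%

# We prefer the mixed deployment (situation 2) iff its risk is below that 15%.
# Under a naive linear model where risk scales with the misaligned AI's number
# of attack opportunities, halving its actions (effect 1) would exactly hit
# break-even -- but per the Ctrl-Z authors' intuition, effect (1) is much
# smaller than this, since the auditing budget shrinks proportionally too.
risk_if_effect_1_were_linear = 0.5 * p_success_pure
print(f"naive linear effect-(1) risk: {risk_if_effect_1_were_linear:.0%}")  # 15%
```

So whether the mixed deployment beats the pure-deployment lottery comes down almost entirely to how large effect (2) is.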
Why I’m excited about mixed deployments despite uncertainty about the above question:
The situation in practice looks a lot better than a forced choice between a pure deployment that might be aligned / misaligned vs a mixed aligned / misaligned deployment. Here are some reasons why the actual situation is better:
As you mention, we could use one AI (e.g., the least suspicious one) for taking actions, and other AIs for monitoring this AI.
If the AI that is taking actions is aligned, then we are in great shape, even if the monitor AI is misaligned: the monitor seems like it’s probably not in a good position to cause lots of harm here.
If the AI that is taking actions is misaligned, and the monitor is aligned, then it still seems like we’re probably in a better spot than vanilla untrusted monitoring, because the monitor will be trying its best to block or catch bad actions.
If the AI taking actions is misaligned, and the monitor is also misaligned, but they are misaligned in different ways and are less capable of collusion (or less inclined to collude) than if we had done vanilla untrusted monitoring, then we're still in a better spot than we would be counterfactually. For this reason, I think it might be a good idea to train clearly misaligned AI models (e.g., reward seekers) and use these as monitors on some occasions.
We can adjust the parameters of our mixed deployment in response to evidence. If we collect evidence that makes us think that one AI is more likely to be misaligned, we can stop deploying that AI, or only use that AI as a monitor.
Implications for what research should be done
It might be valuable for someone to do a research project investigating how good mixed deployments are, compared to pure aligned and pure misaligned deployments. Specifically, here’s what I’m imagining:
You have a control setting, e.g., “main task: solve APPS coding questions, side task: insert a backdoor”.
The blue team is given two powerful models, each of which is either aligned or misaligned.
The blue team uses some combination of these models to try to solve the coding questions as well as possible, while avoiding backdoors via monitoring. The blue team also wins if they incriminate an AI.
It’s totally fair game for the blue team to decide “this model has done some suspicious things, so we’ll use the other model instead”.
We measure how much safety the blue team can achieve given a certain usefulness requirement.
I think the blue team might be able to do extremely well with mixed deployments, e.g., much better than halving risk compared to a pure misaligned deployment.
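To gesture at how such an experiment might be scored, here is a toy Monte Carlo comparing vanilla untrusted monitoring (a copy of the misaligned model monitors itself) against monitoring by an independently drawn second model. Every parameter here (attack rate, monitor catch rates, episode length) is a made-up placeholder; the actual project would measure these in the control setting:

```python
import random

# Toy Monte Carlo of the proposed control game. All parameters are
# invented placeholders, not measurements from any real setting.

STEPS = 200          # actions per deployment
ATTACK_RATE = 0.02   # chance a misaligned actor attacks on a given step
TPR_ALIGNED = 0.9    # chance an aligned monitor catches an attack
TPR_COLLUDE = 0.1    # chance a misaligned (colluding) monitor flags an attack

def run_deployment(actor_misaligned, monitor_misaligned, rng):
    """Return 'caught', 'breach', or 'safe' for one deployment."""
    tpr = TPR_COLLUDE if monitor_misaligned else TPR_ALIGNED
    for _ in range(STEPS):
        if actor_misaligned and rng.random() < ATTACK_RATE:
            if rng.random() < tpr:
                return "caught"   # blue team wins: the AI is incriminated
            return "breach"       # the attack slips past the monitor
    return "safe"

def risk(independent_monitor, trials=10_000, seed=0):
    """Fraction of trials ending in a breach."""
    rng = random.Random(seed)
    breaches = 0
    for _ in range(trials):
        actor = rng.random() < 0.5        # 50% chance the actor is misaligned
        if independent_monitor:
            monitor = rng.random() < 0.5  # second model, independently drawn
        else:
            monitor = actor               # vanilla: a copy of the actor monitors
        if run_deployment(actor, monitor, rng) == "breach":
            breaches += 1
    return breaches / trials

print("risk, vanilla untrusted monitoring:", risk(independent_monitor=False))
print("risk, independent second monitor:  ", risk(independent_monitor=True))
```

In this toy model the independent monitor roughly halves the breach rate, purely because the monitor has a 50% chance of being aligned whenever the actor is misaligned; a real experiment would additionally capture effect (2) (an aligned monitor being genuinely better at catching attacks) and the blue team's ability to bench a suspicious model.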