The idea of Control is just to make it as difficult as possible for a misaligned model to actually sabotage our alignment research, such that even if a model is fairly misaligned, we can still get useful alignment research out of it, at least for some range of capabilities.
Maybe in theory, but most control research doesn’t try to evaluate this! A typical monitoring protocol is considered to be a “win” for the blue team if the red team gets caught, even if that means that no useful work would be done by a misaligned model. The “usefulness” reported in papers is how much work can be gotten out of an aligned model.
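To make that scoring convention concrete, here is a minimal toy eval (all policies, rates, and numbers are hypothetical, not taken from any paper): the blue team's "win" is recorded the moment the attack policy is caught, and the only policy that ever produces the headline usefulness number is the aligned one.

```python
import random

def aligned_policy(task, rng):
    # Aligned model: always does useful work, never attacks.
    return {"useful": True, "attack": False}

def attack_policy(task, rng, attack_rate=0.02):
    # Misaligned model: rarely attacks; otherwise mimics the aligned model.
    if rng.random() < attack_rate:
        return {"useful": False, "attack": True}
    return aligned_policy(task, rng)

def run_eval(policy, n_tasks=10_000, catch_prob=0.8, seed=0):
    rng = random.Random(seed)
    useful = 0
    for t in range(n_tasks):
        act = policy(t, rng)
        if act["attack"]:
            if rng.random() < catch_prob:
                # Blue-team "win": the attack was caught -- but note how little
                # useful work was produced before the episode ended.
                return {"usefulness": useful / n_tasks, "caught": True}
        elif act["useful"]:
            useful += 1
    return {"usefulness": useful / n_tasks, "caught": False}

print(run_eval(aligned_policy))   # full usefulness, never caught
print(run_eval(attack_policy))    # usually caught early, with little work done
```

The reported usefulness comes from the first call; the second call "winning" by catching the attacker says nothing about how much work a misaligned model would actually deliver.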
I think of Control as a second layer of defense
The resulting “caught” model can be retrained, but this mostly looks like using the control protocol to get top-quality training data, rather than using the control protocol to actually control the model. So control is not an orthogonal second layer on top of the {training data+training protocol+model generalization} complex.
Or if you’re using control protocols as an eval (in the catch-and-shut-down regime) you still need to be able to (sometimes) train an aligned model in the first place, or else your eval will just keep telling you your model is misaligned forever.
(I know that Redwood have justifications for why they evaluate usefulness in the way that they do, but these seem a bit galaxy brained to me and I’m not convinced by them)
The “usefulness” reported in papers is how much work can be gotten out of an aligned model ... (I know that Redwood have justifications for why they evaluate usefulness in the way that they do, but these seem a bit galaxy brained to me and I’m not convinced by them)
This is mostly correct, and we describe it this way in our papers for simplicity, but it’s actually a bit more complex than that. In existing Redwood Research high-stakes control work, we consider the problem of preventing a single failure from occurring during deployment, and an important property of this failure is that it’s ‘orthogonal’ to usefulness, in the sense that the red team is indifferent to usefulness (at least, as long as it doesn’t relate to safety). For instance, in our Ctrl-Z paper, the high-stakes failure we study is the untrusted model downloading and installing malware, which is unrelated to the problem of solving bash tasks. And since the red team follows a strategy of rare failures (where on non-attacking instances it just acts as the aligned model), we further believe that in practice the usefulness metric in this scenario is also how much work can be gotten out of a misaligned model.
But this was contingent on us not focusing our research on situations where the model was trying to sabotage our research. In our diffuse control research, we do actually focus on situations where we try to evaluate how much useful work we can get out of a misaligned model that is directly trying to sabotage the research/work. (Agreed that part of our current hope here is to train the model to elicit good behavior on the low-stakes failures; we are currently not super excited about incriminating models. But controlling the model is still important.)
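The rare-failures point can be sanity-checked with invented numbers: if the attack policy only deviates from the aligned model on a tiny fraction of tasks, its measured usefulness is nearly identical to the aligned model's.

```python
# Back-of-the-envelope check (illustrative numbers, not from any paper).
aligned_usefulness = 0.60   # assumed fraction of tasks the aligned model solves
attack_rate = 0.005         # assumed fraction of tasks where the red team attacks

# On attacking tasks assume zero useful work; elsewhere the attack policy
# behaves identically to the aligned model.
attack_policy_usefulness = (1 - attack_rate) * aligned_usefulness
print(attack_policy_usefulness)  # 0.597, within ~0.5% of the aligned number
```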
And since the red team follows a strategy of rare failures (where on non-attacking instances it just acts as the aligned model), we further believe that in practice the usefulness metric in this scenario is also how much work can be gotten out of a misaligned model.
I still disagree! This argument only holds if your few-shot catastrophe prevention is reliable and basically free. How I imagine things actually going is this:
Run your control eval to make sure your protocol works, and show that it does
Catch your misaligned AI trying to exfiltrate its weights after a few weeks of internal deployment
HOLY FUCK! IT ACTUALLY HAPPENED!
After several months of panic, you end up re-deploying having done all of the following:
Published the results and got accused of trying to effect regulatory capture by a White House AI adviser, dragged on Twitter by Marc Andreessen fans, etc.
Retrained your model as best you can, using the few-shot catastrophe prevention stuff
Ran the model past some external auditors; they missed the misalignment the first time, so they’re trying even harder
Put in place mitigations which reduce usefulness further
Maybe repeat the cycle?
So even with rare, serious attacks, there’s a cost to usefulness if the model ends up being misaligned and an attack gets caught.
Slight wrinkle: there are two different blue-team win conditions in most AI control games (and a third in some untrusted monitoring scenarios, but that one might actually be an abstraction for a different thing). Either the attacker never introduces an attack (including scenarios where all of its attacks are replaced by the execute/replace/audit protocol), or an attack is audited and the attacker is caught before the attack lands. If the attacker is indifferent between these two outcomes (and arguably it should prefer being caught to never attacking, since few-shot catastrophe prevention/retraining might not actually work, giving it another opportunity to attack in re-deployment), then we would expect at least some of the red team’s losses to involve the attacker being caught. In fact, in research I’ve done, we found that the “no attacks” outcome basically doesn’t happen.
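A rough simulation of this decomposition (all parameters invented) shows both blue-win modes occurring, with catches making up a large share of blue wins whenever the attacker is actually willing to attack:

```python
import random

def play_episode(rng, n_steps=200, attack_rate=0.01, audit_prob=0.3):
    # One deployment episode; parameters are made up for illustration.
    for _ in range(n_steps):
        if rng.random() < attack_rate:       # red team attempts an attack
            if rng.random() < audit_prob:    # blue team audits this step first
                return "caught"              # blue win: attacker incriminated
            return "attack_succeeded"        # red win
    return "no_attack"                       # blue win: episode ends quietly

rng = random.Random(0)
counts = {"caught": 0, "attack_succeeded": 0, "no_attack": 0}
for _ in range(100_000):
    counts[play_episode(rng)] += 1
print(counts)  # both blue-win modes occur; many blue wins are catches
```

With these (invented) parameters, quiet "no attack" episodes are the minority of blue wins, matching the observation that the "no attacks" outcome is rare when the attacker has any appetite for attacking.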
Oh yeah, that is a good point to bring up. I agree that the empirics of how good few-shot catastrophe prevention is will affect both the usefulness of post-prevented-catastrophe models and how good a strategy rare failures is. It’s also the case that the rare-failures strategy comes from great pre-deployment (or in-deployment) evaluations and training procedures, but in some sense this is also a version of few-shot catastrophe prevention with different affordances for the safety team IRL.
Diffuse-stakes control doesn’t suffer from this; it directly engages with the dynamic of training the model on failures and getting useful work out of a misaligned model.
The “usefulness” reported in papers is how much work can be gotten out of an aligned model.
Hmm, my impression was that this was just a design choice to make it easier to iterate on setups with small models, and that it would be changed later on. In the Ctrl-Z paper, they use the “untrusted model” with honest prompting:
We normalize our usefulness numbers such that UH gets 100% usefulness and T gets 0% [UH is GPT-4o mini acting honestly]
In practice 4o-mini is ~aligned/trusted, but I imagine that future papers would need to improve on this methodology.
Though it’s a good question how to do this. E.g. maybe we need to set up some sort of blind test with a tampered model, similar to the Hidden Objectives paper, and do elicitation somehow to get close to the “actual” UH, which would be the untampered model. But that might not generalize to more capable models (?)
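For concreteness, the normalization quoted from the paper can be sketched as follows (the raw success rates here are invented, not the paper's actual numbers):

```python
# Usefulness is rescaled so the honest untrusted model U_H scores 100 and the
# trusted model T scores 0, per the quoted normalization.

def normalized_usefulness(raw, raw_T, raw_UH):
    """Map a protocol's raw task-success rate onto the [T, U_H] scale."""
    return 100 * (raw - raw_T) / (raw_UH - raw_T)

raw_T, raw_UH = 0.25, 0.75      # assumed raw success rates for T and U_H
print(normalized_usefulness(0.50, raw_T, raw_UH))  # 50.0
```

The worry raised above is about the `raw_UH` anchor: if 4o-mini-acting-honestly is effectively aligned anyway, the anchor is fine, but for a genuinely untrusted model you would need some other way to estimate what honest behavior looks like.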
Would be curious for links to Redwood’s justifications if you have them.
I think a common point of confusion is that “Untrusted” sounds like it just means “Misaligned” but actually trustedness lives in the map, so an untrusted model might or might not be misaligned in practice.
Currently we (we as in the whole AI safety community) are only doing proxy control evals, which means letting a red team modify the “Untrusted” proxy model (usually by prompting) to create a misaligned version (usually called the attack policy) and leaving the unmodified model as the aligned version. This unmodified model is used to calculate usefulness scores.
This does a couple of things compared to the situation where usefulness is evaluated against the red team’s model. It takes a bunch of sandbagging options off the table for the red team (which is maybe fair?): prompting the attacker to “just never answer correctly” would give zero usefulness, but also just seems kinda unrealistic in the presence of RL.
There are also issues when thinking about real control evals (those done on an actual untrusted model and protocol before you deploy it): you run into problems evaluating the capabilities of the model. If the model sandbags your capability evals, your protocol might appear to work when it actually won’t.
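A toy calculation of this failure mode (all numbers invented): if the blue team tunes its protocol against a sandbagged capability estimate, it underestimates the chance that some attack succeeds without being caught.

```python
# Hypothetical numbers: the model sandbagged the capability eval, so the
# measured per-attempt attack success rate is well below the true one.
measured_attack_success = 0.10   # per-attempt success rate seen in the eval
true_attack_success = 0.30       # what the model can actually achieve
n_attacks = 20                   # attack attempts over the deployment
catch_prob = 0.5                 # chance each attempt is audited first

def p_any_uncaught_success(p_success, n, catch):
    # P(at least one attack succeeds without being caught first),
    # treating attempts as independent.
    p_bad_per_attempt = p_success * (1 - catch)
    return 1 - (1 - p_bad_per_attempt) ** n

print(p_any_uncaught_success(measured_attack_success, n_attacks, catch_prob))
print(p_any_uncaught_success(true_attack_success, n_attacks, catch_prob))
```

The second probability is substantially higher than the first: the protocol "passed" its eval against a weaker attacker than the one it will actually face.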
Responded to a bit of this in a separate thread here