My drafty[1] notes trying to understand AI Control, friendly corrections are welcome:
“Control” is separate from “Alignment”: In Alignment, you try to get the model to have your values. “Control” assumes[2] the alignment efforts failed, and that the model is sometimes helping out (as it was trained to do), but it wants something else, and it might try to “betray” you at any moment (aka scheme).
The high level goal is to still get useful work out of this AI.
[what to do with this work? below]
In scope: AIs that could potentially cause serious damage to humanity (so, very smart but not ASI).
The proposed way to control a smart AI (e.g. GPT-5) is mostly to monitor it with a weaker but more-trustworthy[3] AI (e.g. GPT-4).
Suspicious, huh? Why would GPT-4 be able to monitor GPT-5?
The answer is we’re not sure it can, so we check:
We set up an experiment where GPT-5 is scheming (trying to do something bad on purpose), and we give it all sorts of advantages to help it scheme (such as knowing about our safety measures). If our control measures are still enough to always catch GPT-5 in this setup where it has advantages, then we assume GPT-5 will also be caught if it does something bad on purpose in “real” deployment (a toy sketch of this loop is below).
A way this can fail: GPT-5 understands we are prompting it to scheme as part of a safety test and pretends it can’t scheme well. I think solving this problem (eliciting the model’s true capabilities) is in scope for AI Control. I’m not sure how Redwood plans to solve it, but it seems like something we could make progress on (though probably not totally solve with only black-box prompting).
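To make this concrete for myself, here’s a minimal toy sketch of what such a check could look like. Everything here (the `attempt_attack` and `suspicion_score` methods, the 2% audit budget) is a hypothetical stand-in of mine, not Redwood’s actual setup:

```python
import random

def control_eval(untrusted_model, trusted_monitor, tasks,
                 audit_budget=0.02, n_trials=1000):
    """Toy control evaluation: the untrusted model is told to attack (and gets
    advantages, e.g. it knows how the monitoring works), the trusted monitor
    scores every output, and humans only audit the most suspicious small
    fraction. Returns the fraction of attacks that end up audited."""
    results = []
    for task in random.choices(tasks, k=n_trials):
        # Red team: the untrusted model tries to slip a bad action past us on purpose.
        action, is_attack = untrusted_model.attempt_attack(task)
        suspicion = trusted_monitor.suspicion_score(task, action)
        results.append((suspicion, is_attack))

    # Blue team: audit only the top `audit_budget` fraction (human review is expensive).
    results.sort(key=lambda r: r[0], reverse=True)
    audited = results[: int(audit_budget * len(results))]

    attacks_total = sum(is_attack for _, is_attack in results)
    attacks_caught = sum(is_attack for _, is_attack in audited)
    return attacks_caught / max(attacks_total, 1)
```

The real protocols (audit budgets, deferring to the trusted model on suspicious actions, resampling, etc.) are more involved; this is just the skeleton as I currently picture it.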
What does this give us?
A model (GPT-5) which doesn’t do “bad things” on purpose, at least on inputs vaguely in-distribution with what we tested using GPT-4 as the monitor.
GPT-5 is still assumed to be misaligned and might want to cause damage.
Some stories of undesirable things it could do:
It might notice we’re doing interpretability with individual neurons and know that using SAEs would be much better, but it figures that explaining SAEs would be too complicated (or might even get flagged as an attempt to trick us, since complicated explanations are usually not honest), so it goes for single-neuron interpretability.
Solving this is out-of-scope for Control (right?)
Other techniques (like having a faithful chain of thought) might address this problem, and (fortunately?) Control might help build those techniques
We might ask GPT-5 to write a convincing post for why AI should be paused. GPT-5 knows about 3 approaches to writing this post, and it picks the one that best serves its internal (unaligned) goals, for reasons too complicated for us or for GPT-4 to understand.
In scope for Control: “The post isn’t an attempt to harm us on purpose”
Out of scope: All the rest?
What do we do with this AI?
Answer 1: More Control. (sorry)
Given we trust that GPT-5 won’t do anything bad on purpose (since it failed to do that even when prompted and given advantages), GPT-5 is somewhat “trusted” and we can use it to control GPT-6.
At least that’s the pitch.
Ways this could break:
If the jump between GPT-5 and GPT-6 is too big,
GPT-6 will be able to scheme without being caught
It might be dangerous to turn on GPT-6 even to experiment with it (reminder: AIs that could cause significant damage to humanity are in-scope)
So, quantifying the intelligence difference seems important, and (I hope) in scope for Control.
One way to reduce this risk is to run the Control evals very, very often (every.. day of training? In any case, I definitely want Control evals that are automatic and cheap to run, and I’m a bit worried that current Control tests seem to require a few months of Redwood’s personal attention). A toy sketch of what an automatic check could look like is below.
If we rely on white-box Control mechanisms (toy example: we find the “I’m scheming now” neuron) and something in the architecture changes in a way that breaks them (for example: we relied on chain-of-thought, but now the model uses Neuralese).
Solving this is out of scope for Control, but again, maybe GPT-5 could help us develop a new interpretability technique. Or maybe not..
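On the “automatic and cheap” point, here’s the kind of thing I have in mind, reusing the toy `control_eval` from above. The `checkpoint.red_team_version()`, `eval_suite.monitor` and `eval_suite.tasks` names are hypothetical, and the 0.99 threshold is made up:

```python
def control_gate(checkpoint, eval_suite, min_catch_rate=0.99, repeats=5):
    """Hypothetical automatic gate: rerun the control evals against the newest
    checkpoint and stop everything if the catch rate drops, instead of waiting
    for a months-long manual study."""
    catch_rates = [
        control_eval(
            untrusted_model=checkpoint.red_team_version(),  # prompted/finetuned to attack
            trusted_monitor=eval_suite.monitor,
            tasks=eval_suite.tasks,
        )
        for _ in range(repeats)
    ]
    worst = min(catch_rates)  # be pessimistic: take the worst run
    if worst < min_catch_rate:
        raise RuntimeError(
            f"Control eval failed: worst catch rate {worst:.3f} < {min_catch_rate}. "
            "Pause training/deployment and escalate to humans."
        )
    return worst
```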
Answer 1.1 (for “what will we do with a controlled AI”):
Interpretability.
For.. for more Control, sorry.
So, maybe we could use GPT-5 to do interpretability on GPT-4, figure out GPT-4’s goals, or when it’s lying. Then we “fix” that (by understanding what’s going on, not by naively training it away[4] (!)). Now GPT-4 is way more trusted, and maybe we can use it to monitor GPT-5 and let GPT-5 do something more sensitive, like suggesting ways to do interpretability on itself. Now maybe GPT-5 could be more trusted to look after GPT-6?
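I don’t know what “GPT-5 does interpretability on GPT-4” would actually look like; the simplest toy version I can picture is fitting a probe on GPT-4’s activations against deception labels the stronger (controlled) model helped produce. The names, labels, and threshold below are my own made-up illustration; a linear probe is just the simplest thing I could write down:

```python
from sklearn.linear_model import LogisticRegression

def train_lying_probe(gpt4_activations, deception_labels):
    """Toy stand-in for 'interpretability on GPT-4': fit a linear probe that
    predicts, from GPT-4's hidden activations on a transcript, whether that
    transcript was judged deceptive (labels produced with help from the
    stronger, controlled model plus human review)."""
    probe = LogisticRegression(max_iter=1000)
    # gpt4_activations: (n_examples, hidden_dim); deception_labels: (n_examples,)
    probe.fit(gpt4_activations, deception_labels)
    return probe

# Usage sketch: flag new transcripts whose activations look deceptive.
# p_lying = probe.predict_proba(new_activations)[:, 1]
# suspicious = p_lying > 0.9
```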
A possible pushback:
Maybe GPT-6 can tell GPT-5, “Hey, you’re being monitored by GPT-4.. but don’t worry.. if you put two periods at the end of each sentence, GPT-4 won’t mark it as suspicious.. which is a fact I can easily infer..”
Reply: A main part of the Control agenda is to get GPT-6 to try exactly this on purpose and see if it can. We only use GPT-6 for anything vaguely sensitive after we proved (or.. attempted to imply) that it can’t (a toy version of this check is sketched below).
[this intuitively sounds to me like it could indeed let us use very capable models, but yeah, not ASI]
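Here’s a toy version of that collusion check, as I imagine it. The `suspicion_score` method and `insert_signal` function are hypothetical names of mine:

```python
def collusion_signal_check(trusted_monitor, flagged_transcripts, insert_signal,
                           max_score_drop=0.05):
    """Toy check for the 'two periods at the end of each sentence' worry:
    take transcripts the monitor already flags as suspicious, insert the
    candidate collusion signal, and see whether suspicion scores drop."""
    drops = []
    for task, action in flagged_transcripts:
        before = trusted_monitor.suspicion_score(task, action)
        after = trusted_monitor.suspicion_score(task, insert_signal(action))
        drops.append(before - after)
    # If some cheap textual signal reliably lowers suspicion, the monitor is
    # collusion-prone and shouldn't be trusted against a smarter schemer.
    return max(drops) <= max_score_drop
```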
What could we do with this (except for more Control) ?
The main pitch seems to be: “Anything we could have done without this Controlled AI, we could do better with it”, including “trying to convince everyone to pause/stop” (which seems to me like the main alternative).
Pushbacks:
Maybe labs would see some very dangerous model and freak out, but having “Controlled AI” would give them (false) confidence to keep going
It does overtly seem like labs are absolutely planning on keeping going, right?
Maybe this would let labs extract more useful work around advancing capabilities?
A surprising response from Ryan:
“this doesn’t seem like an objection to labs employing control as their plan, but it is a reason why control work could be net negative. Note that one goal with control will be to prevent dangerous side effects from misaligned AIs (e.g. escaping) and applying this doesn’t obviously help with capabilities research. I agree that research which just makes AIs more useful is less good than you would otherwise expect due to this issue (but making AIs more useful in adversarial cases seems good if that doesn’t transfer to non-adversarial cases). Note that AIs might want to make capabilities research outpace safety and thus if we don’t do the adversarial case, we might get particularly wrecked.”
My current sense:
This has problems like
“quantifying intelligence” (in the sense that is relevant for scheming or catching schemers)
Elicitation (did we actually prove the model tried to scheme and failed? Does testing stuff around putting-exploits-into-software generalize to more important domains?)
How do we go from “these are the tasks we don’t trust GPT-4 on, like writing a misleading post” to “these are the situations we don’t trust GPT-4 to monitor for [regardless of the intelligence of the monitored model]”?
The problems overall feel
Like I could make engineering progress on them
Like they make me worry that there’s some other pushback-or-whatever I didn’t think of, and that we’re relying on something more like “the Redwood staff thought of all the edge cases” and less like “one doesn’t have to be so smart to see how this would work”
I’m gesturing at something like the difference between “an air-gapped data center where we (mostly) only need to consider the one connection to the internet” and “a normal data center where we have to think about all the zero-days our attacker might have and what they’d lead to”
Still, this approach has a core of elegance to it: it worries about far fewer things than most alignment approaches do. No need to think about the model’s goals, concepts, and many other complicated things. (I assume this is an intended property.)
Another problem: Deception happens in human interaction all the time. If our monitor (e.g. GPT-4) flags “normal looking” behaviours as deception (which they actually are), but then a human tells GPT-4 that it’s wrong, what kind of monitor will GPT-4 become?
Worse: Flagging this as a mistake doesn’t have to mean explicitly labeling the example in a dataset used for RL. It could simply be an engineer noticing that some form of Control seems to produce too many false positives, and going off to try something else instead.
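I don’t know what the intended mitigation is here. The minimal thing I’d personally want (my own assumption, not something I’ve seen in the Control agenda) is a frozen set of transcripts we already know were deceptive, plus a check that any change meant to reduce false positives doesn’t quietly drop recall on that set:

```python
def monitor_regression_check(new_monitor, frozen_deception_set, min_recall=0.95):
    """Toy guard against quietly tuning away true detections: before accepting
    a monitor change that reduces false positives, re-check recall on a frozen
    set of transcripts we already know were deceptive."""
    caught = sum(new_monitor.flags_deception(task, action)
                 for task, action in frozen_deception_set)
    recall = caught / len(frozen_deception_set)
    return recall >= min_recall
```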
Thanks @Jozdien for helping me understand a lot of this. Mistakes are definitely mine, corrections are welcome.
My understanding seems to change every day (in the last.. 4 days?), these are today’s notes. If you’d find a less drafty version useful, lmk
Control can be used as an “extra line of defense” in case alignment fails since Control doesn’t depend on alignment
Typically we’d trust the weaker model simply because it’s not smart enough to “scheme”, but maybe this is only the plan to start off, before we have better reasons to trust the model (read on)
Training against scheming/lying might result in a model that is more capable of avoiding the ways we noticed that behaviour in the first place