Here’s a story for how we could get lots of AI help with AI safety research even if schemers are somewhat common and diffuse control doesn’t work to get them to help us:
Let’s say a company successfully solves control for “high-stakes”/“concentrated” threats, meaning that AIs can safely be deployed without a large risk of them hacking their own datacenters or exfiltrating or causing any similar catastrophic action. (At a given level of capability.)
However, let’s say that control fails to solve the problem of getting the AIs to try hard at doing good AI safety research. When asked to do research, perhaps the schemers are sloppy, or perhaps they are actively trying to mislead us.
Still, the AI company can now safely train and deploy lots of different AIs without fear of them exfiltrating or hacking the lab. This gives the AI company lots of “shots on goal” at training AIs that generalize to earnestly helping us with automated alignment research. Maybe they develop a zoo of something like 20 separate countries of geniuses in data centres, including:
Maximally instruction-following AIs, instructed to help us in various ways.
Consequentialist AIs trained to care about human values.
Consequentialist AIs trained to play-the-training-game/scheme on behalf of human values. (So that they don’t get replaced by other schemers. Similar to inoculation prompting.)
AIs trained to just love alignment research (or some other project) in their heart. They’re just like a curious scientist who is obsessed with solving alignment.
AIs designed to be maximally amenable to making deals, and then offered some great payment in exchange for AI safety research progress.
AIs intentionally trained to be reward-seekers — and told that they will get maximum reward if they’re eventually judged to have done excellent AI safety research.
...
Perhaps most of these end up scheming against the AI company — but some end up genuinely aligned.
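To make the “shots on goal” intuition concrete, here is a toy calculation. It’s a minimal sketch in Python; the per-recipe alignment probabilities are entirely made up, and it assumes the recipes succeed or fail independently, which is itself an optimistic assumption.
```python
# Toy "shots on goal" calculation: if each training recipe independently has some
# chance of producing a genuinely aligned (non-scheming) model, the chance that at
# least one recipe works grows quickly with the number of recipes tried.
# The per-recipe probabilities below are made up purely for illustration.

p_aligned_per_recipe = [0.10, 0.15, 0.05, 0.20, 0.10]  # hypothetical odds per recipe

p_all_scheme = 1.0
for p in p_aligned_per_recipe:
    p_all_scheme *= 1.0 - p

print(f"P(every variant schemes):      {p_all_scheme:.2f}")        # ~0.52
print(f"P(at least one aligned model): {1.0 - p_all_scheme:.2f}")  # ~0.48
```
The independence assumption is doing a lot of work here; recipes that fail for correlated reasons give far fewer effective shots.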
The company probably can’t initially tell the schemers apart from the aligned AIs, or indeed, tell which of the projects are succeeding on a day-to-day basis. (It’s just too hard to evaluate which research directions are genuinely promising vs. a waste of time.) But they run them all in parallel, and after several months of massively sped-up work, it’s likely much easier to adjudicate whose solutions make sense to adopt.
This is similar to how, in well-functioning scientific fields, it’s not typically possible to get consensus on what speculative research directions are most promising. But once a research project or whole research agenda is completed, there are normally ways in which the value-add can be demonstrated.
And maybe all that automated labor is enough to save the day.
Downsides with this approach:
If you split your compute across 20 different types of AIs, then your favorite type of AI is going to only have 5% as much compute available to it as it would’ve had if you concentrated your bets. This might be costly.
You’re increasing the risk of deploying at least one schemer. So you’re increasing the risk of high-stakes catastrophes like AIs hacking your data center. (Which is why it’s important that concentrated/high-stakes control is going well enough that the baseline risk is low.)
(Note that you could use a similar method of “wide variety of diverse AIs” as monitors, which could be very helpful for high-stakes control.)
On the bright side, you’re probably increasing your chances of catching at least one AI red-handed, which could be useful for research or for gathering legible evidence that risk is high.
With such evidence in hand, maybe you could convince others to coordinate to slow down, so that you don’t have to pursue the crazy plan of “deploy dozens of different AI models, because any given model is pretty likely to be scheming against you”.
Overall, thinking about this makes me:
Somewhat more bullish on the value of high-stakes/concentrated control research.
More bullish on developing somewhat dubious ideas for how to prevent scheming. Even if an idea might increase scheming risk in expectation (!), if it just has a bunch of uncertainty/randomness that’s independent of other approaches, then it could still add value. (See the toy sketch after this list.)
In discussions about whether AI companies should write their model specs to say X or Y, I’m more inclined to say that, at least internally, they should hedge their bets by deploying models of both the X variant and the Y variant.
(Though depending on what the X/Y disagreement is, this might not resolve the problem. E.g. if the concern is that the X variant will be too easy for AI company insiders to misuse, then having the X model available as one of several models might be almost as bad as having it be the AI company’s only model.)
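As a toy illustration of the independence point in the list above (made-up numbers, and a deliberately simplistic model where each deployed variant independently turns out aligned or not):
```python
# Toy model of the "dubious but independent idea" point: a recipe that is *more*
# likely to produce a schemer than your existing recipes can still raise the chance
# that at least one deployed variant is genuinely aligned, provided its outcome is
# independent of the others. All numbers are made up.

def p_at_least_one_aligned(p_aligned_list):
    p_none = 1.0
    for p in p_aligned_list:
        p_none *= 1.0 - p
    return 1.0 - p_none

baseline = [0.30, 0.30]           # two hypothetical existing recipes
with_dubious = baseline + [0.15]  # a worse-in-expectation but independent recipe

print(f"P(>=1 aligned), baseline:     {p_at_least_one_aligned(baseline):.2f}")      # 0.51
print(f"P(>=1 aligned), plus dubious: {p_at_least_one_aligned(with_dubious):.2f}")  # 0.58
```
The flip side, per the downsides list above, is that the extra recipe also adds one more likely schemer to the deployment, which is why the concentrated-control baseline needs to be solid.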
I explore similar considerations here in The Case for Mixed Deployment (5 min read); the key takeaways were:
Invest in second-best techniques (basically what you say about “developing somewhat dubious ideas for how to prevent scheming”)
Modular systems—i.e. architectures where you can “toggle” some variable at inference-time to give you more variants, e.g. gradient routing
Prioritize late-stage techniques, i.e. techniques which apply late in the AI lifecycle (e.g. post-training, prompting, activation steering, scaffolding), and deprioritise techniques which apply early in the AI lifecycle (e.g. architecture design and pretraining), because this makes it cheaper to develop many variants
However, I think that (contra your proposal) most of the oomph comes from the AIs monitoring and cross-examining each other’s work, rather than running them in parallel. That is, I disagree with “If you split your compute across 20 different types of AIs, then your favorite type of AI is going to only have 5% as much compute available to it as it would’ve had if you concentrated your bets.” This is because I think we can run things like debate between all the variants.
One upshot of this difference is that I expect coordination between labs to matter significantly: if lab A starts with 100% schemers and lab B starts with 5% schemers, then we’ll elicit useful work from the AIs if the labs cross-examine each other’s AI research with their own AI variants.
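For concreteness, here is one shape the cross-examination setup could take. This is a minimal sketch; every name and interface below is hypothetical rather than a description of any existing system:
```python
# Sketch of cross-examination between variants: every variant's research output is
# critiqued by every other variant, and a trusted judge (e.g. humans, possibly aided
# by debate) scores the work in light of those critiques.
# All interfaces below are hypothetical.

from typing import Callable, Dict, List

Variant = str  # e.g. "instruction-following", "deal-making", "reward-seeking"

def cross_examine(
    outputs: Dict[Variant, str],
    critique: Callable[[Variant, str], str],
    judge: Callable[[str, List[str]], float],
) -> Dict[Variant, float]:
    """Score each variant's research after critiques from all other variants."""
    scores: Dict[Variant, float] = {}
    for author, work in outputs.items():
        critiques = [critique(reviewer, work) for reviewer in outputs if reviewer != author]
        scores[author] = judge(work, critiques)
    return scores
```
The hope is that misleading work now has to survive critique from at least one equally capable non-schemer, rather than only having to fool the human judge directly.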
Oh nice, thanks!
However, I think that (contra your proposal) most of the oomph comes from the AIs monitoring and cross-examining each other’s work, rather than running them in parallel.
Yeah, I think the case for diverse monitors to achieve high-stakes control is actually a lot more straightforward. It’s easy to implement a monitor such that it has very little ability to autonomously cause bad things to happen, so basically the only cost is the extra compute it takes to train and run it. And in the high-stakes/concentrated-failures regime, humans will typically be able to recognize attempted catastrophes once they are pointed out to them.
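As a minimal sketch of the “diverse monitors” idea (hypothetical interfaces; the only action a monitor can take here is escalating to human review, matching the point that monitors have little ability to cause harm on their own):
```python
# Sketch of an ensemble of diverse monitors for high-stakes control: each monitor
# independently flags a proposed action, and any single flag escalates the action
# to human review. Monitor implementations and calibration are left unspecified.

from typing import Callable, List

Monitor = Callable[[str], bool]  # returns True if the proposed action looks suspicious

def should_escalate(proposed_action: str, monitors: List[Monitor]) -> bool:
    """Escalate to human review if any of the diverse monitors flags the action."""
    return any(monitor(proposed_action) for monitor in monitors)
```
Because escalation is the only effect a monitor can have in this setup, adding more (possibly scheming) monitor variants mostly costs compute and false positives rather than opening new avenues for catastrophe, and a single honest monitor in the ensemble is enough to flag an attempt.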
I agree that debate-style approaches seem good. I do think it seems a lot more feasible to judge such debates once the AIs have already accomplished a lot of work, vs. early on in the research when AIs are making taste-driven judgments about direction, so it pairs nicely with letting AIs do a lot of work in parallel.
a company successfully solves control for “high-stakes”/“concentrated” threats, meaning that AIs can safely be deployed without a large risk of them hacking their own datacenters or exfiltrating or causing any similar catastrophic action.
This seems like quite a difficult task, especially since we (or some other trusted agent) need to read the outputs of these AIs to get any value from their safety research. Schemers could self-exfiltrate by manipulating us somehow.
The company probably can’t initially tell the schemers apart from the aligned AIs, or indeed, tell which of the projects are succeeding on a day-to-day basis. (It’s just too hard to evaluate which research directions are genuinely promising vs. a waste of time.) But they run them all in parallel, and after several months of massively sped-up work, it’s likely much easier to adjudicate whose solutions make sense to adopt.
This is similar to how, in well-functioning scientific fields, it’s not typically possible to get consensus on what speculative research directions are most promising. But once a research project or whole research agenda is completed, there are normally ways in which the value-add can be demonstrated.
Maybe, but the schemers could optimize their research for looking-good-to-us, and it might be hard to distinguish this from actually good work. That’s a disanalogy with well-functioning scientific fields: scientists don’t deliberately pick bad research directions and try hard to make them seem good.
I think self-exfiltration via manipulation seems pretty hard. I think we’re likely to have transformatively useful systems that can’t do that, for some amount of time. (Especially since there’s no real reason to train them to be good at manipulation, though of course they might generalize from other stuff.) I agree people should definitely be thinking about it as a potential problem and should try to mitigate and estimate the risk.
That’s a disanalogy with well-functioning scientific fields: scientists don’t deliberately pick bad research directions and try hard to make them seem good.
Some scientists do optimize for (something like) “impressiveness” of their work, regardless of whether it’s good or bad. It’s true that they don’t intentionally compromise impressiveness in order to make the work more misleading. That said, if some models optimize for impressiveness, and some models compromise impressiveness to make their work more misleading, then I guess the humans should be somewhat more likely to use the more impressive work? So maybe that limits the degree of influence that intentionally misleading work could have.
I am definitely pretty worried about humans’ ability to judge what work is good or bad. In addition to the disanalogy you mention, I’d also highlight (i) non-experts typically struggle to judge which expert is right in a dispute, and if humans are slow & dumb enough relative to the AIs, humans might be analogous to non-experts, and (ii) it’s plausible to me that human science benefits a lot from the majority of scientists (and maybe especially the most capable scientists?) having at least a weak preference for truth, e.g. because it’s easier to recognize a few bad results when you can lean on the presumed veracity of most established results.
Maybe, but the schemers could optimize their research for looking-good-to-us, and it might be hard to distinguish this from actually good work
If you ask each variant to review the research of the other variants, then the schemers need to optimise their research for looking good to each variant. But the optimistic assumption is that at least one variant is an equally capable non-schemer.
then the schemers need to optimise their research for looking good to each variant
Not necessarily. The nice AIs also need to be able to win a debate against the schemers, as judged by humans. It’s not enough for the variants to be able to recognize poor research if they can’t show their work in a way the schemers can’t refute.