I chatted with John about this at a workshop last weekend, and did update noticably, although I haven’t updated all the way to his position here.
What was useful to me were the gears of:
Scheming has different implications at different stages of AI power.
Scheming is massively dangerous at high power levels. It’s not very dangerous at lower levels (except insofar as this allows the AI to bootstrap to the higher levels)
At the workshop, we distinguished:
weak AGI (with some critical threshold of generality, and some spiky competences such that it’s useful, but not better than a pretty-smart-human at taking over the world)
Von Neumann level AGI (as smart as the smartest humans, or somewhat more so)
Overwhelming Superintelligence (unbounded optimization power which can lead to all kinds of wonderful and horrible things that we need to deeply understand before running)
I don’t currently buy that control approaches won’t generalize from weak AGI to Von Neumann levels.
It becomes harder, and there might be limitations on what we can get out of the Von Neumann AI because we can’t be that confident in our control scheme. But I think there are lots of ways to leverage an AI that isn’t “it goes and up does long-ranging theorizing of it’s own to come up with wholesale ideas you have to evaluate.”
I think figuring out how to do this involves some pieces that aren’t particularly about “control”, but mostly don’t feel mysterious to me.
(fyi I’m one of the people who asked John to write this)
Yeah, John’s position seems to require “it doesn’t matter whether huge numbers of Von Neumann level AGIs are scheming against you”, which seems crazy to me.
No, more like a disjunction of possibilities along the lines of:
The critical AGIs come before huge numbers of von Neumann level AGIs.
At that level, really basic stuff like “just look at the chain of thought” turns out to still work well enough, so scheming isn’t a hard enough problem to be a bottleneck.
Scheming turns out to not happen by default in a bunch of von Neumann level AGIs, or is at least not successful at equilibrium (e.g. because the AIs don’t fully cooperate with each other).
“huge numbers of von Neumann level AGIs” and/or “scheming” turns out to be the wrong thing to picture in the first place, the future is Weirder than that in ways which make our intuitions about von Neumann society and/or scheming not transfer at all.
Pile together the probability mass on those sorts of things, and it seems far more probable than the prototypical scheming story.
In your post, you emphasize the slop problem. I think that the slop problem is probably much harder to solve if those AIs are scheming. I guess you’re saying that it’s just unlikely that the AIs are scheming at the point where you’re worried about the slop problem?
Yeah, basically. Or at least unlikely that they’re scheming enough or competently enough for it to be the main issue.
For instance, consider today’s AIs. If we keep getting slop at roughly the current level, and scheming at roughly the current level, then slop is going to be the far bigger barrier to using these things to align superintelligence (or nearer-but-strong intelligence).
I think part of John’s belief is more like “the current Control stuff won’t transfer to the society of Von Neumanns.”
And, my belief is “I also not sure that Control Stuff will transfer to large societies of Von Neumanns, but, also, I think you can be deliberate about how many Von Neumanns you’re summoning, and how you summon them, such that you can manage it.” (I’m not quite sure what you, Buck, are imagining yet)
My sense is that the control techniques we’ve talked about (e.g. these, and some other stuff that we’ve thought of since then) would indeed drastically reduce the risk posed by those Von Neumanns, especially if you run control evaluations with the Von Neumanns.
For example, consider the problem “I want to ask the AI a question that has a numerical answer and that would take a human 8 hours to answer, and I’m worried that occasionally on a subdomain of its choice it will blatantly lie to me”. If you do no countermeasures, it seems easy for the AI to cause this problem. But if you ask the AI for a justification of its reasoning, and run it twice on the original question and a rephrasing/transformation of the question, it seems way intuitively hard for it to come up with consistent explanations that lead to consistent shared wrong answers (e.g. I think Von Neumann would have substantial trouble with that). So I think that adopting well-tuned variations of basic techniques seems really promising.
But if you ask the AI for a justification of its reasoning, and run it twice on the original question and a rephrasing/transformation of the question, it seems way intuitively hard for it to come up with consistent explanations that lead to consistent shared wrong answers (e.g. I think Von Neumann would have substantial trouble with that).
Strong disagree with this. Probably not the most cruxy thing for us, but I’ll briefly walk through the issue because it is a special case of a common cruxy issue with a lot of prosaic alignment work.
The reason this doesn’t work is that the prototypical “blatant lie” doesn’t look like “the model chooses a random number to output”. The prototypical blatant lie is that there’s a subtle natural mistake one could make in reasoning about the question, the model “knows” that it’s a mistake, but the model just presents an argument with the subtle mistake in it.
Or, alternatively: the model “knows” what kind of argument would seem most natural to the user, and presents that kind of argument despite “knowing” that it systematically overlooks major problems.
Or, alternatively: human experts in the field generally believe a particular wrong answer. Maybe there’s a few dissenters, and the model is clever enough to recognize that the dissenters are correct, but also the model was RLHF’d and got positive feedback for matching supposed-experts’ answers even when those experts were wrong, so it outputs the answer and arguments the human experts in the field would generally give.
These sorts of pressures make it not only easy for the model to generate consistent shared wrong answers, but even (in some situations, like the incorrect human experts example) the default outcome for the model to generate consistent shared wrong answers, even when the model internal “knows” the right answer in some sense.
Yeah I strongly disagree that this failure mode is very likely to be very hard to mostly resolve. I’m doing some research right now that will hopefully shed some light on this!
To be clear, I am not claiming that this failure mode is very likely very hard to resolve. Just harder than “run it twice on the original question and a rephrasing/transformation of the question”.
The reason this doesn’t work is that the prototypical “blatant lie” doesn’t look like “the model chooses a random number to output”. The prototypical blatant lie is that there’s a subtle natural mistake one could make in reasoning about the question, the model “knows” that it’s a mistake, but the model just presents an argument with the subtle mistake in it.
[...]
These sorts of pressures make it not only easy for the model to generate consistent shared wrong answers, but even (in some situations, like the incorrect human experts example) the default outcome for the model to generate consistent shared wrong answers, even when the model internal “knows” the right answer in some sense.
As in, the claim is that there is almost always a “schelling” mistake? Or is the argument mostly that scheming is largely unimportant because false answers will naturally perform much better than true answers in important cases such that considering the adversarial case doesn’t make much difference?
(“Almost always” is because a key thing (that Buck didn’t specify) is that if you catch suspicious behavior you can investigate and/or train. I think the question of how useful it is to catch a large number of cases where the AI does this might be an improtant crux.)
Supposing it is the schelling mistake claim, I’m skeptical.
I also think that we could directly look for potential schelling mistake areas and investigate these specific points in more detail. This might mean the AI also needs to consistently find schelling mistakes which are non-obvious to humans or needs to be very robust to our more targeted techniques applied to the potential problem spots.
As in, the claim is that there is almost always a “schelling” mistake? Or is the argument mostly that scheming is largely unimportant because false answers will naturally perform much better than true answers in important cases such that considering the adversarial case doesn’t make much difference?
I think part of John’s belief is more like “the current Control stuff won’t transfer to the society of Von Neumanns.”
That is also separately part of my belief, medium confidence, depends heavily on which specific thing we’re talking about. Notably that’s more a criticism of broad swaths of prosaic alignment work than of control specifically.
I chatted with John about this at a workshop last weekend, and did update noticably, although I haven’t updated all the way to his position here.
What was useful to me were the gears of:
Scheming has different implications at different stages of AI power.
Scheming is massively dangerous at high power levels. It’s not very dangerous at lower levels (except insofar as this allows the AI to bootstrap to the higher levels)
At the workshop, we distinguished:
weak AGI (with some critical threshold of generality, and some spiky competences such that it’s useful, but not better than a pretty-smart-human at taking over the world)
Von Neumann level AGI (as smart as the smartest humans, or somewhat more so)
Overwhelming Superintelligence (unbounded optimization power which can lead to all kinds of wonderful and horrible things that we need to deeply understand before running)
I don’t currently buy that control approaches won’t generalize from weak AGI to Von Neumann levels.
It becomes harder, and there might be limitations on what we can get out of the Von Neumann AI because we can’t be that confident in our control scheme. But I think there are lots of ways to leverage an AI that isn’t “it goes and up does long-ranging theorizing of it’s own to come up with wholesale ideas you have to evaluate.”
I think figuring out how to do this involves some pieces that aren’t particularly about “control”, but mostly don’t feel mysterious to me.
(fyi I’m one of the people who asked John to write this)
Yeah, John’s position seems to require “it doesn’t matter whether huge numbers of Von Neumann level AGIs are scheming against you”, which seems crazy to me.
No, more like a disjunction of possibilities along the lines of:
The critical AGIs come before huge numbers of von Neumann level AGIs.
At that level, really basic stuff like “just look at the chain of thought” turns out to still work well enough, so scheming isn’t a hard enough problem to be a bottleneck.
Scheming turns out to not happen by default in a bunch of von Neumann level AGIs, or is at least not successful at equilibrium (e.g. because the AIs don’t fully cooperate with each other).
“huge numbers of von Neumann level AGIs” and/or “scheming” turns out to be the wrong thing to picture in the first place, the future is Weirder than that in ways which make our intuitions about von Neumann society and/or scheming not transfer at all.
Pile together the probability mass on those sorts of things, and it seems far more probable than the prototypical scheming story.
In your post, you emphasize the slop problem. I think that the slop problem is probably much harder to solve if those AIs are scheming. I guess you’re saying that it’s just unlikely that the AIs are scheming at the point where you’re worried about the slop problem?
Yeah, basically. Or at least unlikely that they’re scheming enough or competently enough for it to be the main issue.
For instance, consider today’s AIs. If we keep getting slop at roughly the current level, and scheming at roughly the current level, then slop is going to be the far bigger barrier to using these things to align superintelligence (or nearer-but-strong intelligence).
I think part of John’s belief is more like “the current Control stuff won’t transfer to the society of Von Neumanns.”
And, my belief is “I also not sure that Control Stuff will transfer to large societies of Von Neumanns, but, also, I think you can be deliberate about how many Von Neumanns you’re summoning, and how you summon them, such that you can manage it.” (I’m not quite sure what you, Buck, are imagining yet)
My sense is that the control techniques we’ve talked about (e.g. these, and some other stuff that we’ve thought of since then) would indeed drastically reduce the risk posed by those Von Neumanns, especially if you run control evaluations with the Von Neumanns.
For example, consider the problem “I want to ask the AI a question that has a numerical answer and that would take a human 8 hours to answer, and I’m worried that occasionally on a subdomain of its choice it will blatantly lie to me”. If you do no countermeasures, it seems easy for the AI to cause this problem. But if you ask the AI for a justification of its reasoning, and run it twice on the original question and a rephrasing/transformation of the question, it seems way intuitively hard for it to come up with consistent explanations that lead to consistent shared wrong answers (e.g. I think Von Neumann would have substantial trouble with that). So I think that adopting well-tuned variations of basic techniques seems really promising.
Strong disagree with this. Probably not the most cruxy thing for us, but I’ll briefly walk through the issue because it is a special case of a common cruxy issue with a lot of prosaic alignment work.
The reason this doesn’t work is that the prototypical “blatant lie” doesn’t look like “the model chooses a random number to output”. The prototypical blatant lie is that there’s a subtle natural mistake one could make in reasoning about the question, the model “knows” that it’s a mistake, but the model just presents an argument with the subtle mistake in it.
Or, alternatively: the model “knows” what kind of argument would seem most natural to the user, and presents that kind of argument despite “knowing” that it systematically overlooks major problems.
Or, alternatively: human experts in the field generally believe a particular wrong answer. Maybe there’s a few dissenters, and the model is clever enough to recognize that the dissenters are correct, but also the model was RLHF’d and got positive feedback for matching supposed-experts’ answers even when those experts were wrong, so it outputs the answer and arguments the human experts in the field would generally give.
These sorts of pressures make it not only easy for the model to generate consistent shared wrong answers, but even (in some situations, like the incorrect human experts example) the default outcome for the model to generate consistent shared wrong answers, even when the model internal “knows” the right answer in some sense.
Yeah I strongly disagree that this failure mode is very likely to be very hard to mostly resolve. I’m doing some research right now that will hopefully shed some light on this!
To be clear, I am not claiming that this failure mode is very likely very hard to resolve. Just harder than “run it twice on the original question and a rephrasing/transformation of the question”.
As in, the claim is that there is almost always a “schelling” mistake? Or is the argument mostly that scheming is largely unimportant because false answers will naturally perform much better than true answers in important cases such that considering the adversarial case doesn’t make much difference?
(“Almost always” is because a key thing (that Buck didn’t specify) is that if you catch suspicious behavior you can investigate and/or train. I think the question of how useful it is to catch a large number of cases where the AI does this might be an improtant crux.)
Supposing it is the schelling mistake claim, I’m skeptical.
I also think that we could directly look for potential schelling mistake areas and investigate these specific points in more detail. This might mean the AI also needs to consistently find schelling mistakes which are non-obvious to humans or needs to be very robust to our more targeted techniques applied to the potential problem spots.
The latter.
That is also separately part of my belief, medium confidence, depends heavily on which specific thing we’re talking about. Notably that’s more a criticism of broad swaths of prosaic alignment work than of control specifically.