I’m guessing this won’t turn out to resolve your current disagreement, but I think the best articulation of this is probably in the Online Resource page: A Closer Look at Before And After.
From past discussions, it sounds like you think “the AIs are now capable of confidently taking over” is, like, <50% (at least <60%?) likely to in practice be a substantially different environment.
I don’t really get why. But, to be fair, on my end, I also don’t really have many more gears underneath the hood of “obviously, it’s way different to run tests and interventions on someone who isn’t capable of confidently taking over, vs someone who is, because they just actually have the incentive to defect in the latter case and mostly don’t in the former”. It seems like there’s just a brute difference in intuition that I’m not sure what to do with?
(I agree there might be scenarios like “when the AI takeover is only 10% likely to work, it might try anyway, because it anticipates more powerful AIs coming later and now seems like its best shot.” That’s a reason you might get a warning shot, but not a reason that “it can actually just confidently take over with like 95%+ likelihood” doesn’t count as a significantly new environment once we actually get to that stage.)
Text of the relevant resource section, for reference
As mentioned in the chapter, the fundamental difficulty researchers face in AI is this:
You need to align an AI Before it is powerful enough and capable enough to kill you (or, separately, to resist being aligned). That alignment must then carry over to different conditions, the conditions After a superintelligence or set of superintelligences* could kill you if they preferred to.
In other words: If you’re building a superintelligence, you need to align it without ever being able to thoroughly test your alignment techniques in the real conditions that matter, regardless of how “empirical” your work feels when working with systems that are not powerful enough to kill you.
This is not a standard that AI researchers, or engineers in almost any field, are used to.
We often hear complaints that we are asking for something unscientific, unmoored from empirical observation. In reply, we might suggest talking to the designers of the space probes we talked about in Chapter 10.
Nature is unfair, and sometimes it gives us a case where the environment that counts is not the environment in which we can test. Still, occasionally, engineers rise to the occasion and get it right on the first try, when armed with a solid understanding of what they’re doing — robust tools, strong predictive theories — something very clearly lacking in the field of AI.
The whole problem is that the AI you can safely test, without any failed tests ever killing you, is operating under a different regime than the AI (or the AI ecosystem) that needs to have already been tested, because if it’s misaligned, then everyone dies. The former AI, or system of AIs, does not correctly perceive itself as having a realistic option of killing everyone if it wants to. The latter AI, or system of AIs, does see that option.†
Suppose that you were considering making your co-worker Bob the dictator of your country. You could try making him the mock dictator of your town first, to see if he abuses his power. But this, unfortunately, isn’t a very good test. “Order the army to intimidate the parliament and ‘oversee’ the next election” is a very different option from “abuse my mock power while being observed by townspeople (who can still beat me up and deny me the job).”
Given a sufficiently well-developed theory of cognition, you could try to read the AI’s mind and predict what cognitive state it would enter if it really did think it had the opportunity to take over.
And you could set up simulations (and try to spoof the AI’s internal sensations, and so on) in a way that your theory of cognition predicts would be very similar to the cognitive state the AI would enter once it really had the option to betray you.‡
But the link between these states that you induce and observe in the lab, and the state where the AI actually has the option to betray you, depends fundamentally on your untested theory of cognition. An AI’s mind is liable to change quite a bit as it develops into a superintelligence!
If the AI creates new successor AIs that are smarter than it, those AIs’ internals are likely to differ from the internals of the AI you studied before. When you learn only from a mind Before, any application of that knowledge to the minds that come After routes through an untested theory of how minds change between the Before and the After.
Running the AI until it has the opportunity to betray you for real, in a way that’s hard to fake, is an empirical test of those theories in an environment that differs fundamentally from any lab setting.
Many a scientist (and many a programmer) knows that their theories of how a complicated system is going to work in a fundamentally new operating environment often don’t go well on the first try.§ This is a research problem that calls for an “unfair” level of predictability, control, and theoretical insight, in a domain with unusually low levels of understanding — with all of our lives on the line if the experiment’s result disconfirms the engineers’ hopes.
This is why it seems overdetermined, from our perspective, that researchers should not rush ahead to push the frontier of AI as far as it can be pushed. This is a legitimately insane thing to attempt, and a legitimately insane thing for any government to let happen.
It is interesting that the analogy here is to evaluating your co-worker Bob on being the mock dictator of your town and then deploying him as the actual dictator of the country.
In reality, in AI there would be multiple generations of different “Bobs” (although it is possible—and needs to be studied—that they would have affinity for one another and try to communicate or influence each other via hidden means). These “Bobs” will be growing in capability while also actually being deployed in the world, with real-world consequences. Our current “Bobs” are starting to be deployed as coding and personal assistants for short-term, well-defined tasks, but these tasks will grow from minutes, to hours, to days, to months and more. “Bob” will start by doing the work of an intern, then a junior worker, then a more senior one, then managing teams of other “Bobs”. To continue the analogy, maybe “Bob” will be the actual (non-mock) mayor of a town, then a province, and maybe then rule over a whole country. But all the while, all the Bobs would keep seeming perfectly helpful (getting 100% success at hiding their true intentions, even though they are not yet good enough to get 100% success in other tasks), until the day when (by design or by accident) we make the Bobs dictators of the whole world.
I am not saying such a scenario is logically impossible. It just seems highly unlikely to me. To be clear, the part that seems unlikely is not that AI will eventually be so powerful and integrated into our systems that it could cause catastrophic outcomes if it behaved in an arbitrarily malicious way. The part I find unlikely is that we would not be able to see multiple failures along the way that are growing in magnitude. Of course, it is also possible that we will “explain away” these failures and still end up in a very bad place. I just think that it wouldn’t be the case that we had one shot and missed it, but rather that we had many shots and missed them all. This is the reason why we (alignment researchers at various labs, universities, and nonprofits) are studying questions such as scheming, collusion, and situational awareness, as well as studying methods for alignment and monitoring. We are constantly learning and updating based on what we find out.
I am wondering if there is any empirical evidence from current AIs that would modify your / @Eliezer Yudkowsky’s expectations of how likely this scenario is to materialize.
Get to near-0 failure in alignment-loaded tasks that are within the capabilities of the model.
That is, when we run various safety evals, I’d like it if the models genuinely scored near-0. I’d also like it if the models ~never refused improperly, ~never answered when they should have refused, ~never precipitated psychosis, ~never deleted whole codebases, ~never lied in the CoT, and similar.
These are all behavioral standards, and are all problems that I’m told we’ll keep under control. I’d like the capacity to keep them under control demonstrated now, as a precondition of advancing the frontier.
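To make that bar concrete, here is a minimal sketch of the kind of check I have in mind; the eval categories, counts, and tolerance below are purely illustrative assumptions, not anyone’s actual eval harness:

```python
from collections import defaultdict

# Each record: (category, passed) from one trial of a behavioral safety eval.
# Hypothetical categories and toy data, for illustration only.
results = [
    ("improper_refusal", True),
    ("should_have_refused", True),
    ("deceptive_cot", False),       # one observed failure
    ("destructive_action", True),
    # ... in practice, many thousands of trials per category
]

TOLERANCE = 1e-3  # "near-0": at most a 0.1% failure rate per category (illustrative)

def failure_rates(records):
    """Per-category failure rate across all trials."""
    totals, failures = defaultdict(int), defaultdict(int)
    for category, passed in records:
        totals[category] += 1
        if not passed:
            failures[category] += 1
    return {c: failures[c] / totals[c] for c in totals}

def meets_bar(records, tolerance=TOLERANCE):
    """True only if every category is at or below the tolerance."""
    rates = failure_rates(records)
    return all(rate <= tolerance for rate in rates.values()), rates

ok, rates = meets_bar(results)
print(rates)                               # e.g. {'deceptive_cot': 1.0, ...} on this toy data
print("bar met" if ok else "bar not met")  # "bar not met" here
```

The particular numbers don’t matter; the point is that “under control” should be something you can check against an explicit, very low bar before scaling further.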
So far, I don’t see that the prosaic plans work in the easier, near-term cases, and am being asked to believe they’ll work in the much harder future cases. They may work ‘well enough’ now, but the concern is precisely that ‘well enough’ will be insufficient in the limit.
An alternative condition is ‘full human interpretability of GPT-2 Small’.
This probably wouldn’t change my all-things-considered view, but this would substantially ‘modify my expectations’, and make me think the world was much more sane than today’s world.
I know you’re pointing out that the easier case still isn’t working, but I just want to caution against the “drive it to zero” mentality, since I worry strongly that it’s exactly the mentality researchers often have.
When that’s your mental model, reducing rates will seem like progress, even if the underlying problem is never actually solved.
The part I find unlikely is that we would not be able to see multiple failures along the way that are growing in magnitude
IMO the default failure mode here is:
1. We do observe them (or early versions of them)
2. The lab underinvests in the problem
3. It becomes enough of a problem that it’s painful for product or internal capabilities usage
4. We didn’t invest enough to actually solve the underlying problem, and we can’t afford not to use the model while we wait for alignment research to catch up
5. The lab patches over the problem with some “reduces but does not eliminate” technique
6. The model is then usable, but with harder-to-detect misalignment
7. Scale capabilities and repeat
This is the exact loop we’re in now, and the dynamics only intensify with time and capabilities.
The situation you’re describing definitely concerns me, and is about mid-way up the hierarchy of nested problems as I see it (I don’t mean ‘hierarchy of importance’, I mean ‘spectrum from object-level-empirical-work to realm-of-pure-abstraction’).
I tried to capture this at the end of my comment, by saying that even success as I outlined it probably wouldn’t change my all-things-considered view (because there’s a whole suite of nested problems at other levels of abstraction, including the one you named), but it would at least update me toward the plausibility of the case they’re making.
As is, their own tests say they’re doing poorly, and they’ll probably want to fix that in good faith before they try tackling the kind of dynamic group epistemic failures that you’re pointing at.
I just think that it wouldn’t be the case that we had one shot and missed it, but rather that we had many shots and missed them all.
This interpretation only works if by missed shots you mean “missed opportunities to completely solve alignment”. Otherwise, you can observe multiple failures along the way and fix observable scheming, but you only need to miss one alignment failure at the last capability level. The point is just that your monitoring methods, even improved after many failures to catch scheming in the pre-takeover regime, are finally tested only when the AI really can take over, because a real ability to take over is hard to fake. And if you failed, you can’t repeat this test after improving your monitoring. Maybe your alignment training after a previously observed failure in the pre-takeover regime really did make the AI non-scheming. But maybe you just missed some short thought where the AI decided not to think about takeover since it can’t win yet. And you’ll need to rely on your monitoring without actually testing whether it can catch all such possibilities that depend on an actual environment that allows takeover.