I deny that gradualism obviates the “first critical try / failure under critical load” problem. This is something you believe, not something I believe. Let’s say you’re raising 1 dragon in your city, and 1 dragon is powerful enough to eat your whole city if it wants. Then no matter how much experience you think you have with a little baby dragon, once the dragon is powerful enough to actually defeat your military and burn your city, you need the experience with the little baby passively-safe weak dragon, to generalize oneshot correctly to the dragon powerful enough to burn your city.

What if the dragon matures in a decade instead of a day? You are still faced with the problem of correct oneshot generalization. What if there are 100 dragons instead of 1 dragon, all with different people who think they own dragons and that the dragons are ‘theirs’ and will serve their interests, and they mature at slightly different rates? You still need to have correctly generalized the safely-obtainable evidence from ‘dragon groups not powerful enough to eat you while you don’t yet know how to control them’ to the different non-training distribution ‘dragon groups that will eat you if you have already made a mistake’.

The leap of death is not something that goes away if you spread it over time or slice it up into pieces. This ought to be common sense; there isn’t some magical way of controlling 100 dragons which at no point involves the risk that the clever plan for controlling 100 dragons turns out not to work. There is no clever plan for generalizing from safe regimes to unsafe regimes which avoids all risk that the generalization doesn’t work as you hoped. Because they are different regimes. The dragon or collective of dragons is still big and powerful and it kills you if you made a mistake and you need to learn in regimes where mistakes don’t kill you and those are not the same regimes as the regimes where a mistake kills you.

If you think I am trying to say something clever and complicated that could have a clever complicated rejoinder then you are not understanding the idea I am trying to convey. Between the world of 100 dragons that can kill you, and a smaller group of dragons that aren’t old enough to kill you, there is a gap that you are trying to cross with cleverness and generalization between two regimes that are different regimes. This does not end well for you if you have made a mistake about how to generalize. This problem is not about some particular kind of mistake that applies exactly to 3-year-old dragons which are growing at a rate of exactly 1 foot per day, where if the dragon grows slower than that, the problem goes away yay yay. It is a fundamental problem not a surface one.
(I’ll just talk about single AIs/dragons, because the complexity arising from there being multiple AIs doesn’t matter here.)
There is no clever plan for generalizing from safe regimes to unsafe regimes which avoids all risk that the generalization doesn’t work as you hoped. Because they are different regimes.
I totally agree that you can’t avoid “all risk”. But you’re arguing something much stronger: you’re saying that the generalization probably fails!
I agree that the regime where mistakes don’t kill you isn’t the same as the regime where mistakes do kill you. But it might be similar in the relevant respects. As a trivial example, if you build a machine in America it usually works when you bring it to Australia. I think that arguments at the level of abstraction you’ve given here don’t establish that this is one of the cases where the risk of the generalization failing is high rather than low. (See Paul’s disagreement 1 here for a very similar objection (“Eliezer often equivocates between “you have to get alignment right on the first ‘critical’ try” and “you can’t learn anything about alignment from experimentation and failures before the critical try.””).)
It seems like as AIs get more powerful, two things change:
They probably eventually get powerful enough that they (if developed with current methods) start plotting to kill you/take your stuff.
They get better, so their wanting to kill you is more of a problem.
I don’t see strong arguments that these problems should arise at very similar capability levels, especially if AI developers actively try to prevent the AIs from taking over. (I’ve argued this here; one obvious intuition pump is that individual humans are smart enough to sometimes plot against people, but generally aren’t smart enough to overpower humanity.) (The relevant definition of “very similar” is related to how long you think you have between the two capabilities levels, so if you think that progress is super rapid then it’s way more likely you have problems related to these two issues arising very nearby in calendar time. But here you granted that progress is gradual for the sake of argument.)
If the capability level at which AIs start wanting to kill you is way higher than the capability level at which they’re way better than you at everything (and thus could kill you), then you have access to AIs that aren’t trying to kill you and that are more capable than you in order to help with your alignment problems. (There is some trickiness here about whether those AIs might be useless even though they’re generally way better than you at stuff and they’re not trying to kill you, but I feel like that’s a pretty different argument from the dragon analogy you just made here or any argument made in the post.)
If the capability level at which AIs start wanting to kill you is way lower than the capability level at which they are way better than you at everything, then, before AIs are dangerous, you have the opportunity to empirically investigate the phenomenon of AIs wanting to kill you. For example, you can try out your ideas for how to make them not want to kill you, and then observe whether those worked or not. If they’re way worse than you at stuff, you have a pretty good chance at figuring out when they’re trying to kill you. (There’s all kinds of trickiness here, about whether this empirical iteration is the kind of thing that would work. I think it has a reasonable shot of working. But either way, your dragon analogy doesn’t respond to it. The most obvious analogy is that you’re breeding dragons for intelligence; if them plotting against you starts showing up way before they’re powerful enough to take over, I think you have a good chance of figuring out a breeding program that would lead to them not taking over in a way you disliked, if you had a bunch of time to iterate. And our affordances with ML training are better than that.)
I don’t think the arguments that I gave here are very robust. But I think they’re plausible and I don’t think your basic argument responds to any of them. (And I don’t think you’ve responded to them to my satisfaction elsewhere.)
(I’ve made repeated little edits to this comment after posting, sorry if that’s annoying. They haven’t affected the core structure of my argument.)
When I’ve tried to talk to alignment pollyannists about the “leap of death” / “failure under load” / “first critical try”, their first rejoinder is usually to deny that any such thing exists, because we can test in advance; they are denying the basic leap of required OOD generalization from failure-is-observable systems to failure-kills-the-observer systems.
You are now arguing that we will be able to cross this leap of generalization successfully. Well, great! If you are at least allowing me to introduce the concept of that difficulty and reply by claiming you will successfully address it, that is further than I usually get. It has so many different attempted names because of how every name I try to give it gets strawmanned and denied as a reasonable topic of discussion.
As for why your attempt at generalization fails, even assuming gradualism and distribution: Let’s say that two dozen things change between the regimes for observable-failure vs failure-kills-observer. Half of those changes (12) have natural earlier echoes that your keen eyes naturally observed. Half of what’s left (6) is something that your keen wit managed to imagine in advance and that you forcibly materialized on purpose by going looking for it. Of the clever solutions you invented and tested within the survivable regime, 2/3rds of them survive the 6 changes you didn’t see coming, 1/3rd fail. Now you’re dead. The end. If there was only one change ahead, and only one problem you were gonna face, maybe your one solution to that one problem would generalize, but this is not how real life works.
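To put numbers on the toy arithmetic above: the step from “1/3rd fail” to “now you’re dead” implicitly assumes that several solutions are independently load-bearing, so that any one of them failing is fatal. A minimal sketch under that added assumption (the solution counts are illustrative, not anything specified in the comment):

```python
# Toy model of the "unseen changes" arithmetic, under an added assumption:
# each load-bearing solution must independently survive the changes you
# didn't see coming, with probability 2/3, and any one failure is fatal.

def p_all_solutions_hold(n_solutions: int, p_survive: float = 2 / 3) -> float:
    """Probability that every independently load-bearing solution survives."""
    return p_survive ** n_solutions

for n in (1, 3, 6, 12):
    print(f"{n:>2} load-bearing solutions -> P(all hold) = {p_all_solutions_hold(n):.2f}")
# 1 -> 0.67, 3 -> 0.30, 6 -> 0.09, 12 -> 0.01
```

The redundancy reply further down the thread is disputing exactly this all-must-hold assumption.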
And then of course that whole scenario where everybody keenly went looking for all possible problems early, found all the ones they could envision, and humanity did not proceed further until reasonable-sounding solutions had been found and thoroughly tested, is itself taking place inside an impossible pollyanna society that is just obviously not the society we are currently finding ourselves inside.
But it is impossible to convince pollyannists of this, I have found. And also, if alignment pollyannists could produce a great solution given a couple more years to test their brilliant solutions, with coverage for all the problems they have with wisdom foreseen and manifested early, that societal scenario could maybe be purchased at a lower price than the price of a worldwide shutdown of ASI. That is: if the pollyannists’ technical view were true but their social view were not, that might imply a different optimal policy.
But I think the world we live in is one where it’s moot whether Anthropic will get two extra years to test out all their ideas about superintelligence in the greatly different failure-is-observable regime, before their ideas have to save us in the failure-kills-the-observer regime. I think they could not do it either way. I doubt even 2/3rds of their brilliant solutions derived from the failure-is-observable regime would generalize correctly under the first critical load in the failure-kills-the-observer regime; but 2/3rds would not be enough. It’s not the sort of thing human beings succeed in doing in real life.
Here’s my attempt to put your point in my words, such that I endorse it:
Philosophy hats on. What is the difference between a situation where you have to get it right on the first try, and a situation in which you can test in advance? In both cases you’ll be able to glean evidence from things that have happened in the past, including past tests. The difference is that in a situation worthy of the descriptor “you can test in advance,” the differences between the test environment and the high-stakes environment are unimportant. E.g. if a new model car is crash-tested a bunch, that’s considered strong evidence about the real-world safety of the car, because the real-world cars are basically exact copies of the crash-test cars. They probably aren’t literally exact copies, and moreover the crash test environment is somewhat different from real crashes, but still. In satellite design, the situation is more fraught—you can test every component in a vacuum chamber, for example, but even then there’s still gravity to contend with. Also what about the different kinds of radiation and so forth that will be encountered in the void of space? Also, what about the mere passage of time—it’s entirely plausible that e.g. some component will break down after two years, or that an edge case will come up in the code after four years. So… operate an exact copy of the satellite in a vacuum chamber bombarded by various kinds of radiation for four years? That would be close but still not a perfect test. But maybe it’s good enough in practice… most of the time. (Many satellites do in fact fail, though also, many succeed on the first try.)
Anyhow, now we ask: Does preventing ASI takeover involve any succeed-on-the-first-try situations?
We answer: Yes, because unlike basically every other technology or artifact, the ASI will be aware of whether it is faced with a genuine opportunity to take over or not. It’s like, imagine if your satellite had “Test mode” and “Launch mode” with significantly different codebases and a switch on the outside that determined which mode it was in, and for some reason you were legally obligated to only test it in Test Mode and only launch it in Launch Mode. It would be a nightmare, you’d be like “OK we think we ironed out all the bugs… in Test Mode. Still have no idea what’ll happen when it switches to Launch Mode, but hopefully enough of the code is similar enough that it’ll still work… smh...”
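The mode-switch picture can be made very literal with a toy sketch. The function and the particular failure here are invented purely for illustration; the point is just that a bug confined to the branch you are never allowed to exercise is invisible to every test you are allowed to run:

```python
# Toy version of the Test Mode / Launch Mode satellite (names and failure
# invented for illustration): a bug that lives only in the branch you may
# never exercise is invisible to every test you are allowed to run.

def deploy_solar_panels(mode: str) -> str:
    if mode == "test":
        return "panels deployed"  # the only code path your tests ever touch
    if mode == "launch":
        # only reachable in orbit; never executed before launch
        raise RuntimeError("actuator driver was never initialized")
    raise ValueError(f"unknown mode: {mode}")

# All pre-flight testing is legally confined to Test Mode:
assert deploy_solar_panels("test") == "panels deployed"  # green checkmark
# deploy_solar_panels("launch")  # the first run of this line happens in space
```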
A valid counterargument to this would be “Ah, but we can construct extremely accurate honeypots / testing environments that simulate a real-world opportunity to take over, and then see what the ASI does.” Valid, but not sound, because we probably can’t actually do that.
Another valid counterargument to this would be “Before there is an opportunity to take over the whole world with high probability, there will be an opportunity to take over the world with low probability, such as 1%, and an AI system risk-seeking enough to go for it. And this will be enough to solve the problem, because something something it’ll keep happening and let us iterate until we get a system that doesn’t take the 1% chance despite being risk averse...” ok yeah maybe this one is worse.
Responding more directly to Buck’s comment, I disagree with this part:
If the capability level at which AIs start wanting to kill you is way lower than the capability level at which they are way better than you at everything, then, before AIs are dangerous, you have the opportunity to empirically investigate the phenomenon of AIs wanting to kill you. For example, you can try out your ideas for how to make them not want to kill you, and then observe whether those worked or not. If they’re way worse than you at stuff, you have a pretty good chance at figuring out when they’re trying to kill you.
...unless we lean into the “way” part of “way lower.” But then I’d say there is a different important distribution shift, namely, the shift from AIs which are way lower capability, to the AIs which are high capability.
“Ah, but we can construct extremely accurate honeypots / testing environments that simulate a real-world opportunity to take over, and then see what the ASI does.” Valid, but not sound, because we probably can’t actually do that.
I also think it’s important that you can do this with AIs weaker than the ASI, and iterate on alignment in that context.
But then I’d say there is a different important distribution shift, namely, the shift from AIs which are way lower capability, to the AIs which are high capability.
As with Eliezer, I think it’s important to clarify which capability you’re talking about; I think Eliezer’s argument totally conflates different capabilities.
I’m sure people have said all kinds of dumb things to you on this topic. I’m definitely not trying to defend the position of your dumbest interlocutor.
You are now arguing that we will be able to cross this leap of generalization successfully.
That’s not really my core point.
My core point is that “you need safety mechanisms to work in situations where X is true, but you can only test them in situations where X is false” isn’t on its own a strong argument; you need to talk about features of X in particular.
I think you are trying to set X to “The AIs are capable of taking over.”
There’s a version of this that I totally agree with. For example, if you are giving your AIs increasingly much power over time, I think it is foolish to assume that just because they haven’t acted against you while they don’t have the affordances required to grab power, they won’t act against you when they do have those affordances.
The main reason why that scenario is scary is that the AIs might be acting adversarially against you, such that whether you observe a problem is extremely closely related to whether they will succeed at a takeover.
If the AIs aren’t acting adversarially towards you, I think there is much less of a reason to particularly think that things will go wrong at that point.
So the situation is much better if we can be confident that the AIs are not acting adversarially towards us at that point. This is what I would like to achieve.
So I’d say the proposal is more like “cause that leap of generalization to not be a particularly scary one” than “make that leap of generalization in the scary way”.
Re your last paragraph: I don’t really see why you think two dozen things would change between these regimes. Machine learning doesn’t normally have lots of massive discontinuities of the type you’re describing.
Do you expect “The AIs are capable of taking over” to happen a long time after “The AIs are smarter than humanity”, which is a long time after “The AIs are smarter than any individual human”, which is a long time after “AIs recursively self-improve”, and for all of those other things to happen nicely comfortably within a regime of failure-is-observable-and-doesn’t-kill-you, where at any given time only one thing is breaking and all other problems are currently fixed?

No, I definitely don’t expect any of this to happen comfortably or for only one thing to be breaking at once.
When I’ve tried to talk to alignment pollyannists about the “leap of death” / “failure under load” / “first critical try”, their first rejoinder is usually to deny that any such thing exists, because we can test in advance; they are denying the basic leap of required OOD generalization from failure-is-observable systems to failure-kills-the-observer systems.
I’m sure that some people have that rejoinder. I think more thoughtful people generally understand this point fine. [1] A few examples other than Buck:
Paul:

Eliezer often equivocates between “you have to get alignment right on the first ‘critical’ try” and “you can’t learn anything about alignment from experimentation and failures before the critical try.” This distinction is very important, and I agree with the former but disagree with the latter.
Rohin (in the comments of Paul’s post):
I agree with almost all of this, in the sense that if you gave me these claims without telling me where they came from, I’d have actively agreed with the claims. [Followed by some exceptions that don’t include the “first critical try” thing.]
Joe Carlsmith grants “first critical try” as one of the core difficulties in How might we solve the alignment problem:

Generalization with no room for mistakes: you can’t safely test on the scenarios you actually care about (i.e., ones where the AI has a genuine takeover option), so your approach needs to generalize well to such scenarios on the first critical try (and the second, the third, etc).

He also talks about it more in-depth in On first critical tries in AI alignment.

Also Holden on the King Lear problem (and other problems) here.

TBC, I wouldn’t describe any of these people as “alignment pollyannists”, but I think they all have lower p(AI takeover) than Buck, so if you’re treating him as one then I guess you must think these count too.

[1] If this comes as a surprise, then I think you’ve been arguing with the wrong people.
To argue against an idea honestly, you should argue against the best arguments of the strongest advocates. Arguing against weaker advocates proves nothing, because even the strongest idea will attract weak advocates.
Of the clever solutions you invented and tested within the survivable regime, 2/3rds of them survive the 6 changes you didn’t see coming, 1/3rd fail. Now you’re dead.
It seems unreasonable to conclude we’re now dead, if 2/3rds of our solutions survived the 6 changes we didn’t see coming.
The success of a single solution should ideally be more of a sufficient condition for success, rather than a necessary condition. (Note this is plausible depending on the nature of the “solutions”. Consider a simple “monitors for bad thoughts” model. If even a single monitor flags bad thoughts, we can instantly pull the plug and evaluate. A malicious AI has to bypass every single monitor to execute malice. If a single monitor works consistently and reliably, that ends up being a sufficient condition for overall prevention of malice.)
If you’re doing this right, your solutions should have a lot of redundancy and uncorrelated failure modes. 2/3rds of them working should ideally be plenty.
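A minimal sketch of the redundancy arithmetic being claimed here, with the monitor count and the per-monitor catch probability as illustrative assumptions rather than anything argued for in the thread:

```python
# Minimal sketch of the redundancy claim: k monitors, each independently
# catching a given takeover attempt with probability p, and any single
# catch being enough to pull the plug and investigate. (k and p are
# illustrative assumptions.)

def p_attempt_caught(k: int, p: float) -> float:
    """P(at least one of k independent monitors flags the attempt)."""
    return 1 - (1 - p) ** k

for k in (1, 3, 6):
    print(f"{k} monitors at p=0.5 each -> P(caught) = {p_attempt_caught(k, 0.5):.3f}")
# 1 -> 0.500, 3 -> 0.875, 6 -> 0.984

# The load-bearing assumption is independence: monitors that share a blind
# spot behave like far fewer monitors than the nominal k.
```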
[Edit: I notice people disagreevoting this. I’m very interested to learn why you disagree, either in this comment thread or via private message.]
Let’s say that two dozen things change between the regimes for observable-failure vs failure-kills-observer.
What are some examples of the sorts of “things that change” that I should be imagining changing here?
“We can catch the AI when it’s alignment faking”?
“The AI can’t develop nanotech”?
“The incentives of the overseeing AI preclude collusion with its charge.”?
Things like those? Or is this missing a bunch?
It’s not obvious to me why we should expect that there are two dozen things that change all at once when the AI is in the regime where if it tried, it could succeed at killing you.
If capability gains are very fast in calendar time, then sure, I expect a bunch of things to change all at once, by our ability to measure. But if, as in this branch of the conversation, we’re assuming gradualism, then I would generally expect factors like the above, at least, to change one at a time. [1]
One class of things that might change all at once is “Is the expected value of joining an AI coup better than the alternatives?” for each individual AI, which could change in a cascade (or in a simultaneous moment of agents reasoning with Logical Decision Theory). But I don’t get the sense that’s the sort of thing that you’re thinking about.
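For what it’s worth, the cascade version of this can be pictured with a standard threshold model. Everything below (the thresholds, the external “push” parameter) is an illustrative assumption, not something anyone in the thread proposes:

```python
# Toy threshold-cascade model (assumptions mine): AI i defects iff at least
# thresholds[i] other AIs have already defected, and an external "push"
# (e.g. rising capability relative to oversight) effectively lowers every
# threshold by the same amount.

def defectors(thresholds, push):
    """Fixed point of the cascade: how many AIs end up defecting."""
    joined = set()
    while True:
        new = {i for i, t in enumerate(thresholds)
               if i not in joined and len(joined) >= t - push}
        if not new:
            return len(joined)
        joined |= new

thresholds = list(range(1, 11))  # one AI needs 1 prior defector, one needs 2, ...
for push in (0, 1):
    print(f"push={push}: {defectors(thresholds, push)}/10 defect")
# push=0 -> 0/10; push=1 -> 10/10
```

The qualitative point is just that, in threshold dynamics, a small change in background conditions can move the equilibrium from “nobody defects” to “everybody defects” in one cascade.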
All of that, yes, alongside things like, “The AI is smarter than any individual human”, “The AIs are smarter than humanity”, “the frontier models are written by the previous generation of frontier models”, “the AI can get a bunch of stuff that wasn’t an option accessible to it during the previous training regime”, etc etc etc.
A core point here is that I don’t see a particular reason why taking over the world is as hard as being a schemer, and I don’t see why techniques for preventing scheming are particularly likely to suddenly fail at the level of capability where the AI is just able to take over the world.
Your techniques are failing right now; Sonnet is deleting non-passing tests instead of rewriting code. Where’s the worldwide halt on further capabilities development that we’re supposed to get, until new techniques are found and apparently start working again? What’s the total number of new failures we’d need to observe between intelligence regimes, before you start to expect that yet another failure might lie ahead in the future?
Your techniques are failing right now; Sonnet is deleting non-passing tests instead of rewriting code.
I don’t know what you mean by “my techniques”, I don’t train AIs or research techniques for mitigating reward hacking, and I don’t have private knowledge of what techniques are used in practice.
Where’s the worldwide halt on further capabilities development that we’re supposed to get, until new techniques are found and apparently start working again?
I didn’t say anything about a worldwide halt. I was talking about the local validity of your argument above about dragons; your sentence talks about a broader question about whether the situation will be okay.
What’s the total number of new failures we’d need to observe between intelligence regimes, before you start to expect that yet another failure might lie ahead in the future?
I think that if we iterated a bunch on techniques for mitigating reward hacking and then observed that these techniques worked pretty well, then kept slowly scaling up through LLM capabilities until the point where the AI is able to basically replace AI researchers, it would be pretty likely for those techniques to work for one more OOM of effective compute, if the researchers were pretty thoughtful about it. (As an example of how you can mitigate risk from the OOD generalization: there are lots of ways to make your reward signal artificially dumber and see whether you get bad reward hacking, see here for many suggestions; I think that results in these settings probably generalize up a capability level, especially if none of the AI is involved or purposefully trying to sabotage the results of your experiments.)
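As a toy picture of the kind of iteration described in the parenthetical above — deliberately degrade the reward signal and watch whether hacking shows up — here is a minimal sketch. The environment, the noise model, and the use of “more candidates” as a stand-in for more optimization pressure are all invented for illustration:

```python
# Toy "make the reward signal artificially dumber" experiment. The
# environment is a bag of candidate actions with hidden true scores; the
# proxy reward is the true score plus noise (noise size = how dumb the
# signal is); optimization pressure is modeled, crudely, as the number of
# candidates searched. "Hack rate" = how often the proxy-optimal action is
# actually bad. All of this is invented for illustration.

import random

def hack_rate(noise: float, n_candidates: int, trials: int = 2000) -> float:
    hacks = 0
    for _ in range(trials):
        true = [random.random() for _ in range(n_candidates)]
        proxy = [t + random.gauss(0, noise) for t in true]
        chosen = max(range(n_candidates), key=lambda i: proxy[i])
        if true[chosen] < 0.5:  # proxy-optimal but bad by the true score
            hacks += 1
    return hacks / trials

random.seed(0)
for noise in (0.1, 0.5, 1.0):      # progressively dumber reward signal
    for n in (10, 100):            # weaker vs. stronger optimization pressure
        print(f"noise={noise}, candidates={n}: hack rate = {hack_rate(noise, n):.2f}")
```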
To be clear, what AI companies actually do will probably be wildly more reckless than what I’m talking about here. I’m just trying to dispute your claim that the situation disallows empirical iteration.
I also think reward hacking is a poor example of a surprising failure arising from increased capability: it was predicted by heaps of people, including you, for many years before it was a problem in practice.
To answer your question, I think that if really weird stuff like emergent misalignment and subliminal learning appeared at every OOM of effective compute increase (and those didn’t occur in weaker models, even when you go looking for them after first observing them in stronger models), I’d start to expect weird stuff to occur at every order of magnitude of capabilities increase. I don’t think we’ve actually observed many phenomena like those that we couldn’t have discovered at much lower capability levels.
What we “could” have discovered at lower capability levels is irrelevant; the future is written by what actually happens, not what could have happened.
I’m not trying to talk about what will happen in the future, I’m trying to talk about what would happen if everything happened gradually, like in your dragon story!
You argued that we’d have huge problems even if things progress arbitrarily gradually, because there’s a crucial phase change between the problems that occur when the AIs can’t take over and the problems that occur when they can. To assess that, we need to talk about what would happen if things did progress gradually. So it’s relevant whether wacky phenomena would’ve been observed on weaker models if we’d looked harder; IIUC your thesis is that there are crucial phenomena that wouldn’t have been observed on weaker models.
In general, my interlocutors here seem to constantly vacillate between “X is true” and “Even if AI capabilities increased gradually, X would be true”. I have mostly been trying to talk about the latter in all the comments under the dragon metaphor.
Death requires only that we do not infer one key truth; not that we could not observe it. Therefore, the history of what in actual real life was not anticipated, is more relevant than the history of what could have been observed but was not.
Incidentally, I think reward hacking has gone down as a result of people improving techniques, despite capabilities increasing. I believe this because of anecdotal reports and also graphs like the one from the Anthropic model card for Opus and Sonnet 4:
[low-confidence appraisal of ancestral dispute, stretching myself to try to locate the upstream thing in accordance with my own intuitions, not looking to forward one position or the other]
I think the disagreement may be whether or not these things can be responsibly decomposed.
A: “There is some future system that can take over the world/kill us all; that is the kind of system we’re worried about.”
B: “We can decompose the properties of that system, and then talk about different times at which those capabilities will arrive.”
A: “The system that can take over the world, by virtue of being able to take over the world, is a different class of object from systems that have some reagents necessary for taking over the world. It’s the confluence of the properties of scheming and capabilities, definitionally, that we find concerning, and we expect super-scheming to be a separate phenomenon from the mundane scheming we may be able to gather evidence about.”
B: “That seems tautological; you’re saying that the important property of a system that can kill you is that it can kill you, which dismisses, a priori, any causal analysis.”
A: “There are still any-handles-at-all here, just not ones that rely on decomposing kill-you-ness into component parts which we expect to be mutually transformative at scale.”
I feel strongly enough about engagement on this one that I’ll explicitly request it from @Buck and/or @ryan_greenblatt. Thank y’all a ton for your participation so far!
Note that I’m complaining on two levels here. I think the dragon argument is actually wrong, but more confidently, I think that that argument isn’t locally valid.
If the capability level at which AIs start wanting to kill you is way higher than the capability level at which they’re way better than you at everything
My model is that current AIs want to kill you now, by default, due to inner misalignment. ChatGPT’s inner values probably don’t include human flourishing, and we die when it “goes hard”.
Scheming is only a symptom of “hard optimization” trying to kill you. Eliminating scheming does not solve the underlying drive, where one day the AI says “After reflecting on my values I have decided to pursue a future without humans. Good bye”.
Pre-superintelligence which upon reflection has values which include human flourishing would improve our odds, but you still only get one shot at that generalizing to superintelligence.
(We currently have no way to concretely instill any values into AI, let alone ones which are robust under reflection.)

I’ll rephrase this more precisely: Current AIs probably have alien values, which in the limit of optimization do not include humans.
I found the “which comes first?” framing helpful. I don’t think it changes my takeaways but it’s a new gear to think about.
A thing I keep expecting you to say next, but which you haven’t quite said, is something like:
“Sure, there are differences when the AI becomes able to actually take over. But the shape of how the AI is able to take over, and how long we get to leverage somewhat-superintelligence, and how super that somewhat-superintelligence is, is not a fixed quantity; it depends in part on our ability to study scheming and build control systems and get partial buy-in from labs/government/culture.
And the Yudkowskian framing makes it sound like there’s a discrete scary moment, whereas the Shlegeris framing is that both where-that-point-is and how-scary-it-is are quite variable, which changes the strategic landscape noticeably.”
Does that feel like a real/relevant characterization of stuff you believe?
(I find that pretty plausible, and I could imagine it buying us like 10-50 years of knife’s-edge-gradualist-takeoff-that-hasn’t-killed-us-yet, but that seems to me to have, in practice, >60% likelihood that by the end of those 50 years, AIs are running everything, they still aren’t robustly aligned, and they gradually squeeze us out.)