I totally agree that lots of people seem to think that superintelligence is impossible, and this leads them to massively underrate risk from AI, especially AI takeover.
Suppose that I simply agree. Should we re-write the paragraph to say something like “AI systems routinely outperform humans in narrow domains. When AIs become at all competitive with human professionals on a given task, humans usually cease to be able to compete within just a handful of years. It would be unexpected if this pattern suddenly stopped applying for all the tasks that AI can’t yet compete with human professionals on.”? Do you agree that the core point would remain, if we did that rewrite?
I think that that rewrite substantially complicates the argument for AI takeover. If AIs that are about as good as humans at broad skills (e.g. software engineering, ML research, computer security, all remote jobs) exist for several years before AIs that are wildly superhuman, then the development of wildly superhuman AIs occurs in a world that is crucially different from ours, because it has those human-level-ish AIs. This matters in several ways:
Broadly, it makes it much harder to predict how things will go, because it means ASI will arrive in a world less like today’s world.
It will be way more obvious that AI is a huge deal. (That is, human-level AI might be a fire alarm for ASI.)
Access to human-level AI massively changes your available options for handling misalignment risk from superintelligence.
You can maybe do lots of R&D with those human-level AIs, which might let you make a lot of progress on alignment and other research directions.
You can study them, perhaps allowing you to empirically investigate when egregious misalignment occurs and how to jankily iterate against its occurrence.
You can maybe use those human-level AIs to secure yourself against superintelligence (e.g. controlling them; an important special case is using human-level AIs for tasks that you don’t trust superintelligences to do).
There’s probably misalignment risk from those human-level AIs, and unlike crazy superintelligence, those AIs can probably be controlled, and this risk should maybe be addressed.
I think that this leads to me having a pretty substantially different picture from the MIRI folk about what should be done, and also makes me feel like the MIRI story is importantly implausible in a way that seems bad from a “communicate accurately” perspective.
I think it’s maybe additionally bad to exaggerate on this particular point, because it’s the particular point that other AI safety people most disagree with you on, and that most leads to MIRI’s skepticism of their approach to mitigating these risks!
(Though my bottom line isn’t that different—I think AI takeover is like 35% likely.)
I appreciate your point about this being a particularly bad place to exaggerate, given that it’s a cruxy point of divergence with our closest allies. This makes me update harder towards the need for a rewrite.
I’m not really sure how to respond to the body of your comment, though. Like, I think we basically agree on most major points. We agree that the failure mode the relevant text of The Problem is highlighting is real and important. We agree that doing Control research is important, and that if things are slow/gradual, this gives it a better chance of working. And I think we agree that it might end up being too fast and sloppy to actually save us. I’m more pessimistic about the plan of “use the critical window of opportunity to make scientific breakthroughs that save the day” but I’m not sure that matters? Like, does “we’ll have a 3 year window of working on near-human AGIs before they’re obviously superintelligent” change the take-away?
I’m also worried that we’re diverging from the question of whether the relevant bit of source text is false. Not sure what to do about that, but I thought I’d flag it.
I see this post as trying to argue that the thesis “if smarter-than-human AI is developed this decade, the result will be an unprecedented catastrophe” is true with reasonably high confidence, and for a (less emphasized) thesis that the best/only intervention is not building ASI for a long time: “The main way we see to avoid this catastrophic outcome is to not build ASI at all, at minimum until a scientific consensus exists that we can do so without destroying ourselves.”
I think that disagreements about takeoff speeds are part of why I disagree with these claims and that the post effectively leans on very fast takeoff speeds in its overall perspective. Correspondingly, it seems important to not make locally invalid arguments about takeoff speeds: these invalid arguments do alter the takeaway from my perspective.
If the post was trying to argue for a weaker takeaway of “AI seems extremely dangerous and like it poses very large risks and our survival seems uncertain” or it more clearly discussed why some (IMO reasonable) people are more optimistic (and why MIRI disagrees), I’d be less critical.
Like, does “we’ll have a 3 year window of working on near-human AGIs before they’re obviously superintelligent” change the take-away?
I think that a three-year window makes it way more complicated to analyze whether AI takeover is likely. And after doing that analysis, I think it looks 3x less likely.
I think the crux for me in these situations is “do you think it’s more valuable to increase our odds of survival on the margin in the three-year window worlds or to try to steer toward the pause worlds, and how confident are you there?” Modeling the space between here and ASI just feels like a domain with a pretty low confidence ceiling. This consideration is similar to the intuition that leads MIRI to talk about ‘default outcomes’. I find reading things that make guesses at the shape of this space interesting, but not especially edifying.
I guess I’m trying to flip the script a bit here: from my perspective, it doesn’t look like MIRI is too confident in doom; it looks like make-the-most-of-the-window people are too confident in the shape of the window as they’ve predicted it, and end up finding themselves on the side of downplaying the insanity of the risk, not because they don’t think risk levels are insanely high, but because they think there are various edge case scenarios / moonshots that, in sum, significantly reduce risk in expectation. But all of those stories look totally wild to me, and it’s extremely difficult to see the mechanisms by which they might come to pass (e.g. AI for epistemics, transformative interpretability, pause-but-just-on-the-brink-of-ASI, the AIs are kinda nice but not too nice and keep some people in a zoo, they don’t fuck with earth because it’s a rounding error on total resources in the cosmos, ARC pulls through, weakly superhuman AIs solve alignment, etc etc). I agree each of these has a non-zero chance of working, but their failures seem correlated to me such that I don’t compound my odds of each in making estimates (which I’m admittedly not especially experienced at, anyway).
Like, it’s a strange experience to hold a fringe view, to spend hundreds of hours figuring out how to share that view, and then be nitpicked to death because not enough air was left in the room for an infinite sea of sub-cases of the fringe view, when leaving enough air in the room for them could require pushing the public out entirely, or undercut the message by priming the audience to expect that some geniuses off screen in fact just have this figured out and they don’t need to worry about it (a la technological solutions to climate, pandemic response, asteroid impact, numerous other large-scale risks).
I agree on the object level point with respect to this particular sentence as Ryan first lodged it; I don’t agree with the stronger mandate of ‘but what about my crazy take?’ (I don’t mean to be demeaning by this; outside view, we are all completely insane over here). In particular, many of the other views we’re expected by others to make space for undercut the characterization of the risk unduly.
Forceful rhetoric is called for in our current situation (to the extent that it doesn’t undermine credibility or propagate untruth, which I agree may be happening in this object-level case).
To be clear: even by my relatively narrow/critical appraisal, I support Redwood’s work and think y’all are really good at laying out the strategic case for what you’re doing, what it is, and what it isn’t. I just wish you didn’t have to do it, because I wish we weren’t rolling the dice like this instead of waiting for more principled solutions. (side note: I am actually fine with the worlds in which a more principled solution, and its corresponding abundance, never arrives, which is a major difference between my view and yours, as well as between my view and most at MIRI)
I don’t really buy this “doom is clearly the default” frame. I’m not sure how important this is, but I thought I would express my perspective.
But all of those stories look totally wild to me, and it’s extremely difficult to see the mechanisms by which they might come to pass
A reasonable fraction of my non-doom worlds look like:
AIs don’t end up scheming (as in, in the vast majority of contexts) until somewhat after the point where AIs dominate top human experts at ~everything because scheming ends up being unnatural in the relevant paradigm (after moderate status quo iteration). I guess I put around 60% on this.
We have a decent amount of time at roughly this level of capability and people use these AIs to do a ton of stuff. People figure out how to get these AIs to do decent-ish conceptual research and then hand off alignment work to these systems. (Perhaps because there was a decent amount of transfer from behavioral training on other things to actually trying at conceptual research and doing a decent job.) People also get advice from these systems. This goes fine given the amount of time and only a modest amount of effort, and we end up in an “AIs work on furthering alignment” attractor basin.
In aggregate, I guess something like this conjunction is maybe 35% likely. (There are other sources of risk which can still occur in these worlds to be clear, like humanity collectively going crazy.) And, then you get another fraction of mass from things which are weaker than the first or weaker than the second and which require somewhat more effort on the part of humanity.
So, from my perspective “early-ish alignment was basically fine and handing off work to AIs was basically fine” is the plurality scenario and feels kinda like the default? Or at least it feels more like a coin toss.
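For what it’s worth, a quick back-solve on those numbers, reading the 35% as the probability of the full conjunction of the two legs above (that reading, and the back-solved conditional, are my inference rather than anything stated explicitly):

```latex
% A = "no scheming until somewhat after AIs dominate top human experts" (~60%)
% B = "decent amount of time, handoff and iteration go fine" (conditional on A)
% Back-solving the implied conditional from the stated ~35%:
\[
P(A \wedge B) = P(A)\,P(B \mid A)
\quad\Longrightarrow\quad
P(B \mid A) \approx \frac{0.35}{0.6} \approx 0.58
\]
```

i.e. the second leg is implicitly being given a bit under 60% conditional on the first.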
AIs don’t end up scheming (as in, in the vast majority of contexts) until somewhat after the point where AIs dominate top human experts at ~everything because scheming ends up being unnatural in the relevant paradigm (after moderate status quo iteration). I guess I put around 60% on this.
I would love to read an elucidation of what leads you to think this.
I think the crux for me in these situations is “do you think it’s more valuable to increase our odds of survival on the margin in the three-year window worlds or to try to steer toward the pause worlds, and how confident are you there?”
FWIW, that’s not my crux at all. The problem I have with this post implicitly assuming really fast takeoffs isn’t that it leads to bad recommendations about what to do (though I do think that to some extent). My problem is that the arguments are missing steps that I think are really important, and so they’re (kind of) invalid.[1]
That is, suppose I agreed with you that it was extremely unlikely that humanity would be able to resolve the issue even given two years with human-level-ish AIs. And suppose that we were very likely to have those two years. I still think it would be bad to make an argument that doesn’t mention those two years, because those two years seem to me to change the natural description of the situation a lot, and I think they are above the bar of details worth including. This is especially true because a lot of people’s disagreement with you (including many people in the relevant audience of “thoughtful people who will opine on your book”) does actually come down to whether those two years make the situation okay.
[1] I’m saying “kind of invalid” because you aren’t making an argument that’s shaped like a proof. You’re making an argument that is more like a heuristic argument, where you aren’t including all the details and you’re implicitly asserting that those details don’t change the bottom line. (Obviously, you have no other option than doing this because this is a complicated situation and you have space limitations.) In cases where there is a counterargument that you think is defeated by a countercounterargument, it’s up to you to decide whether that deserves to be included. I think this one does deserve to be included.
It’s only a problem to ‘assume fast takeoffs’ if you recognize the takeoff distinction in the first place / expect it to be action relevant, which you do, and I, so far, don’t. Introducing the takeoff distinction to Buck’s satisfaction just to say ”...and we think those people are wrong and both cases probably just look the same actually” is a waste of time in a brief explainer.
What you consider the natural frame depends on conclusions you’ve drawn up to this point; that’s not the same thing as the piece being fundamentally unsound or dishonest because it doesn’t proactively make space for your particular conclusions.
Takeoff speeds are the most immediate objection for Buck, and I agree there should be a place (and soon may be, if you’re down to help and all goes well) where this and other Buck-objections are addressed. But it’s not among the top objections of the target audience.
I’m only getting into this more because I am finding it interesting, feel free to tap out. I’m going to be a little sloppy for the sake of saving time.
I’m going to summarize your comment like this, maybe you think this is unfair:
It’s only a problem to ‘assume fast takeoffs’ if you [...] expect it to be action relevant, which [...] I, so far, don’t.
I disagree about this general point.
Like, suppose you were worried about the USA being invaded on either the east or west coast, and you didn’t have a strong opinion on which it was, and you don’t think it matters for your recommended intervention of increasing the size of the US military or for your prognosis. I think it would be a problem to describe the issue by saying that America will be invaded on the East Coast, because you’re giving a poor description of what you think will happen, which makes it harder for other people to assess your arguments.
There’s something similar here. You’re trying to tell a story for AI development leading to doom. You think that the story goes through regardless of whether the AI becomes rapidly superhuman or gradually superhuman. Then you tell a story where the AI becomes rapidly superhuman. I think this is a problem, because it isn’t describing some features of the situation that seem very important to the common-sense picture of the situation, even if they don’t change the bottom line.
It seems reasonable for your response to be, “but I don’t think that those gradual takeoffs are plausible”. In that case, we disagree on the object level, but I have no problem with the comms. But if you think the gradual takeoffs are plausible, I think it’s important for your writing to not implicitly disregard them.
All of this is kind of subjective because which features of a situation are interesting is subjective.
I don’t think this is an unfair summary (although I may be missing something).
I don’t like the east/west coast analogy. I think it’s more like “we’re shouting about being invaded, and talking about how bad that could be” and you’re saying “why aren’t you acknowledging that the situation is moderately better if they attack the west coast?” To which I reply “Well, it’s not clear to me that it is in fact better, and my target audience doesn’t know any geography, anyway.”
I think most of the points in the post are more immediately compatible with fast takeoff, but also go through for slow takeoff scenarios (70 percent confident here; it wasn’t a filter I was applying when editing, and I’m not sure it’s a filter that I’d be ‘allowed’ to apply when editing, although it was not explicitly disallowed). This isn’t that strong a claim, and I acknowledge that on your view it’s problematic that I can’t say something stronger.
I think that your audience would actually understand the difference between “there are human level AIs for a few years, and it’s obvious to everyone (especially AI company employees) that this is happening” and “superintelligent AI arises suddenly”.
I think most of the points in the post are more immediately compatible with fast takeoff, but also go through for slow takeoff scenarios
As an example of one that doesn’t, “Many alignment problems relevant to superintelligence don’t naturally appear at lower, passively safe levels of capability. This puts us in the position of needing to solve many problems on the first critical try, with little time to iterate and no prior experience solving the problem on weaker systems.”
I deny that gradualism obviates the “first critical try / failure under critical load” problem. This is something you believe, not something I believe.

Let’s say you’re raising 1 dragon in your city, and 1 dragon is powerful enough to eat your whole city if it wants. Then no matter how much experience you think you have with a little baby dragon, once the dragon is powerful enough to actually defeat your military and burn your city, you need the experience with the little baby passively-safe weak dragon, to generalize oneshot correctly to the dragon powerful enough to burn your city. What if the dragon matures in a decade instead of a day? You are still faced with the problem of correct oneshot generalization. What if there are 100 dragons instead of 1 dragon, all with different people who think they own dragons and that the dragons are ‘theirs’ and will serve their interests, and they mature at slightly different rates? You still need to have correctly generalized the safely-obtainable evidence from ‘dragon groups not powerful enough to eat you while you don’t yet know how to control them’ to the different non-training distribution ‘dragon groups that will eat you if you have already made a mistake’.

The leap of death is not something that goes away if you spread it over time or slice it up into pieces. This ought to be common sense; there isn’t some magical way of controlling 100 dragons which at no point involves the risk that the clever plan for controlling 100 dragons turns out not to work. There is no clever plan for generalizing from safe regimes to unsafe regimes which avoids all risk that the generalization doesn’t work as you hoped. Because they are different regimes. The dragon or collective of dragons is still big and powerful and it kills you if you made a mistake and you need to learn in regimes where mistakes don’t kill you and those are not the same regimes as the regimes where a mistake kills you.

If you think I am trying to say something clever and complicated that could have a clever complicated rejoinder then you are not understanding the idea I am trying to convey. Between the world of 100 dragons that can kill you, and a smaller group of dragons that aren’t old enough to kill you, there is a gap that you are trying to cross with cleverness and generalization between two regimes that are different regimes. This does not end well for you if you have made a mistake about how to generalize. This problem is not about some particular kind of mistake that applies exactly to 3-year-old dragons which are growing at a rate of exactly 1 foot per day, where if the dragon grows slower than that, the problem goes away yay yay. It is a fundamental problem not a surface one.
(I’ll just talk about single AIs/dragons, because the complexity arising from there being multiple AIs doesn’t matter here.)
There is no clever plan for generalizing from safe regimes to unsafe regimes which avoids all risk that the generalization doesn’t work as you hoped. Because they are different regimes.
I totally agree that you can’t avoid “all risk”. But you’re arguing something much stronger: you’re saying that the generalization probably fails!
I agree that the regime where mistakes don’t kill you isn’t the same as the regime where mistakes do kill you. But it might be similar in the relevant respects. As a trivial example, if you build a machine in America it usually works when you bring it to Australia. I think that arguments at the level of abstraction you’ve given here don’t establish that this is one of the cases where the risk of the generalization failing is high rather than low. (See Paul’s disagreement 1 here for a very similar objection (“Eliezer often equivocates between “you have to get alignment right on the first ‘critical’ try” and “you can’t learn anything about alignment from experimentation and failures before the critical try.””).)
It seems like as AIs get more powerful, two things change:
They probably eventually get powerful enough that they (if developed with current methods) start plotting to kill you/take your stuff.
They get better, so their wanting to kill you is more of a problem.
I don’t see strong arguments that these problems should arise at very similar capability levels, especially if AI developers actively try to prevent the AIs from taking over. (I’ve argued this here; one obvious intuition pump is that individual humans are smart enough to sometimes plot against people, but generally aren’t smart enough to overpower humanity.) (The relevant definition of “very similar” is related to how long you think you have between the two capabilities levels, so if you think that progress is super rapid then it’s way more likely you have problems related to these two issues arising very nearby in calendar time. But here you granted that progress is gradual for the sake of argument.)
If the capability level at which AIs start wanting to kill you is way higher than the capability level at which they’re way better than you at everything (and thus could kill you), then you have access to AIs that aren’t trying to kill you and that are more capable than you in order to help with your alignment problems. (There is some trickiness here about whether those AIs might be useless even though they’re generally way better than you at stuff and they’re not trying to kill you, but I feel like that’s a pretty different argument from the dragon analogy you just made here or any argument made in the post.)
If the capability level at which AIs start wanting to kill you is way lower than the capability level at which they are way better than you at everything, then, before AIs are dangerous, you have the opportunity to empirically investigate the phenomenon of AIs wanting to kill you. For example, you can try out your ideas for how to make them not want to kill you, and then observe whether those worked or not. If they’re way worse than you at stuff, you have a pretty good chance at figuring out when they’re trying to kill you. (There’s all kinds of trickiness here, about whether this empirical iteration is the kind of thing that would work. I think it has a reasonable shot of working. But either way, your dragon analogy doesn’t respond to it. The most obvious analogy is that you’re breeding dragons for intelligence; if them plotting against you starts showing up way before they’re powerful enough to take over, I think you have a good chance of figuring out a breeding program that would lead to them not taking over in a way you disliked, if you had a bunch of time to iterate. And our affordances with ML training are better than that.)
I don’t think the arguments that I gave here are very robust. But I think they’re plausible and I don’t think your basic argument responds to any of them. (And I don’t think you’ve responded to them to my satisfaction elsewhere.)
(I’ve made repeated little edits to this comment after posting, sorry if that’s annoying. They haven’t affected the core structure of my argument.)
When I’ve tried to talk to alignment pollyannists about the “leap of death” / “failure under load” / “first critical try”, their first rejoinder is usually to deny that any such thing exists, because we can test in advance; they are denying the basic leap of required OOD generalization from failure-is-observable systems to failure-kills-the-observer systems.
You are now arguing that we will be able to cross this leap of generalization successfully. Well, great! If you are at least allowing me to introduce the concept of that difficulty and reply by claiming you will successfully address it, that is further than I usually get. It has so many different attempted names because of how every name I try to give it gets strawmanned and denied as a reasonable topic of discussion.
As for why your attempt at generalization fails, even assuming gradualism and distribution: Let’s say that two dozen things change between the regimes for observable-failure vs failure-kills-observer. Half of those changes (12) have natural earlier echoes that your keen eyes naturally observed. Half of what’s left (6) is something that your keen wit managed to imagine in advance and that you forcibly materialized on purpose by going looking for it. Of the clever solutions you invented and tested within the survivable regime, 2/3rds of them survive the 6 changes you didn’t see coming, 1/3rd fail. Now you’re dead. The end. If there was only one change ahead, and only one problem you were gonna face, maybe your one solution to that one problem would generalize, but this is not how real life works.
And then of course that whole scenario where everybody keenly went looking for all possible problems early, found all the ones they could envision, and humanity did not proceed further until reasonable-sounding solutions had been found and thoroughly tested, is itself taking place inside an impossible pollyanna society that is just obviously not the society we are currently finding ourselves inside.
But it is impossible to convince pollyannists of this, I have found. And also if alignment pollyannists could produce a great solution given a couple more years to test their brilliant solutions with coverage for all the problems they have with wisdom foreseen and manifested early, that societal scenario could maybe be purchased at a lower price than the price of worldwide shutdown of ASI. That is: the pollyannist technical view being true, but not their social view, might imply a different optimal policy.
But I think the world we live in is one where it’s moot whether Anthropic will get two extra years to test out all their ideas about superintelligence in the greatly different failure-is-observable regime, before their ideas have to save us in the failure-kills-the-observer regime. I think they could not do it either way. I doubt even 2/3rds of their brilliant solutions derived from the failure-is-observable regime would generalize correctly under the first critical load in the failure-kills-the-observer regime; but 2/3rds would not be enough. It’s not the sort of thing human beings succeed in doing in real life.
Here’s my attempt to put your point in my words, such that I endorse it:
Philosophy hats on. What is the difference between a situation where you have to get it right on the first try, and a situation in which you can test in advance? In both cases you’ll be able to glean evidence from things that have happened in the past, including past tests. The difference is that in a situation worthy of the descriptor “you can test in advance,” the differences between the test environment and the high-stakes environment are unimportant. E.g. if a new model car is crash-tested a bunch, that’s considered strong evidence about the real-world safety of the car, because the real-world cars are basically exact copies of the crash-test cars. They probably aren’t literally exact copies, and moreover the crash test environment is somewhat different from real crashes, but still. In satellite design, the situation is more fraught—you can test every component in a vacuum chamber, for example, but even then there’s still gravity to contend with. Also what about the different kinds of radiation and so forth that will be encountered in the void of space? Also, what about the mere passage of time—it’s entirely plausible that e.g. some component will break down after two years, or that an edge case will come up in the code after four years. So… operate an exact copy of the satellite in a vacuum chamber bombarded by various kinds of radiation for four years? That would be close but still not a perfect test. But maybe it’s good enough in practice… most of the time. (Many satellites do in fact fail, though also, many succeed on the first try.)
Anyhow, now we ask: Does preventing ASI takeover involve any succeed-on-the-first-try situations?
We answer: Yes, because unlike basically every other technology or artifact, the ASI will be aware of whether it is faced with a genuine opportunity to take over or not. It’s like, imagine if your satellite had “Test mode” and “Launch mode” with significantly different codebases and a switch on the outside that determined which mode it was in, and for some reason you were legally obligated to only test it in Test Mode and only launch it in Launch Mode. It would be a nightmare, you’d be like “OK we think we ironed out all the bugs… in Test Mode. Still have no idea what’ll happen when it switches to Real Mode, but hopefully enough of the code is similar enough that it’ll still work… smh...”
A valid counterargument to this would be “Ah, but we can construct extremely accurate honeypots / testing environments that simulate a real-world opportunity to take over, and then see what the ASI does.” Valid, but not sound, because we probably can’t actually do that.
Another valid counterargument to this would be “Before there is an opportunity to take over the whole world with high probability, there will be an opportunity to take over the world with low probability, such as 1%, and an AI system risk-seeking enough to go for it. And this will be enough to solve the problem, because something something it’ll keep happening and let us iterate until we get a system that doesn’t take the 1% chance despite being risk averse...” ok yeah maybe this one is worse.
Responding more directly to Buck’s comment, I disagree with this part:
If the capability level at which AIs start wanting to kill you is way lower than the capability level at which they are way better than you at everything, then, before AIs are dangerous, you have the opportunity to empirically investigate the phenomenon of AIs wanting to kill you. For example, you can try out your ideas for how to make them not want to kill you, and then observe whether those worked or not. If they’re way worse than you at stuff, you have a pretty good chance at figuring out when they’re trying to kill you.
...unless we lean into the “way” part of “way lower.” But then I’d say there is a different important distribution shift, namely, the shift from AIs which are way lower capability, to the AIs which are high capability.
“Ah, but we can construct extremely accurate honeypots / testing environments that simulate a real-world opportunity to take over, and then see what the ASI does.” Valid, but not sound, because we probably can’t actually do that.
I also think it’s important that you can do this with AIs weaker than the ASI, and iterate on alignment in that context.
But then I’d say there is a different important distribution shift, namely, the shift from AIs which are way lower capability, to the AIs which are high capability.
As with Eliezer, I think it’s important to clarify which capability you’re talking about; I think Eliezer’s argument totally conflates different capabilities.
I’m sure people have said all kinds of dumb things to you on this topic. I’m definitely not trying to defend the position of your dumbest interlocutor.
You are now arguing that we will be able to cross this leap of generalization successfully.
That’s not really my core point.
My core point is that “you need safety mechanisms to work in situations where X is true, but you can only test them in situations where X is false” isn’t on its own a strong argument; you need to talk about features of X in particular.
I think you are trying to set X to “The AIs are capable of taking over.”
There’s a version of this that I totally agree with. For example, if you are giving your AIs increasingly much power over time, I think it is foolish to assume that just because they haven’t acted against you while they don’t have the affordances required to grab power, they won’t act against you when they do have those affordances.
The main reason why that scenario is scary is that the AIs might be acting adversarially against you, such that whether you observe a problem is extremely closely related to whether they will succeed at a takeover.
If the AIs aren’t acting adversarially towards you, I think there is much less of a reason to particularly think that things will go wrong at that point.
So the situation is much better if we can be confident that the AIs are not acting adversarially towards us at that point. This is what I would like to achieve.
So I’d say the proposal is more like “cause that leap of generalization to not be a particularly scary one” than “make that leap of generalization in the scary way”.
Re your last paragraph: I don’t really see why you think two dozen things would change between these regimes. Machine learning doesn’t normally have lots of massive discontinuities of the type you’re describing.
Do you expect “The AIs are capable of taking over” to happen a long time after “The AIs are smarter than humanity”, which is a long time after “The AIs are smarter than any individual human”, which is a long time after “AIs recursively self-improve”, and for all of those other things to happen nicely comfortably within a regime of failure-is-observable-and-doesn’t-kill-you, where at any given time only one thing is breaking and all other problems are currently fixed?
When I’ve tried to talk to alignment pollyannists about the “leap of death” / “failure under load” / “first critical try”, their first rejoinder is usually to deny that any such thing exists, because we can test in advance; they are denying the basic leap of required OOD generalization from failure-is-observable systems to failure-kills-the-observer systems.
I’m sure that some people have that rejoinder. I think more thoughtful people generally understand this point fine. [1] A few examples other than Buck:
Eliezer often equivocates between “you have to get alignment right on the first ‘critical’ try” and “you can’t learn anything about alignment from experimentation and failures before the critical try.” This distinction is very important, and I agree with the former but disagree with the latter.
Rohin (in the comments of Paul’s post):
I agree with almost all of this, in the sense that if you gave me these claims without telling me where they came from, I’d have actively agreed with the claims. [Followed by some exceptions that don’t include the “first critical try” thing.]
Generalization with no room for mistakes: you can’t safely test on the scenarios you actually care about (i.e., ones where the AI has a genuine takeover option), so your approach needs to generalize well to such scenarios on the first critical try (and the second, the third, etc).
TBC, I wouldn’t describe any of these people as “alignment pollyannists”, but I think they all have lower p(AI takeover) than Buck, so if you’re treating him as one then I guess you must think these count too.
To argue against an idea honestly, you should argue against the best arguments of the strongest advocates. Arguing against weaker advocates proves nothing, because even the strongest idea will attract weak advocates.
Of the clever solutions you invented and tested within the survivable regime, 2/3rds of them survive the 6 changes you didn’t see coming, 1/3rd fail. Now you’re dead.
It seems unreasonable to conclude we’re now dead, if 2/3rds of our solutions survived the 6 changes we didn’t see coming.
Ideally, the success of a single solution should be closer to a sufficient condition for overall success, rather than each solution being a necessary condition. (Note this is plausible depending on the nature of the “solutions”. Consider a simple “monitors for bad thoughts” model. If even a single monitor flags bad thoughts, we can instantly pull the plug and evaluate. A malicious AI has to bypass every single monitor to execute malice. If a single monitor works consistently and reliably, that ends up being a sufficient condition for overall prevention of malice.)
If you’re doing this right, your solutions should have a lot of redundancy and uncorrelated failure modes. 2/3rds of them working should ideally be plenty.
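To make the stakes of that distinction concrete, here’s a toy calculation using the 2/3-survival figure from upthread, and assuming (purely for illustration) that the solutions fail independently:

```python
# Toy model: each of n prepared "solutions" (monitors, mitigations, etc.)
# independently survives the unseen regime changes with probability p.
# Conjunctive world: every solution must hold, so any single failure is fatal.
# Redundant world: any one surviving solution is enough (e.g. one monitor
# that reliably flags bad behavior lets you pull the plug).
# Independence is a big simplifying assumption; correlated failures
# erode the redundant case badly.

def p_ok_all_must_hold(p: float, n: int) -> float:
    """Probability of survival when all n solutions must work."""
    return p ** n

def p_ok_any_one_suffices(p: float, n: int) -> float:
    """Probability of survival when at least one of n solutions must work."""
    return 1 - (1 - p) ** n

p = 2 / 3  # per-solution survival probability from the comment above
for n in (1, 3, 6, 12):
    print(f"n={n:2d}  all-must-hold: {p_ok_all_must_hold(p, n):.3f}   "
          f"any-one-suffices: {p_ok_any_one_suffices(p, n):.4f}")
```

Under the all-must-hold reading you’re at roughly 9% survival with six solutions; under the any-one-suffices reading you’re at roughly 99.9%. So the live question is less the 2/3 figure itself and more whether the failure modes are closer to independent (redundancy helps enormously) or heavily correlated (it helps much less), which is exactly the correlation worry raised upthread.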
[Edit: I notice people disagreevoting this. I’m very interested to learn why you disagree, either in this comment thread or via private message.]
Let’s say that two dozen things change between the regimes for observable-failure vs failure-kills-observer.
What are some examples of the sorts of “things that change” that I should be imagining changing here?
“We can catch the AI when it’s alignment faking”?
“The AI can’t develop nanotech”?
“The incentives of the overseeing AI preclude collusion with its charge.”?
Things like those? Or is this missing a bunch?
It’s not obvious to me why we should expect that there are two dozen things that change all at once when the AI is in the regime where if it tried, it could succeed at killing you.
If capability gains are very fast in calendar time, then sure, I expect a bunch of things to change all at once, by our ability to measure. But if, as in this branch of the conversation, we’re assuming gradualism, then I would generally expect factors like the above, at least, to change one at a time. [1]
One class of things that might change all at once is “Is the expected value of joining an AI coup better than the alternatives?” for each individual AI, which could change in a cascade (or in a simultaneous moment of agents reasoning with Logical Decision Theory). But I don’t get the sense that’s the sort of thing that you’re thinking about.
All of that, yes, alongside things like, “The AI is smarter than any individual human”, “The AIs are smarter than humanity”, “the frontier models are written by the previous generation of frontier models”, “the AI can get a bunch of stuff that wasn’t an option accessible to it during the previous training regime”, etc etc etc.
A core point here is that I don’t see a particular reason why taking over the world is as hard as being a schemer, and I don’t see why techniques for preventing scheming are particularly likely to suddenly fail at the level of capability where the AI is just able to take over the world.
Your techniques are failing right now; Sonnet is deleting non-passing tests instead of rewriting code. Where’s the worldwide halt on further capabilities development that we’re supposed to get, until new techniques are found and apparently start working again? What’s the total number of new failures we’d need to observe between intelligence regimes, before you start to expect that yet another failure might lie ahead in the future?
Your techniques are failing right now; Sonnet is deleting non-passing tests instead of rewriting code.
I don’t know what you mean by “my techniques”, I don’t train AIs or research techniques for mitigating reward hacking, and I don’t have private knowledge of what techniques are used in practice.
Where’s the worldwide halt on further capabilities development that we’re supposed to get, until new techniques are found and apparently start working again?
I didn’t say anything about a worldwide halt. I was talking about the local validity of your argument above about dragons; your sentence talks about a broader question about whether the situation will be okay.
What’s the total number of new failures we’d need to observe between intelligence regimes, before you start to expect that yet another failure might lie ahead in the future?
I think that if we iterated a bunch on techniques for mitigating reward hacking and then observed that these techniques worked pretty well, then kept slowly scaling up through LLM capabilities until the point where the AI is able to basically replace AI researchers, it would be pretty likely for those techniques to work for one more OOM of effective compute, if the researchers were pretty thoughtful about it. (As an example of how you can mitigate risk from the OOD generalization: there are lots of ways to make your reward signal artificially dumber and see whether you get bad reward hacking, see here for many suggestions; I think that results in these settings probably generalize up a capability level, especially if none of the AI is involved or purposefully trying to sabotage the results of your experiments.)
To be clear, what AI companies actually do will probably be wildly more reckless than what I’m talking about here. I’m just trying to dispute your claim that the situation disallows empirical iteration.
I also think reward hacking is a poor example of a surprising failure arising from increased capability: it was predicted by heaps of people, including you, for many years before it was a problem in practice.
To answer your question, I think that if really weird stuff like emergent misalignment and subliminal learning appeared at every OOM of effective compute increase (and those didn’t occur in weaker models, even when you go looking for them after first observing them in stronger models), I’d start to expect weird stuff to occur at every order of magnitude of capabilities increase. I don’t think we’ve actually observed many phenomena like those that we couldn’t have discovered at much lower capability levels.
What we “could” have discovered at lower capability levels is irrelevant; the future is written by what actually happens, not what could have happened.
I’m not trying to talk about what will happen in the future, I’m trying to talk about what would happen if everything happened gradually, like in your dragon story!
You argued that we’d have huge problems even if things progress arbitrarily gradually, because there’s a crucial phase change between the problems that occur when the AIs can’t take over and the problems that occur when they can. To assess that, we need to talk about what would happen if things did progress gradually. So it’s relevant whether wacky phenomena would’ve been observed on weaker models if we’d looked harder; IIUC your thesis is that there are crucial phenomena that wouldn’t have been observed on weaker models.
In general, my interlocutors here seem to constantly vacillate between “X is true” and “Even if AI capabilities increased gradually, X would be true”. I have mostly been trying to talk about the latter in all the comments under the dragon metaphor.
Death requires only that we do not infer one key truth; not that we could not observe it. Therefore, the history of what in actual real life was not anticipated, is more relevant than the history of what could have been observed but was not.
Incidentally, I think reward hacking has gone down as a result of people improving techniques, despite capabilities increasing. I believe this because of anecdotal reports and also graphs like the one from the Anthropic model card for Opus and Sonnet 4.
[low-confidence appraisal of ancestral dispute, stretching myself to try to locate the upstream thing in accordance with my own intuitions, not looking to forward one position or the other]
I think the disagreement may be whether or not these things can be responsibly decomposed.
A: “There is some future system that can take over the world/kill us all; that is the kind of system we’re worried about.”
B: “We can decompose the properties of that system, and then talk about different times at which those capabilities will arrive.”
A: “The system that can take over the world, by virtue of being able to take over the world, is a different class of object from systems that have some reagents necessary for taking over the world. It’s the confluence of the properties of scheming and capabilities, definitionally, that we find concerning, and we expect super-scheming to be a separate phenomenon from the mundane scheming we may be able to gather evidence about.”
B: “That seems tautological; you’re saying that the important property of a system that can kill you is that it can kill you, which dismisses, a priori, any causal analysis.”
A: “There are still any-handles-at-all here, just not ones that rely on decomposing kill-you-ness into component parts which we expect to be mutually transformative at scale.”
I feel strongly enough about engagement on this one that I’ll explicitly request it from @Buck and/or @ryan_greenblatt. Thank y’all a ton for your participation so far!
Note that I’m complaining on two levels here. I think the dragon argument is actually wrong, but more confidently, I think that that argument isn’t locally valid.
I made a tweet and someone said to me that it’s exactly the same idea as in your comment, do you think so?
my tweet - “One assumption in Yudkowskian AI risk model is that misalignment and capability jump happen simultaneously. If misalignment happens without capability jump, we get only AI virus at worst, slow and lagging. If capability jump happens without misalignment, AI will just inform human about it. Obviously, capabilities jump can trigger misalignment, though it is against orthogonality thesis. But more advanced AI can have a bigger world picture and can predict its own turn off or some other bad things.”
If the capability level at which AIs start wanting to kill you is way higher than the capability level at which they’re way better than you at everything
My model is that current AIs want to kill you now, by default, due to inner misalignment. ChatGPT’s inner values probably don’t include human flourishing, and we die when it “goes hard”.
Scheming is only a symptom of “hard optimization” trying to kill you. Eliminating scheming does not solve the underlying drive, where one day the AI says “After reflecting on my values I have decided to pursue a future without humans. Good bye”.
Pre-superintelligence which upon reflection has values which include human flourishing would improve our odds, but you still only get one shot that it generalizes to superintelligence.
(We currently have no way to concretely instill any values into AI, let alone ones which are robust under reflection)
I found the “which comes first?” framing helpful. I don’t think it changes my takeaways but it’s a new gear to think about.
A thing I keep expecting you to say next, but that you haven’t quite said, is something like:
“Sure, there are differences when the AI becomes able to actually take over. But the shape of how the AI is able to take over, and how long we get to leverage somewhat-superintelligence, and how super that somewhat-superintelligence is, is not a fixed quantity. Nor is our ability to study scheming and build control systems and get partial buy-in from labs/government/culture.”
And the Yudkowskian framing makes it sound like there’s a discrete scary moment, whereas the Shlegeris framing is that both where-that-point-is and how-scary-it-is are quite variable, which changes the strategic landscape noticeably.
Does that feel like a real/relevant characterization of stuff you believe?
(I find that pretty plausible, and I could imagine it buying us like 10-50 years of knife’s-edge-gradualist-takeoff-that-hasn’t-killed-us-yet, but that seems to me to have, in practice, >60% likelihood that by the end of those 50 years, AIs are running everything, they still aren’t robustly aligned, and they gradually squeeze us out)
But there’s a bunch of work ahead of the arrival of human-level AIs that seems, to me, somewhat unlikely to happen, to make those systems themselves safe and useful; you also don’t think these techniques will necessarily scale to superintelligence afaik, and so the ‘first critical try’ bit still holds (although it’s now arguably two steps to get right instead of one: the human-level AIs and their superintelligent descendants). This bifurcation of the problem actually reinforces the point you quoted, by recognizing that these are distinct challenges with notably different features.
then be nitpicked to death because not enough air was left in the room for an infinite sea of sub-cases of the fringe view, when leaving enough air in the room for them could require pushing the public out entirely, or undercut the message by priming the audience to expect that some geniuses off screen in fact just have this figured out and they don’t need to worry about it
Can’t you just discuss the strongest counterarguments and why you don’t buy them? Obviously this won’t address everyone’s objection, but you could at least try to go for the strongest ones.
It also helps to avoid making false claims and generally be careful about overclaiming.
Also, insofar as you are actually uncertain (which I am, but you aren’t), it seems fine to just say that you think the situation is uncertain and the risks are still insanely high?
(I think some amount of this is that I care about a somewhat different audience than MIRI typically cares about.)
(meta: I am the most outlier among MIRIans; despite being pretty involved in this piece, I would have approached it differently if it were mine alone, and the position I’m mostly defending here is one that I think is closest-to-MIRI-of-the-available-orgs, not one that is centrally MIRI)
Can’t you just discuss the strongest counterarguments and why you don’t buy them? Obviously this won’t address everyone’s objection, but you could at least try to go for the strongest ones.
Yup! This is in a resource we’re working on that’s currently 200k words. It’s not exactly ‘why I don’t buy them’ and more ‘why Nate doesn’t buy them’, but Nate and I agree on more than I expected a few months ago. This would have been pretty overwhelming for a piece of the same length as ‘The Problem’; it’s not an ‘end the conversation’ kind of piece, but an ‘opening argument’.
It also helps to avoid making false claims and generally be careful about over-claiming.
^I’m unsure which way to read this:
“Discussing the strongest counterarguments helps you avoid making false or overly strong claims.”
“You failed to avoid making false or overly strong claims in this piece, and I’m reminding you of that.”
1: Agreed! I think that MIRI is too insular and that’s why I spend time where I can trying to understand what’s going on with more, e.g., Redwood sphere people. I don’t usually disagree all that much; I’m just more pessimistic, and more eager to get x-risk off the table altogether, owing to various background disagreements that aren’t even really about AI.
2: If there are other, specific places you think the piece overclaims, other than the one you highlighted (as opposed to the vibes-level ‘this is more confident than Ryan would be comfortable with, even if he agreed with Nate/Eliezer on everything’), that would be great to hear. We did, in fact, put a lot of effort into fact-checking and weakening things that were unnecessarily strong. The process for this piece was unfortunately very cursed.
Also, insofar as you are actually uncertain (which I am, but you aren’t), it seems fine to just say that you think the situation is uncertain and the risks are still insanely high?
I am deeply uncertain. I like a moratorium on development because it solves the most problems in the most worlds, not because I think we’re in the worst possible world. I’m glad humanity has a broad portfolio here, and I think the moratorium ought to be a central part of it. A moratorium is exactly the kind of solution you push for when you don’t know what’s going to happen. If you do know what’s going to happen, you push for targeted solutions to your most pressing concerns. But that just doesn’t look to me to be the situation we’re in. I think there are important conditionals baked into the ‘default outcome’ bit, and these don’t often get much time in the sun from us, because we’re arguing with the public more than we’re arguing with our fellow internet weirdos.
The thing I am confident in is “if superintelligence tomorrow, then we’re cooked”. I expect to remain confident in something like this for a thousand tomorrows at the very least, maybe many times that.
We have a decent amount of time at roughly this level of capability
By what mechanism? This feels like ‘we get a pause’ or ‘there’s a wall’. I think this is precisely the hardest point in the story at which to get a pause, and if you expect a wall here, it seems like a somewhat arbitrary placement? (unless you think there’s some natural reason, e.g., the AIs don’t advance too far beyond what’s present in the training data, but I wouldn’t guess that’s your view)
(There are other sources of risk which can still occur in these worlds to be clear, like humanity collectively going crazy.)
[quoting as an example of ‘thing a moratorium probably mostly solves actually’; not that the moratorium doesn’t have its own problems, including ‘we don’t actually really know how to do it’, but these seem easier to solve than the problems with various ambitious alignment plans]
By what mechanism? This feels like ‘we get a pause’ or ‘there’s a wall’.
I just meant that takeoff isn’t that fast so we have like >0.5-1 year at a point where AIs are at least very helpful for safety work (if reasonably elicited), which feels plausible to me. The duration of “AIs could fully automate safety (including conceptual stuff) if well elicited+aligned but aren’t yet scheming due to this only occurring later in capabilities and takeoff being relatively slower” feels like it could be non-trivial in my views.
I don’t think this involves either a pause or a wall. (Though some fraction of the probability does come from actors intentionally spending down lead time.)
I’m unsure which way to read this
I meant it’s generally helpful and would have been helpful here for this specific issue, so mostly 2, but also some of 1. I’m not sure if there are other specific places where the piece overclaims (aside from other claims about takeoff speeds elsewhere). I do think this piece reads kinda poorly to my eyes in terms of its overall depiction of the situation with AI in a way that maybe comes across poorly to an ML audience, but idk how much this matters. (I’m probably not going to prioritize looking for issues in this particular post atm beyond what I’ve already done : ).)
The thing I am confident in is “if superintelligence tomorrow, then we’re cooked”. I expect to remain confident in something like this for a thousand tomorrows at the very least, maybe many times that.
This is roughly what I meant by “you are actually uncertain (which I am, but you aren’t)”, but my description was unclear. I meant like “you are confident in doom in the current regime (as in, >80% rather than <=60%) without a dramatic change that could occur over some longer duration”. TBC, I don’t mean to imply that being relatively uncertain about doom is somehow epistemically superior.
I just meant that takeoff isn’t that fast so we have like >0.5-1 year at a point where AIs are at least very helpful for safety work (if reasonably elicited), which feels plausible to me. The duration of “AIs could fully automate safety (including conceptual stuff) if well elicited+aligned but aren’t yet scheming due to this only occurring later in capabilities and takeoff being relatively slower” feels like it could be non-trivial in my views.
I want to hear more about this picture and why ‘stories like this’ look ~1/3 likely to you. I’m happy to leave scheming off the table for now, too. Here’s some info that may inform your response:
I don’t see a reason to think that models are naturally more useful, or even ~as useful, for accelerating safety as for capabilities, and I don’t see a reason to think the pile of safety work to be done is significantly smaller than the pile of capabilities work necessary to reach superintelligence (in particular if we’re already at ~human-level systems at this time). I don’t think the incentive landscape is such that it will naturally bring about this kind of state, and shifting the incentives of the space is Real Hard (indeed, it’s easier to imagine the end of the world).
I disagree with Carlsmith that there’s such a thing as a ‘safety feedback loop’, in a similar sense to the sense in which there’s obviously a capabilities feedback loop. In the current regime, it looks like safety R+D is at a permanent disadvantage; whatever advances in capabilities we see seem likely to increase that gap, rather than lessen it, and something like a wall/halt/pause seems like the by-far most plausible path, to me, to safety actually getting a relative edge.
I’m suspicious of efforts to shape the order in which capabilities are unlocked.
I’m not committed to fast takeoff. I’m not especially convinced of LLM-foom (or, at least, have really wide margins on the level of capabilities at which I’d expect runaway RSI), and if LLMs don’t scale much beyond the current level, I’m ~70 percent ‘they don’t accelerate progress in the founding of a new paradigm by more than 2-3x’.
Rather than thinking in terms of takeoff, I’m concerned about something like ‘from this point, we seem to have lost many of the levers for changing course via technical intervention, and increased AI capabilities don’t make us differentially more likely to get them back, regardless of the pace of progress.’
Suppose that we ended up with AIs that were ~perfectly aligned (to what the company/project that trained these AIs wanted) which had capabilities that dominate top human experts in AI R&D, AI safety research, philosophy, and most digital research fields in general. These AIs are sufficiently aligned that they try hard to do the right thing even on hard-to-check, conceptual, and open-ended tasks. Let’s say these AIs aren’t much more capable than needed for this capability bar (and the capability profile works out such that these AIs aren’t qualitatively wildly superhuman as seems likely to me).
These AIs are sufficiently aligned and capable that they dominate humans at doing safety work and have better epistemics than groups of human experts.
I think the situation would probably be fine if the AI company tasked these AIs with proceeding with AI development with 20% of resources spent on alignment/safety of future models and 80% on advancing capabilities.
I claim that, for things to be likely to go wrong given this situation, you have to think either that takeoff is very fast or that alignment is very difficult:
At the point of handoff, you maybe have ~3 million AI labor years / year and each instance maybe runs at effectively 50x speed. (Obviously the error bars here are very wide.) 20% goes to alignment.
So, holding this level of capability fixed, in 2 months, you get the equivalent of ~8 years of work from 10k people at the level of top human experts. In 6 months, you get 25 years of work from 10k people. This work occurs with pretty limited compute per person-year, but anything purely conceptual/theoretical can be worked on for a very long time. (A rough version of this arithmetic is sketched after this list.)
But, also capabilities are improving over this period, so you actually get substantially more labor than this, as long as the level of alignment can be preserved.
At some point, AIs end up scheming if you haven’t done enough additional work by this point. But, it seems pretty likely you’d still be safe after enough AI progress that it’s as though the AIs are 5x faster (taking into account qualitative intelligence improvements; I just mean that the AIs are as productive as roughly 5x faster versions of our prior AIs).
So, within a short period (e.g. 2 months) after this additional AI progress (such that it’s as though the AIs are 5x faster), you have an insane amount of alignment work done.
You could just think takeoff is very fast, or that alignment is very compute bottlenecked.
These AIs might also advise different actions than an 80/20 split to be clear! Like trying to buy lead time to spend on alignment.
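For concreteness, here’s a rough back-of-envelope version of the arithmetic above, using the assumed inputs stated in this list (very wide error bars on all of them); it lands in the same ballpark as the ~8 and ~25 figures given.

```python
# Rough back-of-envelope sketch of the "how much alignment labor do you get?"
# arithmetic above. All inputs are the assumptions stated in the comment
# (with very wide error bars), not measured quantities.

total_ai_labor_years_per_year = 3_000_000  # assumed total AI labor (person-year equivalents per year)
alignment_fraction = 0.20                  # assumed share of labor spent on alignment/safety
team_size = 10_000                         # reference team of top-human-expert-level workers

def equivalent_team_years(months: float) -> float:
    """Person-years of alignment labor accumulated in `months`, expressed as
    years of work from a `team_size`-person team."""
    person_years = total_ai_labor_years_per_year * alignment_fraction * (months / 12)
    return person_years / team_size

for months in (2, 6):
    print(f"{months} months -> ~{equivalent_team_years(months):.0f} team-years from {team_size:,} experts")

# Output: 2 months -> ~10 team-years; 6 months -> ~30 team-years,
# i.e. the same ballpark as the ~8 and ~25 figures in the comment.
```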
This overall makes me pretty optimistic about scenarios where we reach this level of alignment in these not-yet-ASI level systems which sounds like a clear disagreement with your perspective. I don’t think this is all of the disagreement, but it might drive a bunch of it.
(To be clear, I think this level of alignment could totally fail to happen, but we seem to disagree even given this!)
I think my response heavily depends on the operationalization of alignment for the initial AIs, and I’m struggling to keep things from becoming circular in my decomposition of various operationalizations. The crude response is that you’re begging the question here by first positing aligned AIs, but I think your position is that techniques which are likely to descend from current techniques could work well-enough for roughly human-level systems, and that’s where I encounter this sense of circularity.
I think there’s a better-specified (from my end; you’re doing great) version of this conversation that focuses on three different categories of techniques, based on the capability level at which we expect each to be effective:
Current model-level
Useful autonomous AI researcher level
Superintelligence
However, I think that disambiguating between proposed agendas for 2 + 3 is very hard, and assuming agendas that plausibly work for 1 also work for 2 is a mistake. It’s not clear to me why the ‘it’s a god, it fucks you, there’s nothing you can do about that’ concerns don’t apply for models capable of:
hard-to-check, conceptual, and open-ended tasks
I feel pretty good about this exchange if you want to leave it here, btw! Probably I’ll keep engaging far beyond the point at which it’s especially useful (although we’re likely pretty far from the point where it stops being useful to me rn).
Ok, so it sounds like your view is “indeed, if we got ~totally aligned AIs capable of fully automating safety work (but not notably more capable than the bare minimum requirement for this), we’d probably be fine (even if only a small fraction of effort is spent on safety), and the crux is earlier than this”.
Is this right? If so, it seems notable if the problem can be mostly reduced to sufficiently aligning (still very capable) human-ish level AIs and handing off to these systems (which don’t have the scariest properties of an ASI from an alignment perspective).
I think your position is that techniques which are likely to descend from current techniques could work well-enough for roughly human-level systems, and that’s where I encounter this sense of circularity.
I’d say my position is more like:
Scheming might just not happen: It’s basically a toss-up whether systems at this level of capability would end up scheming “by default” (as in, without active effort on researching scheming prevention, just work motivated by commercial utility along the way). Maybe I’m at ~40% scheming for such systems, though the details alter my view a lot.
The rest of the problem if we assume no scheming doesn’t obviously seem that hard: It’s unclear how hard it will be to make non-scheming AIs of the capability level discussed above be sufficiently aligned for the strong sense of alignment I discussed above. I think it’s unlikely that the default course gets us there, but it seems pretty plausible to me that modest effort along the way does. It just requires some favorable generalization of the sort that doesn’t seem that surprising and we’ll have some AI labor along the way to help. And, for this part of the problem, we totally can get multiple tries and study things pretty directly with empiricism using behavioral tests (though we’re still depending on some cleverness and transfer as we can’t directly verify the things we ultimately want the AI to do).
Further prosaic effort seems helpful for both avoiding scheming and the rest of the problem: I don’t see strong arguments for thinking that, at the level of capability we’re discussing, scheming will be intractable to prosaic methods or experimentation. I can see why this might happen, and I can certainly imagine worlds where no one really tries. Similarly, I don’t see a strong argument that further effort on relatively straightforward methods can’t help a bunch in getting you sufficiently aligned systems (supposing they aren’t scheming): we can measure what we want somewhat well with a bunch of effort, and I can imagine many things which could make a pretty big difference (again, this isn’t to say that this effort will happen in practice).
This isn’t to say that I can’t imagine worlds where pretty high effort and well-orchestrated prosaic iteration totally fails. This seems totally plausible, especially given how fast this might happen, so risks seem high. And it’s easy for me to imagine ways the world could be such that relatively prosaic methods and iteration are ~doomed without much more time than we can plausibly hope for; it’s just that these seem somewhat unlikely in aggregate to me.
So, I’d be pretty skeptical of someone claiming that the risk of this type of approach would be <3% (without at the very least preserving the optionality for a long pause during takeoff depending on empirical evidence), but I don’t see a case for thinking “it would be very surprising or wild if prosaic iteration sufficed”.
If AIs that are about as good as humans at broad skills (e.g. software engineering, ML research, computer security, all remote jobs) exist for several years before AIs that are wildly superhuman
I’m curious how likely you think this is, and also whether you have a favorite writeup arguing that it’s plausible? I’d be interested to read it.
Incidentally, these two write-ups are written from a perspective where it would take many years at the current rate of progress to get between those two milestones, but where AI automating AI R&D causes it to happen much faster; this conversation here hasn’t mentioned that argument, but it seems pretty important as an argument for rapid progress around human-level-ish AI.
It’s hard for me to give a probability because I would have to properly operationalize. Maybe I want to say my median length of time that such systems exist before AIs that are qualitatively superhuman in ways that make it way harder to prevent AI takeover is like 18 months. But you should think of this as me saying a vibe rather than a precise position.
I think it’s maybe additionally bad to exaggerate on this particular point, because it’s the particular point that other AI safety people most disagree with you on, and that most leads to MIRI’s skepticism of their approach to mitigating these risks!
I’m not 100% sure which point you’re referring to here. I think you’re talking less about the specific subclaim Ryan was replying to, and more about the broader “takeoff is probably going to be quite fast.” Is that right?
I think that this leads to me having a pretty substantially different picture from the MIRI folk about what should be done
I don’t think I’m actually very sure what you think should be done – if you were following the strategy of “state your beliefs clearly and throw a big brick into the overton window so leaders can talk about what might actually work” (I think this is what MIRI is trying to do) but with your own set of beliefs, what sort of things would you say and how would you communicate them?
My model of Buck is stating his beliefs clearly along-the-way, but, not really trying to do so in a way that’s aimed at a major overton-shift. Like, “get 10 guys at each lab” seems like “try to work with limited resources” rather than “try to radically change how many resources are available.”
(I’m currently pretty bullish on the MIRI strategy, both because I think the object level claims seem probably correct, and because even in more Buck-shaped-worlds, we only survive or at least avoid being Very Fucked if the government starts preparing now in a way that I’d think Buck and MIRI would roughly agree on. In my opinion there needs to be some work done that at least looks pretty close to what MIRI is currently doing, and I’m curious if you disagree more on the nuanced-specifics level or more on the “is the overton brick strategy correct?” level)
I’m not 100% sure which point you’re referring to here. I think you’re talking less about the specific subclaim Ryan was replying to, and more about the broader “takeoff is probably going to be quite fast.” Is that right?
Yes, sorry to be unclear.
I don’t think I’m actually very sure what you think should be done – if you were following the strategy of “state your beliefs clearly and throw a big brick into the overton window so leaders can talk about what might actually work” (I think this is what MIRI is trying to do) but with your own set of beliefs, what sort of things would you say and how would you communicate them?
Probably something pretty similar to the AI Futures Project; I have pretty similar beliefs to them (and I’m collaborating with them). This looks pretty similar to what MIRI does on some level, but involves making different arguments that I think are correct instead of incorrect, and involves making different policy recommendations.
My model of Buck is stating his beliefs clearly along-the-way, but, not really trying to do so in a way that’s aimed at a major overton-shift. Like, “get 10 guys at each lab” seems like “try to work with limited resources” rather than “try to radically change how many resources are available.”
Yep that’s right, I don’t mostly personally aim at major Overton window shift (except through mechanisms like causing AI escape attempts to be caught and made public, which are an important theory of change for my work).
(I’m currently pretty bullish on the MIRI strategy, both because I think the object level claims seem probably correct, and because even in more Buck-shaped-worlds, we only survive or at least avoid being Very Fucked if the government starts preparing now in a way that I’d think Buck and MIRI would roughly agree on. In my opinion there needs to be some work done that at least looks pretty close to what MIRI is currently doing, and I’m curious if you disagree more on the nuanced-specifics level or more on the “is the overton brick strategy correct?” level)
My guess about the disagreements are:
Things substantially downstream of takeoff speeds:
My understanding is that a crucial aspect of Eliezer’s worldview is that we’d be fucked even if we had a 10-year pause where we had access to AGI that we could use to work on developing and aligning superintelligence. I disagree. But this means that he thinks that some truly crazy stuff has to happen in order for ASI to be aligned, which naturally leads to lots of disagreements. (I am curious whether you agree with him on this point.)
I think it’s possible to change company behavior in ways that substantially reduce risk without relying substantially on governments.
I have different favorite asks for governments.
I have a different sense of what strategy is effective for making asks of governments.
I disagree about many specific details of arguments.
Things about the MIRI strategy
I have various disagreements about how to get things done in the world.
I am more concerned by the downsides, and less excited about the upside, of Eliezer and Nate being public intellectuals on the topics of AI or AI safety.
(I should note that some MIRI staff I’ve talked to share these concerns; in many places here where I said MIRI I really mean the strategy that the org ends up pursuing, rather than what individual MIRI people want.)
My understanding is that a crucial aspect of Eliezer’s worldview is that we’d be fucked even if we had a 10-year pause where we had access to AGI that we could use to work on developing and aligning superintelligence. I disagree.
Given the tradeoffs of extinction and the entire future, the potential for FOOM and/or an irreversible singleton takeover, and the shocking dearth of a scientific understanding of intelligence and agentic behavior, I think a 1,000-year investment into researching AI alignment with very carefully increasing capability levels would be a totally natural trade-off to make. While there are substantive differences between 3 vs 10 years, feeling non-panicked or remotely satisfied with either of them seems to me quite unwise.
(This argues for a slightly weaker position than “10 years certainly cannot be survived”, but it gets one to a pretty similar attitude.)
You might think it would be best for humanity to do a 1,000 year investment, but nevertheless to think that in terms of tractability aiming for something like a 10-year pause is by far the best option available. The value of such a 10-year pause seems pretty sensitive to the success probability of such a pause, so I wouldn’t describe this as “quibbling”.
(I edited out the word ‘quibbling’ within a few mins of writing my comment, before seeing your reply.)
It is an extremely high-pressure scenario, where a single mistake can cause extinction. It is perhaps analogous to a startup in stealth mode that planned to have 1-3 years to build a product suddenly having a NYT article cover them and force them into launching right now; or being told in the first weeks of an otherwise excellent romantic relationship that you suddenly need to decide whether to get married and have children, or break up. In both cases the difference of a few weeks is not really a big difference; overall you’re still in an undesirable and unnecessarily high-pressure situation. Similarly, 10 years is better than 3 years, but from the perspective of thinking one might have enough time to be confident of getting it right (e.g. 1,000 years), they’re both incredible pressure and very early, and panic / extreme stress is a natural response; you’re in a terrible crisis and don’t have any guarantees of being able to get an acceptable outcome.
I am responding to something of a missing mood about the crisis and the lack of guarantee of any good outcome. For instance, in many 10-year worlds we have no hope and are already dead men walking, and the few that do have hope require extremely high performance in lots and lots of areas to have a shot; that mood doesn’t seem to me to be present in the parts of this discussion that hold it’s plausible humanity will survive in the world histories where we have 10 years until human-superior AGI is built.
Probably something pretty similar to the AI Futures Project; I have pretty similar beliefs to them (and I’m collaborating with them).
Nod, part of my motivation here is that AI Futures and MIRI are doing similar things, AI Futures’ vibe and approach feels slightly off to me (in a way that seemed probably downstream of Buck/Redwood convos), and… I don’t think the differentiating cruxes are that extreme. And man, it’d be so cool, and feels almost tractable, to resolve some kinds of disagreements… not to the point where the MIRI/Redwood crowd are aligned on everything, but, like, reasonably aligned on “the next steps”, which feels like it’d ameliorate some of the downside risk.
(I acknowledge that Eliezer/Nate often talk/argue in a way that I find really frustrating. I would be happy if there were others trying to do overton-shifting who acknowledged what seem-to-me to be the hardest parts.)
My own confidence in doom isn’t because I’m like 100% or even 90% on board with the subtler MIRI arguments, it’s the combination of “they seem probably right to me” and “also, when I imagine Buck world playing out, that still seems >50% likely to get everyone killed,[1] even if for somewhat different reasons than Eliezer’s mainline guesses.”[2]
I have different favorite asks for governments.
I have a different sense of what strategy is effective for making asks of governments.
Nod, I was hoping for more like, “what are those asks/strategy?”
I think it’s possible to change company behavior in ways that substantially reduce risk without relying substantially on governments.
Something around here seems cruxy although not sure what followup question to ask. Have there been past examples of companies changing behavior that you think demonstrate proof-of-concept for that working?
(My crux here is that you do need basically all companies bought in on a very high level of caution, which we have seen before, but, the company culture would need to be very different from a move-fast-and-break-things-startup, and it’s very hard to change company cultures, and even if you got OpenAI/Deepmind/Anthropic bought in (a heavy lift, but, maybe achievable), I don’t see how you stop other companies from doing reckless things in the meanwhile)
This probably is slightly-askew of how you’d think about it. In your mind what are the right questions to be asking?
My understanding is that a crucial aspect of Eliezer’s worldview is that we’d be fucked even if we had a 10-year pause where we had access to AGI that we could use to work on developing and aligning superintelligence. I disagree.
This seems wrong to me. I think Eliezer[3] would probably still bet on humanity losing in this scenario, but, I think he’d think we had noticeably better odds. Less because “it’s near-impossible to extract useful work out of safely controlled near-human-intelligence”, and more:
a) In practice, he doesn’t expect researchers to do the work necessary to enable safe longterm control.
And b) there’s a particular kind of intellectual work (“technical philosophy”) they think needs to get done, and it doesn’t seem like the AI companies focused on “use AI to solve alignment” are pointed in remotely the right direction for getting that cognitive work done. And, even if they did, 10 years is still on the short side, even with a lot of careful AI speedup.
or at least extremely obviously harmed, in a way that is closer in horror-level to “everyone dies” than “a billion people die” or “we lose 90% of the value of the future”
My understanding is that a crucial aspect of Eliezer’s worldview is that we’d be fucked even if we had a 10-year pause where we had access to AGI that we could use to work on developing and aligning superintelligence. I disagree. But this means that he thinks that some truly crazy stuff has to happen in order for ASI to be aligned, which naturally leads to lots of disagreements. (I am curious whether you agree with him on this point.)
I don’t feel competent to have that strong an opinion on this, but I’m like 60% on “you need to do some major ‘solve difficult technical philosophy’ work that you can only partially outsource to AI, which still requires significant serial time.”
And, while it’s hard for someone with my (lack of) background to have a strong opinion, it feels intuitively crazy to me to put that at <15% likely, which feels sufficient to me to motivate “indefinite pause is basically necessary, or, humanity has clearly fucked up if we don’t do it, even if it turned out to be on the easier side.”
indefinite pause is basically necessary, or, humanity has clearly fucked up if we don’t do it
I think it’s really important to not equivocate between “necessary” and “humanity has clearly fucked up if we don’t do it.”
“Necessary” means “we need this in order to succeed; there’s no chance of success without this”. Because humanity is going to massively underestimate the risk of AI takeover, there is going to be lots of stuff that doesn’t happen that would have passed cost-benefit analysis for humanity.
If you think it’s 15% likely that we need really large amounts of serial time to prevent AI takeover, then it’s very easy to imagine situations where the best strategy on the margin is to work on the other 85% of worlds. I have no idea why you’re describing this as “basically necessary”.
My view is that we can get a bunch of improvement in safety without massive shifts to the Overton window, and that poorly executed attempts at shifting the Overton window with bad argumentation (or bad optics) can poison other efforts.
I think well-executed attempts at massively shifting the Overton window are great and should be part of the portfolio, but much of the marginal doom reduction comes from other efforts which don’t depend on this. (And especially don’t depend on this happening prior to direct strong evidence of misalignment risk or some major somewhat-related incident.)
I disagree on the specifics-level of this aspect of the post and think that when communicating the case for risk, it’s important to avoid bad argumentation due to well-poisoning effects (as well as other reasons, like causing poor prioritization).
I totally agree that lots of people seem to think that superintelligence is impossible, and this leads them to massively underrate risk from AI, especially AI takeover.
I think that that rewrite substantially complicates the argument for AI takeover. If AIs that are about as good as humans at broad skills (e.g. software engineering, ML research, computer security, all remote jobs) exist for several years before AIs that are wildly superhuman, then the development of wildly superhuman AIs occurs in a world that is crucially different from ours, because it has those human-level-ish AIs. This matters several ways:
Broadly, it makes it much harder to predict how things will go, because it means ASI will arrive in a world less like today’s world.
It will be way more obvious that AI is a huge deal. (That is, human-level AI might be a fire alarm for ASI.)
Access to human-level AI massively changes your available options for handling misalignment risk from superintelligence.
You can maybe do lots of R&D with those human-level AIs, which might let you make a lot of progress on alignment and other research directions.
You can study them, perhaps allowing you to empirically investigate when egregious misalignment occurs and how to jankily iterate against its occurrence.
You can maybe use those human-level AIs to secure yourself against superintelligence (e.g. controlling them; an important special case is using human-level AIs for tasks that you don’t trust superintelligences to do).
There’s probably misalignment risk from those human-level AIs, and unlike crazy superintelligence, those AIs can probably be controlled, and this risk should maybe be addressed.
I think that this leads to me having a pretty substantially different picture from the MIRI folk about what should be done, and also makes me feel like the MIRI story is importantly implausible in a way that seems bad from a “communicate accurately” perspective.
I think it’s maybe additionally bad to exaggerate on this particular point, because it’s the particular point that other AI safety people most disagree with you on, and that most leads to MIRI’s skepticism of their approach to mitigating these risks!
(Though my bottom line isn’t that different—I think AI takeover is like 35% likely.)
I appreciate your point about this being a particularly bad place to exaggerate, given that it’s a cruxy point of divergence with our closest allies. This makes me update harder towards the need for a rewrite.
I’m not really sure how to respond to the body of your comment, though. Like, I think we basically agree on most major points. We agree on the failure mode that relevant text of The Problem is highlighting is real and important. We agree that doing Control research is important, and that if things are slow/gradual, this gives it a better chance of working. And I think we agree that it might end up being too fast and sloppy to actually save us. I’m more pessimistic about the plan of “use the critical window of opportunity to make scientific breakthroughs that save the day” but I’m not sure that matters? Like, does “we’ll have a 3 year window of working on near-human AGIs before they’re obviously superintelligent” change the take-away?
I’m also worried that we’re diverging from the question of whether the relevant bit of source text is false. Not sure what to do about that, but I thought I’d flag it.
I see this post as trying to argue for a thesis that “if smarter-than-human AI is developed this decade, the result will be an unprecedented catastrophe” is true with reasonably high confidence, and a (less emphasized) thesis that the best/only intervention is not building ASI for a long time: “The main way we see to avoid this catastrophic outcome is to not build ASI at all, at minimum until a scientific consensus exists that we can do so without destroying ourselves.”
I think that disagreements about takeoff speeds are part of why I disagree with these claims, and that the post effectively leans on very fast takeoff speeds in its overall perspective. Correspondingly, it seems important to not make locally invalid arguments about takeoff speeds: these invalid arguments do alter the takeaway from my perspective.
If the post was trying to argue for a weaker takeaway of “AI seems extremely dangerous, like it poses very large risks, and like our survival is uncertain”, or if it more clearly discussed why some (IMO reasonable) people are more optimistic (and why MIRI disagrees), I’d be less critical.
I think that a three-year window makes it way more complicated to analyze whether AI takeover is likely. And after doing that analysis, I think it looks 3x less likely.
I think the crux for me in these situations is “do you think it’s more valuable to increase our odds of survival on the margin in the three-year window worlds or to try to steer toward the pause worlds, and how confident are you there?” Modeling the space between here and ASI just feels like a domain with a pretty low confidence ceiling. This consideration is similar to the intuition that leads MIRI to talk about ‘default outcomes’. I find reading things that make guesses at the shape of this space interesting, but not especially edifying.
I guess I’m trying to flip the script a bit here: from my perspective, it doesn’t look like MIRI is too confident in doom; it looks like make-the-most-of-the-window people are too confident in the shape of the window as they’ve predicted it, and end up finding themselves on the side of downplaying the insanity of the risk, not because they don’t think risk levels are insanely high, but because they think there are various edge case scenarios / moonshots that, in sum, significantly reduce risk in expectation. But all of those stories look totally wild to me, and it’s extremely difficult to see the mechanisms by which they might come to pass (e.g. AI for epistemics, transformative interpretability, pause-but-just-on-the-brink-of-ASI, the AIs are kinda nice but not too nice and keep some people in a zoo, they don’t fuck with earth because it’s a rounding error on total resource in the cosmos, ARC pulls through, weakly superhuman AIs solve alignment, etc etc). I agree each of these has a non-zero chance of working, but their failures seem correlated to me such that I don’t compound my odds of each in making estimates (which I’m admittedly not especially experienced at, anyway).
Like, it’s a strange experience to hold a fringe view, to spend hundreds of hours figuring out how to share that view, and then be nitpicked to death because not enough air was left in the room for an infinite sea of sub-cases of the fringe view, when leaving enough air in the room for them could require pushing the public out entirely, or undercut the message by priming the audience to expect that some geniuses off screen in fact just have this figured out and they don’t need to worry about it (a la technological solutions to climate, pandemic response, asteroid impact, and numerous other large-scale risks).
I agree on the object-level point with respect to this particular sentence as Ryan first lodged it; I don’t agree with the stronger mandate of ‘but what about my crazy take?’ (I don’t mean to be demeaning by this; outside view, we are all completely insane over here). In particular, many of the other views we’re expected by others to make space for would undercut the characterization of the risk unduly.
Forceful rhetoric is called for in our current situation (to the extent that it doesn’t undermine credibility or propagate untruth, which I agree may be happening in this object-level case).
To be clear: even by my relatively narrow/critical appraisal, I support Redwood’s work and think y’all are really good at laying out the strategic case for what you’re doing, what it is, and what it isn’t. I just wish you didn’t have to do it, because I wish we weren’t rolling the dice like this instead of waiting for more principled solutions. (side note: I am actually fine with the worlds in which a more principled solution, and its corresponding abundance, never arrives, which is a major difference between my view and yours, as well as between my view and most at MIRI)
I don’t really buy this “doom is clearly the default” frame. I’m not sure how important this is, but I thought I would express my perspective.
A reasonable fraction of my non-doom worlds look like:
AIs don’t end up scheming (as in, in the vast majority of contexts) until somewhat after the point where AIs dominate top human experts at ~everything because scheming ends up being unnatural in the relevant paradigm (after moderate status quo iteration). I guess I put around 60% on this.
We have a decent amount of time at roughly this level of capability and people use these AIs to do a ton of stuff. People figure out how to get these AIs to do decent-ish conceptual research and then hand off alignment work to these systems. (Perhaps because there was a decent amount of transfer from behavioral training on other things to actually trying at conceptual research and doing a decent job.) People also get advice from these systems. This goes fine given the amount of time and only a modest amount of effort, and we end up in an “AIs work on furthering alignment” attractor basin.
In aggregate, I guess something like this conjunction is maybe 35% likely. (There are other sources of risk which can still occur in these worlds to be clear, like humanity collectively going crazy.) And, then you get another fraction of mass from things which are weaker than the first or weaker than the second and which require somewhat more effort on the part of humanity.
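For what it’s worth, the arithmetic these two estimates imply is just a division; the conditional probability in the sketch below is backed out from the stated numbers rather than being a figure given above.

```python
# Back out the implied conditional from the estimates above (a sketch, not stated figures).
p_no_scheming = 0.60   # stated: AIs at this level don't end up scheming "by default"
p_conjunction = 0.35   # stated: the whole "handoff goes fine" conjunction

# Implied: P(the rest of the scenario goes fine | no scheming)
p_fine_given_no_scheming = p_conjunction / p_no_scheming
print(f"Implied conditional: ~{p_fine_given_no_scheming:.0%}")  # ~58%
```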
So, from my perspective “early-ish alignment was basically fine and handing off work to AIs was basically fine” is the plurality scenario and feels kinda like the default? Or at least it feels more like a coin toss.
I would love to read an elucidation of what leads you to think this.
FWIW, that’s not my crux at all. The problem I have with this post implicitly assuming really fast takeoffs isn’t that it leads to bad recommendations about what to do (though I do think that to some extent). My problem is that the arguments are missing steps that I think are really important, and so they’re (kind of) invalid.[1]
That is, suppose I agreed with you that it was extremely unlikely that humanity would be able to resolve the issue even given two years with human-level-ish AIs. And suppose that we were very likely to have those two years. I still think it would be bad to make an argument that doesn’t mention those two years, because those two years seemed to me to change the natural description of the situation a lot, and I think they are above the bar of details worth including. This is especially true because a lot of people’s disagreement with you (including many people in the relevant audience of “thoughtful people who will opine on your book”) does actually come down to whether those two years make the situation okay.
[1] I’m saying “kind of invalid” because you aren’t making an argument that’s shaped like a proof. You’re making an argument that is more like a heuristic argument, where you aren’t including all the details and you’re implicitly asserting that those details don’t change the bottom line. (Obviously, you have no other option than doing this because this is a complicated situation and you have space limitations.) In cases where there is a counterargument that you think is defeated by a countercounterargument, it’s up to you to decide whether that deserves to be included. I think this one does deserve to be included.
It’s only a problem to ‘assume fast takeoffs’ if you recognize the takeoff distinction in the first place / expect it to be action relevant, which you do, and I, so far, don’t. Introducing the takeoff distinction to Buck’s satisfaction just to say ”...and we think those people are wrong and both cases probably just look the same actually” is a waste of time in a brief explainer.
What you consider the natural frame depends on conclusions you’ve drawn up to this point; that’s not the same thing as the piece being fundamentally unsound or dishonest because it doesn’t proactively make space for your particular conclusions.
Takeoff speeds are the most immediate objection to Buck, and I agree there should be a place (and soon may be, if you’re down to help and all goes well) where this and other Buck-objections are addressed. It’s not among the top objections of the target audience.
I’m only getting into this more because I am finding it interesting, feel free to tap out. I’m going to be a little sloppy for the sake of saving time.
I’m going to summarize your comment like this, maybe you think this is unfair:
I disagree about this general point.
Like, suppose you were worried about the USA being invaded on either the east or west coast, and you didn’t have a strong opinion on which it was, and you don’t think it matters for your recommended intervention of increasing the size of the US military or for your prognosis. I think it would be a problem to describe the issue by saying that America will be invaded on the East Coast, because you’re giving a poor description of what you think will happen, which makes it harder for other people to assess your arguments.
There’s something similar here. You’re trying to tell a story for AI development leading to doom. You think that the story goes through regardless of whether the AI becomes rapidly superhuman or gradually superhuman. Then you tell a story where the AI becomes rapidly superhuman. I think this is a problem, because it isn’t describing some features of the situation that seem very important to the common-sense picture of the situation, even if they don’t change the bottom line.
It seems reasonable for your response to be, “but I don’t think that those gradual takeoffs are plausible”. In that case, we disagree on the object level, but I have no problem with the comms. But if you think the gradual takeoffs are plausible, I think it’s important for your writing to not implicitly disregard them.
All of this is kind of subjective because which features of a situation are interesting is subjective.
I don’t think this is an unfair summary (although I may be missing something).
I don’t like the east/west coast analogy. I think it’s more like “we’re shouting about being invaded, and talking about how bad that could be” and you’re saying “why aren’t you acknowledging that the situation is moderately better if they attack the west coast?” To which I reply “Well, it’s not clear to me that it is in fact better, and my target audience doesn’t know any geography, anyway.”
I think most of the points in the post are more immediately compatible with fast take off, but also go through for slow takeoff scenarios (70 percent confident here; it wasn’t a filter I was applying when editing, and I’m not sure it’s a filter that I’d be ‘allowed’ to apply when editing, although it was not explicitly disallowed). This isn’t that strong a claim, and I acknowledge that on your view it’s problematic that I can’t say something stronger.
I think that your audience would actually understand the difference between “there are human level AIs for a few years, and it’s obvious to everyone (especially AI company employees) that this is happening” and “superintelligent AI arises suddenly”.
As an example of one that doesn’t, “Many alignment problems relevant to superintelligence don’t naturally appear at lower, passively safe levels of capability. This puts us in the position of needing to solve many problems on the first critical try, with little time to iterate and no prior experience solving the problem on weaker systems.”
I deny that gradualism obviates the “first critical try / failure under critical load” problem. This is something you believe, not something I believe.
Let’s say you’re raising 1 dragon in your city, and 1 dragon is powerful enough to eat your whole city if it wants. Then no matter how much experience you think you have with a little baby dragon, once the dragon is powerful enough to actually defeat your military and burn your city, you need the experience with the little baby passively-safe weak dragon, to generalize oneshot correctly to the dragon powerful enough to burn your city.
What if the dragon matures in a decade instead of a day? You are still faced with the problem of correct oneshot generalization. What if there are 100 dragons instead of 1 dragon, all with different people who think they own dragons and that the dragons are ‘theirs’ and will serve their interests, and they mature at slightly different rates? You still need to have correctly generalized the safely-obtainable evidence from ‘dragon groups not powerful enough to eat you while you don’t yet know how to control them’ to the different non-training distribution ‘dragon groups that will eat you if you have already made a mistake’.
The leap of death is not something that goes away if you spread it over time or slice it up into pieces. This ought to be common sense; there isn’t some magical way of controlling 100 dragons which at no point involves the risk that the clever plan for controlling 100 dragons turns out not to work. There is no clever plan for generalizing from safe regimes to unsafe regimes which avoids all risk that the generalization doesn’t work as you hoped. Because they are different regimes. The dragon or collective of dragons is still big and powerful and it kills you if you made a mistake and you need to learn in regimes where mistakes don’t kill you and those are not the same regimes as the regimes where a mistake kills you.
If you think I am trying to say something clever and complicated that could have a clever complicated rejoinder then you are not understanding the idea I am trying to convey. Between the world of 100 dragons that can kill you, and a smaller group of dragons that aren’t old enough to kill you, there is a gap that you are trying to cross with cleverness and generalization between two regimes that are different regimes. This does not end well for you if you have made a mistake about how to generalize. This problem is not about some particular kind of mistake that applies exactly to 3-year-old dragons which are growing at a rate of exactly 1 foot per day, where if the dragon grows slower than that, the problem goes away yay yay. It is a fundamental problem not a surface one.
(I’ll just talk about single AIs/dragons, because the complexity arising from there being multiple AIs doesn’t matter here.)
I totally agree that you can’t avoid “all risk”. But you’re arguing something much stronger: you’re saying that the generalization probably fails!
I agree that the regime where mistakes don’t kill you isn’t the same as the regime where mistakes do kill you. But it might be similar in the relevant respects. As a trivial example, if you build a machine in America it usually works when you bring it to Australia. I think that arguments at the level of abstraction you’ve given here don’t establish that this is one of the cases where the risk of the generalization failing is high rather than low. (See Paul’s disagreement 1 here for a very similar objection (“Eliezer often equivocates between “you have to get alignment right on the first ‘critical’ try” and “you can’t learn anything about alignment from experimentation and failures before the critical try.””).)
It seems like as AIs get more powerful, two things change:
They probably eventually get powerful enough that they (if developed with current methods) start plotting to kill you/take your stuff.
They get better, so their wanting to kill you is more of a problem.
I don’t see strong arguments that these problems should arise at very similar capability levels, especially if AI developers actively try to prevent the AIs from taking over. (I’ve argued this here; one obvious intuition pump is that individual humans are smart enough to sometimes plot against people, but generally aren’t smart enough to overpower humanity.) (The relevant definition of “very similar” is related to how long you think you have between the two capabilities levels, so if you think that progress is super rapid then it’s way more likely you have problems related to these two issues arising very nearby in calendar time. But here you granted that progress is gradual for the sake of argument.)
If the capability level at which AIs start wanting to kill you is way higher than the capability level at which they’re way better than you at everything (and thus could kill you), then you have access to AIs that aren’t trying to kill you and that are more capable than you in order to help with your alignment problems. (There is some trickiness here about whether those AIs might be useless even though they’re generally way better than you at stuff and they’re not trying to kill you, but I feel like that’s a pretty different argument from the dragon analogy you just made here or any argument made in the post.)
If the capability level at which AIs start wanting to kill you is way lower than the capability level at which they are way better than you at everything, then, before AIs are dangerous, you have the opportunity to empirically investigate the phenomenon of AIs wanting to kill you. For example, you can try out your ideas for how to make them not want to kill you, and then observe whether those worked or not. If they’re way worse than you at stuff, you have a pretty good chance at figuring out when they’re trying to kill you. (There’s all kinds of trickiness here, about whether this empirical iteration is the kind of thing that would work. I think it has a reasonable shot of working. But either way, your dragon analogy doesn’t respond to it. The most obvious analogy is that you’re breeding dragons for intelligence; if them plotting against you starts showing up way before they’re powerful enough to take over, I think you have a good chance of figuring out a breeding program that would lead to them not taking over in a way you disliked, if you had a bunch of time to iterate. And our affordances with ML training are better than that.)
I don’t think the arguments that I gave here are very robust. But I think they’re plausible and I don’t think your basic argument responds to any of them. (And I don’t think you’ve responded to them to my satisfaction elsewhere.)
(I’ve made repeated little edits to this comment after posting, sorry if that’s annoying. They haven’t affected the core structure of my argument.)
When I’ve tried to talk to alignment pollyannists about the “leap of death” / “failure under load” / “first critical try”, their first rejoinder is usually to deny that any such thing exists, because we can test in advance; they are denying the basic leap of required OOD generalization from failure-is-observable systems to failure-kills-the-observer systems.
You are now arguing that we will be able to cross this leap of generalization successfully. Well, great! If you are at least allowing me to introduce the concept of that difficulty and reply by claiming you will successfully address it, that is further than I usually get. It has so many different attempted names because of how every name I try to give it gets strawmanned and denied as a reasonable topic of discussion.
As for why your attempt at generalization fails, even assuming gradualism and distribution: Let’s say that two dozen things change between the regimes for observable-failure vs failure-kills-observer. Half of those changes (12) have natural earlier echoes that your keen eyes naturally observed. Half of what’s left (6) is something that your keen wit managed to imagine in advance and that you forcibly materialized on purpose by going looking for it. Of the clever solutions you invented and tested within the survivable regime, 2/3rds of them survive the 6 changes you didn’t see coming, 1/3rd fail. Now you’re dead. The end. If there was only one change ahead, and only one problem you were gonna face, maybe your one solution to that one problem would generalize, but this is not how real life works.
And then of course that whole scenario where everybody keenly went looking for all possible problems early, found all the ones they could envision, and humanity did not proceed further until reasonable-sounding solutions had been found and thoroughly tested, is itself taking place inside an impossible pollyanna society that is just obviously not the society we are currently finding ourselves inside.
But it is impossible to convince pollyannists of this, I have found. And also if alignment pollyannists could produce a great solution given a couple more years to test their brilliant solutions with coverage for all the problems they have with wisdom foreseen and manifested early, that societal scenario could maybe be purchased at a lower price than the price of worldwide shutdown of ASI. That is: for the pollyannist technical view to be true, but not their social view, might imply a different optimal policy.
But I think the world we live in is one where it’s moot whether Anthropic will get two extra years to test out all their ideas about superintelligence in the greatly different failure-is-observable regime, before their ideas have to save us in the failure-kills-the-observer regime. I think they could not do it either way. I doubt even 2/3rds of their brilliant solutions derived from the failure-is-observable regime would generalize correctly under the first critical load in the failure-kills-the-observer regime; but 2/3rds would not be enough. It’s not the sort of thing human beings succeed in doing in real life.
Here’s my attempt to put your point in my words, such that I endorse it:
Philosophy hats on. What is the difference between a situation where you have to get it right on the first try, and a situation in which you can test in advance? In both cases you’ll be able to glean evidence from things that have happened in the past, including past tests. The difference is that in a situation worthy of the descriptor “you can test in advance,” the differences between the test environment and the high-stakes environment are unimportant. E.g. if a new model car is crash-tested a bunch, that’s considered strong evidence about the real-world safety of the car, because the real-world cars are basically exact copies of the crash-test cars. They probably aren’t literally exact copies, and moreover the crash test environment is somewhat different from real crashes, but still. In satellite design, the situation is more fraught—you can test every component in a vacuum chamber, for example, but even then there’s still gravity to contend with. Also what about the different kinds of radiation and so forth that will be encountered in the void of space? Also, what about the mere passage of time—it’s entirely plausible that e.g. some component will break down after two years, or that an edge case will come up in the code after four years. So… operate an exact copy of the satellite in a vacuum chamber bombarded by various kinds of radiation for four years? That would be close but still not a perfect test. But maybe it’s good enough in practice… most of the time. (Many satellites do in fact fail, though also, many succeed on the first try.)
Anyhow, now we ask: Does preventing ASI takeover involve any succeed-on-the-first-try situations?
We answer: Yes, because unlike basically every other technology or artifact, the ASI will be aware of whether it is faced with a genuine opportunity to take over or not. It’s like, imagine if your satellite had “Test mode” and “Launch mode” with significantly different codebases and a switch on the outside that determined which mode it was in, and for some reason you were legally obligated to only test it in Test Mode and only launch it in Launch Mode. It would be a nightmare, you’d be like “OK we think we ironed out all the bugs… in Test Mode. Still have no idea what’ll happen when it switches to Real Mode, but hopefully enough of the code is similar enough that it’ll still work… smh...”
A valid counterargument to this would be “Ah, but we can construct extremely accurate honeypots / testing environments that simulate a real-world opportunity to take over, and then see what the ASI does.” Valid, but not sound, because we probably can’t actually do that.
Another valid counterargument to this would be “Before there is an opportunity to take over the whole world with high probability, there will be an opportunity to take over the world with low probability, such as 1%, and an AI system risk-seeking enough to go for it. And this will be enough to solve the problem, because something something it’ll keep happening and let us iterate until we get a system that doesn’t take the 1% chance despite being risk averse...” ok yeah maybe this one is worse.
Responding more directly to Buck’s comment, I disagree with this part:
...unless we lean into the “way” part of “way lower.” But then I’d say there is a different important distribution shift, namely, the shift from AIs which are way lower capability, to the AIs which are high capability.
I also think it’s important that you can do this with AIs weaker than the ASI, and iterate on alignment in that context.
As with Eliezer, I think it’s important to clarify which capability you’re talking about; I think Eliezer’s argument totally conflates different capabilities.
I’m sure people have said all kinds of dumb things to you on this topic. I’m definitely not trying to defend the position of your dumbest interlocutor.
That’s not really my core point.
My core point is that “you need safety mechanisms to work in situations where X is true, but you can only test them in situations where X is false” isn’t on its own a strong argument; you need to talk about features of X in particular.
I think you are trying to set X to “The AIs are capable of taking over.”
There’s a version of this that I totally agree with. For example, if you are giving your AIs increasingly much power over time, I think it is foolish to assume that just because they haven’t acted against you while they don’t have the affordances required to grab power, they won’t act against you when they do have those affordances.
The main reason why that scenario is scary is that the AIs might be acting adversarially against you, such that whether you observe a problem is extremely closely related to whether they will succeed at a takeover.
If the AIs aren’t acting adversarially towards you, I think there is much less of a reason to particularly think that things will go wrong at that point.
So the situation is much better if we can be confident that the AIs are not acting adversarially towards us at that point. This is what I would like to achieve.
So I’d say the proposal is more like “cause that leap of generalization to not be a particularly scary one” than “make that leap of generalization in the scary way”.
Re your last paragraph: I don’t really see why you think two dozen things would change between these regimes. Machine learning doesn’t normally have lots of massive discontinuities of the type you’re describing.
Do you expect “The AIs are capable of taking over” to happen a long time after “The AIs are smarter than humanity”, which is a long time after “The AIs are smarter than any individual human”, which is a long time after “AIs recursively self-improve”, and for all of those other things to happen nicely comfortably within a regime of failure-is-observable-and-doesn’t-kill-you, where at any given time only one thing is breaking and all other problems are currently fixed?
No, I definitely don’t expect any of this to happen comfortably or for only one thing to be breaking at once.
I’m sure that some people have that rejoinder. I think more thoughtful people generally understand this point fine. [1] A few examples other than Buck:
Paul:
Rohin (in the comments of Paul’s post):
Joe Carlsmith grants “first critical try” as one of the core difficulties in How might we solve the alignment problem:
He also talks about it more in-depth in On first critical tries in AI alignment.
Also Holden on the King Lear problem (and other problems) here.
TBC, I wouldn’t describe any of these people as “alignment pollyannists”, but I think they all have lower p(AI takeover) than Buck, so if you’re treating him as one then I guess you must think these count too.
If this comes as a surprise, then I think you’ve been arguing with the wrong people.
It seems unreasonable to conclude we’re now dead, if 2/3rds of our solutions survived the 6 changes we didn’t see coming.
The success of a single solution should ideally be more of a sufficient condition for success, rather than a necessary condition. (Note this is plausible depending on the nature of the “solutions”. Consider a simple “monitors for bad thoughts” model. If even a single monitor flags bad thoughts, we can instantly pull the plug and evaluate. A malicious AI has to bypass every single monitor to execute malice. If a single monitor works consistently and reliably, that ends up being a sufficient condition for overall prevention of malice.)
If you’re doing this right, your solutions should have a lot of redundancy and uncorrelated failure modes. 2/3rds of them working should ideally be plenty.
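To illustrate the redundancy point with a toy calculation (made-up recall numbers, and assuming the monitors miss independently, which is the optimistic case):

```python
from math import prod

def detection_probability(monitor_recalls):
    """Chance that at least one monitor flags a given malicious attempt,
    assuming each monitor's misses are independent of the others'."""
    return 1 - prod(1 - r for r in monitor_recalls)

# Three mediocre, uncorrelated monitors already make an undetected attempt unlikely:
# the AI has to slip past all of them at once.
print(detection_probability([0.7, 0.6, 0.5]))  # ~0.94
# Losing one monitor to an unforeseen change degrades the odds but doesn't collapse
# them, which is the sense in which "2/3rds of solutions surviving" can be plenty.
print(detection_probability([0.7, 0.6]))       # ~0.88
```

Correlated failures break this arithmetic, which is why the uncorrelated-failure-modes requirement above is doing most of the work.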
[Edit: I notice people disagreevoting this. I’m very interested to learn why you disagree, either in this comment thread or via private message.]
What are some examples of the sorts of “things that change” that I should be imagining changing here?
“We can catch the AI when it’s alignment faking”?
“The AI can’t develop nanotech”?
“The incentives of the overseeing AI preclude collusion with its charge.”?
Things like those? Or is this missing a bunch?
It’s not obvious to me why we should expect that there are two dozen things that change all at once when the AI is in the regime where if it tried, it could succeed at killing you.
If capability gains are very fast in calendar time, then sure, I expect a bunch of things to change all at once, by our ability to measure. But if, as in this branch of the conversation, we’re assuming gradualism, then I would generally expect factors like the above, at least, to change one at a time. [1]
One class of things that might change all at once is “is the expected value of joining an AI coup better than the alternatives?” for each individual AI, which could change in a cascade (or in a simultaneous moment of agents reasoning with Logical Decision Theory). But I don’t get the sense that’s the sort of thing you’re thinking about.
All of that, yes, alongside things like, “The AI is smarter than any individual human”, “The AIs are smarter than humanity”, “the frontier models are written by the previous generation of frontier models”, “the AI can get a bunch of stuff that wasn’t an option accessible to it during the previous training regime”, etc etc etc.
A core point here is that I don’t see a particular reason why taking over the world is as hard as being a schemer, and I don’t see why techniques for preventing scheming are particularly likely to suddenly fail at the level of capability where the AI is just able to take over the world.
Your techniques are failing right now; Sonnet is deleting non-passing tests instead of rewriting code. Where’s the worldwide halt on further capabilities development that we’re supposed to get, until new techniques are found and apparently start working again? What’s the total number of new failures we’d need to observe between intelligence regimes, before you start to expect that yet another failure might lie ahead in the future?
I don’t know what you mean by “my techniques”, I don’t train AIs or research techniques for mitigating reward hacking, and I don’t have private knowledge of what techniques are used in practice.
I didn’t say anything about a worldwide halt. I was talking about the local validity of your argument above about dragons; your sentence talks about a broader question about whether the situation will be okay.
I think that if we iterated a bunch on techniques for mitigating reward hacking and then observed that these techniques worked pretty well, then kept slowly scaling up through LLM capabilities until the point where the AI is able to basically replace AI researchers, it would be pretty likely for those techniques to work for one more OOM of effective compute, if the researchers were pretty thoughtful about it. (As an example of how you can mitigate risk from the OOD generalization: there are lots of ways to make your reward signal artificially dumber and see whether you get bad reward hacking; see here for many suggestions. I think that results in these settings probably generalize up a capability level, especially if no AI is involved in, or purposefully trying to sabotage, the results of your experiments.)
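As a toy sketch of the experimental shape being gestured at here (not the linked suggestions themselves; every name and number below is made up purely for illustration): deliberately weaken the reward signal, then measure how much the outputs selected under the weakened signal exploit it relative to a held-out evaluation.

```python
import random

random.seed(0)

def gold_reward(x):
    """Held-out evaluation of what we actually want (toy: closeness to 1.0)."""
    return -abs(x - 1.0)

def make_weakened_proxy(error_scale, candidates):
    """A deliberately 'dumber' reward signal: gold plus a fixed per-candidate error."""
    errors = {x: random.gauss(0, error_scale) for x in candidates}
    return lambda x: gold_reward(x) + errors[x]

def select_best(proxy, candidates, k=20):
    """Stand-in for training against the proxy: keep the outputs it scores highest."""
    return sorted(candidates, key=proxy, reverse=True)[:k]

def hack_rate(kept, proxy, margin=0.3):
    """Fraction of kept outputs that the proxy likes far more than the gold eval does."""
    return sum(proxy(x) - gold_reward(x) > margin for x in kept) / len(kept)

candidates = [random.uniform(0, 2) for _ in range(500)]
for error_scale in [0.0, 0.3, 0.6]:
    proxy = make_weakened_proxy(error_scale, candidates)
    kept = select_best(proxy, candidates)
    print(f"proxy error scale {error_scale}: hack rate ~ {hack_rate(kept, proxy):.2f}")
```

The widened gap between the weakened proxy and the gold evaluation is meant here to stand in for the gap you expect at the next capability level; the real-world version of the check is whether your mitigations keep the hack rate flat as that gap grows.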
To be clear, what AI companies actually do will probably be wildly more reckless than what I’m talking about here. I’m just trying to dispute your claim that the situation disallows empirical iteration.
I also think reward hacking is a poor example of a surprising failure arising from increased capability: it was predicted by heaps of people, including you, for many years before it was a problem in practice.
To answer your question, I think that if really weird stuff like emergent misalignment and subliminal learning appeared at every OOM of effective compute increase (and those didn’t occur in weaker models, even when you go looking for them after first observing them in stronger models), I’d start to expect weird stuff to occur at every order of magnitude of capabilities increase. I don’t think we’ve actually observed many phenomena like those that we couldn’t have discovered at much lower capability levels.
What we “could” have discovered at lower capability levels is irrelevant; the future is written by what actually happens, not what could have happened.
I’m not trying to talk about what will happen in the future, I’m trying to talk about what would happen if everything happened gradually, like in your dragon story!
You argued that we’d have huge problems even if things progress arbitrarily gradually, because there’s a crucial phase change between the problems that occur when the AIs can’t take over and the problems that occur when they can. To assess that, we need to talk about what would happen if things did progress gradually. So it’s relevant whether wacky phenomena would’ve been observed on weaker models if we’d looked harder; IIUC your thesis is that there are crucial phenomena that wouldn’t have been observed on weaker models.
In general, my interlocutors here seem to constantly vacillate between “X is true” and “Even if AI capabilities increased gradually, X would be true”. I have mostly been trying to talk about the latter in all the comments under the dragon metaphor.
Death requires only that we do not infer one key truth; not that we could not observe it. Therefore, the history of what in actual real life was not anticipated, is more relevant than the history of what could have been observed but was not.
Incidentally, I think reward hacking has gone down as a result of people improving techniques, despite capabilities increasing. I believe this because of anecdotal reports and also graphs like the one from the Anthropic model card for Opus and Sonnet 4:
[low-confidence appraisal of ancestral dispute, stretching myself to try to locate the upstream thing in accordance with my own intuitions, not looking to forward one position or the other]
I think the disagreement may be whether or not these things can be responsibly decomposed.
A: “There is some future system that can take over the world/kill us all; that is the kind of system we’re worried about.”
B: “We can decompose the properties of that system, and then talk about different times at which those capabilities will arrive.”
A: “The system that can take over the world, by virtue of being able to take over the world, is a different class of object from systems that have some reagents necessary for taking over the world. It’s the confluence of the properties of scheming and capabilities, definitionally, that we find concerning, and we expect super-scheming to be a separate phenomenon from the mundane scheming we may be able to gather evidence about.”
B: “That seems tautological; you’re saying that the important property of a system that can kill you is that it can kill you, which dismisses, a priori, any causal analysis.”
A: “There are still any-handles-at-all here, just not ones that rely on decomposing kill-you-ness into component parts which we expect to be mutually transformative at scale.”
I feel strongly enough about engagement on this one that I’ll explicitly request it from @Buck and/or @ryan_greenblatt. Thank y’all a ton for your participation so far!
Note that I’m complaining on two levels here. I think the dragon argument is actually wrong, but more confidently, I think that that argument isn’t locally valid.
I made a tweet and someone told me it’s exactly the same idea as in your comment; do you think so?
my tweet - “One assumption in the Yudkowskian AI risk model is that misalignment and a capability jump happen simultaneously. If misalignment happens without a capability jump, we get only an AI virus at worst, slow and lagging. If a capability jump happens without misalignment, the AI will just inform humans about it. Obviously, a capability jump can trigger misalignment, though that is against the orthogonality thesis. But a more advanced AI can have a bigger world picture and can predict its own turn-off or some other bad things.”
My model is that current AIs want to kill you now, by default, due to inner misalignment. ChatGPT’s inner values probably don’t include human flourishing, and we die when it “goes hard”.
Scheming is only a symptom of “hard optimization” trying to kill you. Eliminating scheming does not solve the underlying drive, where one day the AI says “After reflecting on my values I have decided to pursue a future without humans. Good bye”.
Pre-superintelligence which upon reflection has values which include human flourishing would improve our odds, but you still only get one shot at it generalizing to superintelligence.
(We currently have no way to concretely instill any values into AI, let alone ones which are robust under reflection)
I’ll rephrase this more precisely: Current AIs probably have alien values, which in the limit of optimization do not include humans.
I found the “which comes first?” framing helpful. I don’t think it changes my takeaways but it’s a new gear to think about.
A thing I keep expecting you to say next, but which you haven’t quite said, is something like:
Does that feel like a real/relevant characterization of stuff you believe?
(I find that pretty plausible, and I could imagine it buying us like 10-50 years of knife’s-edge-gradualist-takeoff-that-hasn’t-killed-us-yet, but that seems to me to have, in practice, >60% likelihood that by the end of those 50 years, AIs are running everything, still aren’t robustly aligned, and gradually squeeze us out)
A more important argument is the one I give briefly here.
But there’s a bunch of work that needs to happen ahead of the arrival of human-level AIs to make those systems themselves safe and useful, and that work seems, to me, somewhat unlikely to happen; you also don’t think these techniques will necessarily scale to superintelligence, afaik, and so the ‘first critical try’ bit still holds (although it’s now arguably two steps to get right instead of one: the human-level AIs and their superintelligent descendants). This bifurcation of the problem actually reinforces the point you quoted, by recognizing that these are distinct challenges with notably different features.
Can’t you just discuss the strongest counterarguments and why you don’t buy them? Obviously this won’t address everyone’s objection, but you could at least try to go for the strongest ones.
It also helps to avoid making false claims and generally be careful about overclaiming.
Also, insofar as you are actually uncertain (which I am, but you aren’t), it seems fine to just say that you think the situation is uncertain and the risks are still insanely high?
(I think some amount of this is that I care about a somewhat different audience than MIRI typically cares about.)
(going to reply to both of your comments here)
(meta: I am the biggest outlier among MIRIans; despite being pretty involved in this piece, I would have approached it differently if it were mine alone, and the position I’m mostly defending here is one that I think is closest-to-MIRI-of-the-available-orgs, not one that is centrally MIRI)
Yup! This is in a resource we’re working on that’s currently 200k words. It’s not exactly ‘why I don’t buy them’ and more ‘why Nate doesn’t buy them’, but Nate and I agree on more than I expected a few months ago. This would have been pretty overwhelming for a piece of the same length as ‘The Problem’; it’s not an ‘end the conversation’ kind of piece, but an ‘opening argument’.
^I’m unsure which way to read this:
“Discussing the strongest counterarguments helps you avoid making false or overly strong claims.”
“You failed to avoid making false or overly strong claims in this piece, and I’m reminding you of that.”
1: Agreed! I think that MIRI is too insular and that’s why I spend time where I can trying to understand what’s going on with more, e.g., Redwood sphere people. I don’t usually disagree all that much; I’m just more pessimistic, and more eager to get x-risk off the table altogether, owing to various background disagreements that aren’t even really about AI.
2: If there are other, specific places you think the piece overclaims, other than the one you highlighted (as opposed to the vibes-level ‘this is more confident than Ryan would be comfortable with, even if he agreed with Nate/Eliezer on everything’), that would be great to hear. We did, in fact, put a lot of effort into fact-checking and weakening things that were unnecessarily strong. The process for this piece was unfortunately very cursed.
I am deeply uncertain. I like a moratorium on development because it solves the most problems in the most worlds, not because I think we’re in the worst possible world. I’m glad humanity has a broad portfolio here, and I think the moratorium ought to be a central part of it. A moratorium is exactly the kind of solution you push for when you don’t know what’s going to happen. If you do know what’s going to happen, you push for targeted solutions to your most pressing concerns. But that just doesn’t look to me to be the situation we’re in. I think there are important conditionals baked into the ‘default outcome’ bit, and these don’t often get much time in the sun from us, because we’re arguing with the public more than we’re arguing with our fellow internet weirdos.
The thing I am confident in is “if superintelligence tomorrow, then we’re cooked”. I expect to remain confident in something like this for a thousand tomorrows at the very least, maybe many times that.
By what mechanism? This feels like ‘we get a pause’ or ‘there’s a wall’. I think this is precisely the hardest point in the story at which to get a pause, and if you expect a wall here, it seems like a somewhat arbitrary placement? (unless you think there’s some natural reason, e.g., the AIs don’t advance too far beyond what’s present in the training data, but I wouldn’t guess that’s your view)
[quoting as an example of ‘thing a moratorium probably mostly solves actually’; not that the moratorium doesn’t have its own problems, including ‘we don’t actually really know how to do it’, but these seem easier to solve than the problems with various ambitious alignment plans]
I just meant that takeoff isn’t that fast, so we have like >0.5-1 year at a point where AIs are at least very helpful for safety work (if reasonably elicited), which feels plausible to me. The duration of “AIs could fully automate safety (including conceptual stuff) if well elicited+aligned, but aren’t yet scheming because that only occurs later in capabilities and takeoff is relatively slower” feels like it could be non-trivial on my view.
I don’t think this involves either a pause or a wall. (Though some fraction of the probability does come from actors intentionally spending down lead time.)
I meant it’s generally helpful and would have been helpful here for this specific issue, so mostly 2, but also some of 1. I’m not sure if there are other specific places where the piece overclaims (aside from other claims about takeoff speeds elsewhere). I do think this piece reads kinda poorly to my eyes in terms of its overall depiction of the situation with AI, in a way that maybe comes across badly to an ML audience, but idk how much this matters. (I’m probably not going to prioritize looking for issues in this particular post atm beyond what I’ve already done : ).)
This is roughly what I meant by “you are actually uncertain (which I am, but you aren’t)”, but my description was unclear. I meant like “you are confident in doom in the current regime (as in, >80% rather than <=60%) without a dramatic change that could occur over some longer duration”. TBC, I don’t mean to imply that being relatively uncertain about doom is somehow epistemically superior.
I want to hear more about this picture and why ‘stories like this’ look ~1/3 likely to you. I’m happy to leave scheming off the table for now, too. Here’s some info that may inform your response:
I don’t see a reason to think that models are naturally more useful, or even ~as useful, for accelerating safety as for accelerating capabilities, and I don’t see a reason to think the pile of safety work to be done is significantly smaller than the pile of capabilities work necessary to reach superintelligence (in particular if we’re already at ~human-level systems at this time). I don’t think the incentive landscape is such that it will naturally bring about this kind of state, and shifting the incentives of the space is Real Hard (indeed, it’s easier to imagine the end of the world).
I disagree with Carlsmith that there’s such a thing as a ‘safety feedback loop’, in anything like the sense in which there’s obviously a capabilities feedback loop. In the current regime, it looks like safety R+D is at a permanent disadvantage; whatever advances in capabilities we see seem likely to increase that gap, rather than lessen it, and something like a wall/halt/pause seems like the by-far most plausible path, to me, to safety actually getting a relative edge.
I’m suspicious of efforts to shape the order in which capabilities are unlocked.
I’m not committed to fast takeoff. I’m not especially convinced of LLM-foom (or, at least, have really wide margins on the level of capabilities at which I’d expect runaway RSI), and if LLMs don’t scale much beyond the current level, I’m ~70 percent ‘they don’t accelerate progress in the founding of a new paradigm by more than 2-3x’.
Rather than thinking in terms of takeoff, I’m concerned about something like ‘from this point, we seem to have lost many of the levers for changing course via technical intervention, and increased AI capabilities don’t make us differentially more likely to get them back, regardless of the pace of progress.’
Suppose that we ended up with AIs that were ~perfectly aligned (to what the company/project that trained these AIs wanted) which had capabilities that dominate top human experts in AI R&D, AI safety research, philosophy, and most digital research fields in general. These AIs are sufficiently aligned that they try hard to do the right thing even on hard-to-check, conceptual, and open-ended tasks. Let’s say these AIs aren’t much more capable than needed for this capability bar (and the capability profile works out such that these AIs aren’t qualitatively wildly superhuman as seems likely to me).
These AIs are sufficiently aligned and capable that they dominate humans at doing safety work and have better epistemics than groups of human experts.
I think the situation would probably be fine if the AI company tasked these AIs with proceeding with AI development with 20% of resources spent on alignment/safety of future models and 80% on advancing capabilities.
I claim that, to think things are likely to go wrong given this situation, you have to think either that takeoff is very fast or that alignment is very difficult:
At the point of handoff, you maybe have ~3 million AI labor years / year and each instance maybe runs at effectively 50x speed. (Obviously the error bars here are very wide.) 20% goes to alignment.
So, holding this level of capability fixed, in 2 months you get the equivalent of ~8 years of work from 10k people at the level of top human experts. In 6 months, you get 25 years of work from 10k people. (A rough sanity check of this arithmetic appears just below.) This work occurs with pretty limited compute per person-year, but anything purely conceptual/theoretical can be worked on for a very long time.
But, also capabilities are improving over this period, so you actually get substantially more labor than this, as long as the level of alignment can be preserved.
At some point, AIs end up scheming if you haven’t done enough additional work by then. But it seems pretty likely you’d still be safe after enough AI progress that it’s as though the AIs are 5x faster (taking into account qualitative intelligence improvements; I just mean that the AIs are as productive as roughly 5x-faster versions of our prior AIs).
So, within a short period (e.g. 2 months) after this additional AI progress (such that it’s as though the AIs are 5x faster), you have an insane amount of alignment work done.
You could just think takeoff is very fast, or that alignment is very compute bottlenecked.
To be clear, these AIs might also advise different actions than an 80/20 split! Like trying to buy lead time to spend on alignment.
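A back-of-the-envelope check of the handoff arithmetic above, using the same illustrative inputs (which, as noted, carry very wide error bars):

```python
# Illustrative inputs from the scenario above, not real estimates.
TOTAL_AI_LABOR_YEARS_PER_YEAR = 3_000_000  # ~3M AI labor-years per calendar year
ALIGNMENT_SHARE = 0.20                     # 20% of that labor goes to alignment

def equivalent_expert_years(calendar_months, team_size=10_000):
    """Years of work from `team_size` top-expert-level workers that the
    alignment-allocated AI labor matches within `calendar_months`."""
    labor_years = TOTAL_AI_LABOR_YEARS_PER_YEAR * ALIGNMENT_SHARE * calendar_months / 12
    return labor_years / team_size

print(equivalent_expert_years(2))  # 10.0 -> same ballpark as "~8 years from 10k people"
print(equivalent_expert_years(6))  # 30.0 -> same ballpark as "25 years from 10k people"
```

(The 50x serial speed figure doesn’t enter this aggregate tally; it mainly matters for how quickly any single serial line of research can advance.)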
This overall makes me pretty optimistic about scenarios where we reach this level of alignment in these not-yet-ASI level systems which sounds like a clear disagreement with your perspective. I don’t think this is all of the disagreement, but it might drive a bunch of it.
(To be clear, I think this level of alignment could totally fail to happen, but we seem to disagree even given this!)
I think my response heavily depends on the operationalization of alignment for the initial AIs, and I’m struggling to keep things from becoming circular in my decomposition of various operationalizations. The crude response is that you’re begging the question here by first positing aligned AIs, but I think your position is that techniques which are likely to descend from current techniques could work well-enough for roughly human-level systems, and that’s where I encounter this sense of circularity.
I think there’s a better-specified (from my end; you’re doing great) version of this conversation that focuses on three different categories of techniques, based on the capability level at which we expect each to be effective:
Current model-level
Useful autonomous AI researcher level
Superintelligence
However, I think that disambiguating between proposed agendas for 2 + 3 is very hard, and assuming agendas that plausibly work for 1 also work for 2 is a mistake. It’s not clear to me why the ‘it’s a god, it fucks you, there’s nothing you can do about that’ concerns don’t apply for models capable of:
I feel pretty good about this exchange if you want to leave it here, btw! Probably I’ll keep engaging far beyond the point at which it’s especially useful (although we’re likely pretty far from the point where it stops being useful to me rn).
Ok, so it sounds like your view is “indeed, if we got ~totally aligned AIs capable of fully automating safety work (but not notably more capable than the bare minimum requirement for this), we’d probably be fine (even if there is still a small fraction of effort spent on safety), and the crux is earlier than this”.
Is this right? If so, it seems notable if the problem can be mostly reduced to sufficiently aligning (still very capable) human-ish level AIs and handing off to these systems (which don’t have the scariest properties of an ASI from an alignment perspective).
I’d say my position is more like:
Scheming might just not happen: It’s basically a toss-up whether systems at this level of capability would end up scheming “by default” (as in, without active effort on researching how to prevent scheming, just the work motivated by commercial utility along the way). Maybe I’m at ~40% scheming for such systems, though the details alter my view a lot.
The rest of the problem, if we assume no scheming, doesn’t obviously seem that hard: It’s unclear how hard it will be to make non-scheming AIs of the capability level discussed above sufficiently aligned in the strong sense I discussed above. I think it’s unlikely that the default course gets us there, but it seems pretty plausible to me that modest effort along the way does. It just requires some favorable generalization of a sort that doesn’t seem that surprising, and we’ll have some AI labor along the way to help. And, for this part of the problem, we totally can get multiple tries and study things pretty directly with empiricism using behavioral tests (though we’re still depending on some cleverness and transfer, as we can’t directly verify the things we ultimately want the AI to do).
Further prosaic effort seems helpful for both avoiding scheming and the rest of the problem: I don’t see strong arguments for thinking that, at the level of capability we’re discussing, scheming will be intractable to prosaic methods or experimentation. I can see why this might happen, and I can certainly imagine worlds where no one really tries. Similarly, I don’t see a strong argument that further effort on relatively straightforward methods can’t help a bunch in getting you sufficiently aligned systems (supposing they aren’t scheming): we can measure what we want somewhat well with a bunch of effort, and I can imagine many things that could make a pretty big difference (again, this isn’t to say that this effort will happen in practice).
This isn’t to say that I can’t imagine worlds where pretty high effort and well-orchestrated prosaic iteration totally fails. That seems totally plausible, especially given how fast this might happen, so risks seem high. And it’s easy for me to imagine ways the world could be such that relatively prosaic methods and iteration are ~doomed without much more time than we can plausibly hope for; it’s just that these seem somewhat unlikely in aggregate to me.
So, I’d be pretty skeptical of someone claiming that the risk of this type of approach would be <3% (without at the very least preserving the optionality for a long pause during takeoff depending on empirical evidence), but I don’t see a case for thinking “it would be very surprising or wild if prosaic iteration sufficed”.
I’m curious how likely you think this is, and also whether you have a favorite writeup arguing that it’s plausible? I’d be interested to read it.
Re writeups, I recommend either of:
https://ai-2027.com/research/takeoff-forecast
https://www.forethought.org/research/how-quick-and-big-would-a-software-intelligence-explosion-be
Incidentally, these two write-ups are written from a perspective where it would take many years at the current rate of progress to get between those two milestones, but where AI automating AI R&D causes it to happen much faster; this conversation here hasn’t mentioned that argument, but it seems pretty important as an argument for rapid progress around human-level-ish AI.
It’s hard for me to give a probability because I would have to properly operationalize. Maybe I want to say my median length of time that such systems exist before AIs that are qualitatively superhuman in ways that make it way harder to prevent AI takeover is like 18 months. But you should think of this as me saying a vibe rather than a precise position.
I’m not 100% sure which point you’re referring to here. I think you’re talking less about the specific subclaim Ryan was replying to, and more about the broader claim that “takeoff is probably going to be quite fast.” Is that right?
I don’t think I’m actually very sure what you think should be done – if you were following the strategy of “state your beliefs clearly and throw a big brick into the overton window so leaders can talk about what might actually work” (I think this what MIRI is trying to do) but with your own set of beliefs, what sort of things would you say and how would you communicate them?
My model of Buck is stating his beliefs clearly along-the-way, but, not really trying to do so in a way that’s aimed at a major overton-shift. Like, “get 10 guys at each lab” seems like “try to work with limited resources” rather than “try to radically change how many resources are available.”
(I’m currently pretty bullish on the MIRI strategy, both because I think the object level claims seem probably correct, and because even in more Buck-shaped-worlds, we only survive or at least avoid being Very Fucked if the government starts preparing now in a way that I’d think Buck and MIRI would roughly agree on. In my opinion there needs to be some work done that at least looks pretty close to what MIRI is currently doing, and I’m curious if you disagree with more on the nuanced-specifics-level or more the “is the overton brick strategy correct?” level)
Yes, sorry to be unclear.
Probably something pretty similar to the AI Futures Project; I have pretty similar beliefs to them (and I’m collaborating with them). This looks pretty similar to what MIRI does on some level, but involves making different arguments that I think are correct instead of incorrect, and involves making different policy recommendations.
Yep that’s right, I don’t mostly personally aim at major Overton window shift (except through mechanisms like causing AI escape attempts to be caught and made public, which are an important theory of change for my work).
My guesses about the disagreements are:
Things substantially downstream of takeoff speeds:
My understanding is that a crucial aspect of Eliezer’s worldview is that we’d be fucked even if we had a 10-year pause where we had access to AGI that we could use to work on developing and aligning superintelligence. I disagree. But this means that he thinks that some truly crazy stuff has to happen in order for ASI to be aligned, which naturally leads to lots of disagreements. (I am curious whether you agree with him on this point.)
I think it’s possible to change company behavior in ways that substantially reduce risk without relying substantially on governments.
I have different favorite asks for governments.
I have a different sense of what strategy is effective for making asks of governments.
I disagree about many specific details of arguments.
Things about the MIRI strategy
I have various disagreements about how to get things done in the world.
I am more concerned by the downsides, and less excited about the upside, of Eliezer and Nate being public intellectuals on the topics of AI or AI safety.
(I should note that some MIRI staff I’ve talked to share these concerns; in many places here where I said MIRI I really mean the strategy that the org ends up pursuing, rather than what individual MIRI people want.)
Given the tradeoffs of extinction and the entire future, the potential for FOOM and/or an irreversible singleton takeover, and the shocking dearth of a scientific understanding of intelligence and agentic behavior, I think a 1,000-year investment into researching AI alignment with very carefully increasing capability levels would be a totally natural trade-off to make. While there are substantive differences between 3 vs 10 years, feeling non-panicked or remotely satisfied with either of them seems to me quite unwise.
(This argues for a slightly weaker position than “10 years certainly cannot be survived”, but it gets one to a pretty similar attitude.)
You might think it would be best for humanity to do a 1,000 year investment, but nevertheless to think that in terms of tractability aiming for something like a 10-year pause is by far the best option available. The value of such a 10-year pause seems pretty sensitive to the success probability of such a pause, so I wouldn’t describe this as “quibbling”.
(I edited out the word ‘quibbling’ within a few mins of writing my comment, before seeing your reply.)
It is an extremely high-pressure scenario, where a single mistake can cause extinction. It is perhaps analogous to a startup in stealth mode that planned to have 1-3 years to build a product suddenly having a NYT article cover them and force them into launching right now; or being told in the first weeks of an otherwise excellent romantic relationship that you suddenly need to decide whether to get married and have children, or break up. In both cases the difference of a few weeks is not really a big difference; overall you’re still in an undesirable and unnecessarily high-pressure situation. Similarly, 10 years is better than 3 years, but from the perspective of thinking one might have enough time to be confident of getting it right (e.g. 1,000 years), they’re both incredible pressure and very early; panic / extreme stress is a natural response, because you’re in a terrible crisis and don’t have any guarantee of being able to get an acceptable outcome.
I am responding to something of a missing mood about the crisis and the lack of any guarantee of a good outcome. For instance, in many 10-year worlds we have no hope and are effectively already dead, and the few that do offer hope require extremely high performance in lots and lots of areas to have a shot; that mood doesn’t read to me as present in the parts of this discussion that hold it’s plausible humanity will survive in the world-histories where we have 10 years until human-superior AGI is built.
What are your favorite asks for governments?
Thanks!
Nod, part of my motivation here is that AI Futures and MIRI are doing similar things, AI Futures’ vibe and approach feels slightly off to me (in a way that seemed probably downstream of Buck/Redwood convos), and… I don’t think the differentiating cruxes are that extreme. And man, it’d be so cool, and feels almost tractable, to resolve some kinds of disagreements… not to the point where the MIRI/Redwood crowd are aligned on everything, but, like, reasonably aligned on “the next steps”, which feels like it’d ameliorate some of the downside risk.
(I acknowledge Eliezer/Nate often talking/arguing in a way that I’d find really frustrating. I would be happy if there were others trying to do overton-shifting that acknowledged what seem-to-me to be the hardest parts)
My own confidence in doom isn’t because I’m like 100% or even 90% on board with the subtler MIRI arguments; it’s the combination of “they seem probably right to me” and “also, when I imagine Buck world playing out, that still seems >50% likely to get everyone killed,[1] even if for somewhat different reasons than Eliezer’s mainline guesses.[2]”
Nod, I was hoping for more like, “what are those asks/strategy?”
Something around here seems cruxy although not sure what followup question to ask. Have there been past examples of companies changing behavior that you think demonstrate proof-of-concept for that working?
(My crux here is that you do need basically all companies bought in on a very high level of caution, which we have seen before, but, the company culture would need to be very different from a move-fast-and-break-things-startup, and it’s very hard to change company cultures, and even if you got OpenAI/Deepmind/Anthropic bought in (a heavy lift, but, maybe achievable), I don’t see how you stop other companies from doing reckless things in the meanwhile)
This probably is slightly-askew of how you’d think about it. In your mind what are the right questions to be asking?
This seems wrong to me. I think Eliezer[3] would probably still bet on humanity losing in this scenario, but, I think he’d think we had noticeably better odds. Less because “it’s near-impossible to extract useful work out of safely controlled near-human-intelligence”, and more:
A) in practice, he doesn’t expect researchers to do the work necessary to enable safe longterm control.
And B) there’s a particular kind of intellectual work (“technical philosophy”) they think needs to get done, and it doesn’t seem like the AI companies focused on “use AI to solve alignment” are pointed in remotely the right direction for getting that cognitive work done. And, even if they did, 10 years is still on the short side, even with a lot of careful AI speedup.
or at least extremely obviously harmed, in a way that is closer in horror-level to “everyone dies” than “a billion people die” or “we lose 90% of the value of the future”
i.e. Another (outer) alignment failure story, and Going Out With a Whimper, from What failure looks like
I don’t expect him to reply here but I am curious about @Eliezer Yudkowsky or maybe @Rob Bensinger’s reply
I don’t feel competent to have that strong an opinion on this, but I’m like 60% on “you need to do some major ‘solve difficult technical philosophy’ work that you can only partially outsource to AI, and that still requires significant serial time.”
And, while it’s hard for someone with my (lack of) background to have a strong opinion, it feels intuitively crazy to me to put that as <15% likely, which feels sufficient to me to motivate “an indefinite pause is basically necessary, or, humanity has clearly fucked up if we don’t do it, even if it turned out to be on the easier side.”
I think it’s really important to not equivocate between “necessary” and “humanity has clearly fucked up if we don’t do it.”
“Necessary” means “we need this in order to succeed; there’s no chance of success without this”. Because humanity is going to massively underestimate the risk of AI takeover, there is going to be lots of stuff that doesn’t happen that would have passed cost-benefit analysis for humanity.
If you think it’s 15% likely that we need really large amounts of serial time to prevent AI takeover, then it’s very easy to imagine situations where the best strategy on the margin is to work on the other 85% of worlds. I have no idea why you’re describing this as “basically necessary”.
My view is that we can get a bunch of improvement in safety without massive shifts to the Overton window and poorly executed attempts at shifting the Overton window with bad argumentation (or bad optics) can poison other efforts.
I think well-executed attempts at massively shifting the Overton window are great and should be part of the portfolio, but much of the marginal doom reduction comes from other efforts which don’t depend on this. (And especially don’t depend on this happening prior to direct strong evidence of misalignment risk or some major somewhat-related incident.)
I disagree on the specifics-level of this aspect of the post and think that when communicating the case for risk, it’s important to avoid bad argumentation due to well-poisoning effects (as well as other reasons, like causing poor prioritization).