Epistemic status: We really need to know. (I also posted an opinionated answer.)
There’s a well-known diagram from a tweet by Chris Olah of Anthropic:
It would be marvelous to know what the actual difficulty is, out of those five labeled difficulty categories (ideally, exactly where it lies on that spectrum). This is a major crux that explains a large part of the very wide variation in P(doom) estimates.
I think this is a vital discussion. I’m also going to link-post below to my own (rather long) opinion, which is a separate post, and also to a few other existing resources which are basically other people’s attempts to answer this question.
The first three labels on Chris’s diagram are pretty self-explanatory: the only interesting question is whether “Steam Engine” means just steam-engine safety work (which would make the scale more logarithmic, and might also make a progress-so-far comparison more natural), or also covers what one might call “steam engine capabilities” work, since the two are pretty separable.
For rocketry, I think it’s a lot more difficult to separate getting there and back with only a 10% fatality rate (as Apollo did) from getting there and back at all. So rather than trying to separate out just the safety work, I think it makes the most sense to compare against all of the rocketry (beginning at the beginning) that led up to the Apollo program, probably excluding the parallel Russian program and the more military-specific aspects of various other programs. The Apollo program itself was so enormous, however, that exactly when to start counting from is a rather small quibble.
P vs NP
We’ve only spent about 3,000 to 6,000 person-years on
Eliezer Yudkowsky, the most famous proponent of high
Shades of Impossible
I think it might be useful to provide some more granularity on “Impossible”:
Mathematically Impossible
The Orthogonality Thesis clearly predicts that it’s not actually impossible for an aligned ASI to exist, so unless that’s wrong, an impossibility proof would have to be something like demonstrating that identifying or constructing an aligned ASI, even at very low but finite success probability, is somehow beyond the reach of any process we could actually carry out.
Kardashev II
Another alternative is that alignment is not mathematically impossible-in-theory in the above sort of sense, but that it’s just vastly harder than any of the other four labeled categories — perhaps even to the level where it’s currently impossible-in-practice. If any sapient species around our current development level, even one that proceeded forward from this point with millennia of caution before finally creating ASI, would still have a negligible chance of success, and also a negligible chance of surviving the attempt, then alignment could reasonably be said to be impossible in practice. I’d similarly be very interested to hear arguments for this viewpoint. I’m nominating “Kardashev II”[4] as a name for this difficulty level: possible in theory, but something that we’re very clearly not going to manage any time soon.
Too Hard for Us
If Aligning AI is anywhere past Trivial, yet we rush ahead and build ASI anyway before we’re ready, then (unless we manage to luck into alignment-by-default) we’re likely to end up extinct or permanently disempowered. Even if we proceed cautiously, but then mess up our first critical try, we’re still all dead or enslaved. Obviously you can’t solve alignment if you’re dead.
Holding this viewpoint often says rather less about your opinion about how hard aligning AI actually is, and rather more about your opinion of the frailty and foolishness of humans and their institutions.
I would like to thank everyone who made suggestions, gave input, or commented on earlier drafts of this pair of posts, including (in alphabetical order): Hannes Whittingham, JasonB, Justin Dollman, Olivia Benoit, Scott Aaronson, Seth Herd, and the gears to ascension.
- ^
I originally wanted to post this simply as a Question post plus my long, detailed Answer, as one of (hopefully) several answers; one of my draft commenters persuaded me that quite a few LessWrong / Alignment Forum readers typically don’t click on question posts, because they expect to find only quick-take answers rather than detailed extensively-researched answers.
Nevertheless, I encourage detailed heavily researched answers to this Question, as well as quick takes.
- ^
Eliezer has very carefully not publicly stated his current P(doom), but it’s generally agreed by people familiar with him and his writings that it must be over 90%, very likely over 95% — he has, after all, written a list of 43 different reasons why we’re doomed if we don’t pause AI, and co-authored a best-selling book on the subject titled If Anyone Builds It, Everyone Dies. When interviewed on the subject in 2023 he summarized this as: “I think it gets smarter than us, I think we’re not ready, I think we don’t know what we’re doing, and I think we’re all going to die.”
He appears rather certain about this.
However, as the title of the book he co-authored suggests, it’s more accurate to describe him as having a very high conditional P(doom): he has given a TED talk in which he says that any not-DOOM credence he has is predicated on society doing something that he is publicly advocating for, but unfortunately doesn’t expect to happen on the current trajectory: enforceably pausing AI.
- ^
For anyone curious, neither my Question nor my Answer was written by an LLM (though they were proofread by an LLM, and, as I mention in my answer, I did have them do some Fermi estimates and even delve into some deep web research for me). I have been overusing the em-dash since well before transformers were invented: it’s slightly old-fashioned, a little informal, and has useful differences in emphasis from the colon, the semi-colon, or the full stop. I considered giving up the em-dash when LLMs started copying me — and I rejected it. (I am also fond of parenthetical en-dashes for emphasis – and use them on occasion in my writing – though that seems to be less of a hot-button issue.)
- ^
I am here intending this to describe a civilization capable of things like using gathered solar power to lift the contents of all the gas giants out of their gravity wells and then fusing most of the hydrogen and helium into elements more useful for building Dyson swarm computronium, such as carbon/oxygen/nitrogen, and then perhaps also doing some star-lifting: a mature Kardashev Type II civilization, not just one with a lot of solar panels orbiting the sun.
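To give a rough sense of the scale this footnote is describing, here is a minimal back-of-the-envelope sketch in Python (using round-number physical constants and a uniform-density approximation, both my own illustrative assumptions rather than anything from the post) of how long a civilization capturing the Sun’s entire output would need just to lift Jupiter’s mass out of its own gravity well:

```python
# Illustrative order-of-magnitude sketch (my assumptions, not a claim from the post):
# how long would it take to disassemble Jupiter using the Sun's full output?

G = 6.674e-11          # gravitational constant, m^3 kg^-1 s^-2
M_JUPITER = 1.898e27   # Jupiter's mass, kg
R_JUPITER = 6.99e7     # Jupiter's mean radius, m
L_SUN = 3.828e26       # solar luminosity, W

# Gravitational binding energy of a uniform-density sphere: U = 3*G*M^2 / (5*R).
# Jupiter is not uniform, so treat this as an order-of-magnitude figure only.
binding_energy = 3 * G * M_JUPITER**2 / (5 * R_JUPITER)   # joules

SECONDS_PER_YEAR = 3.156e7
years_at_full_capture = binding_energy / (L_SUN * SECONDS_PER_YEAR)

print(f"Binding energy of Jupiter ~ {binding_energy:.2e} J")
print(f"Time at 100% solar capture ~ {years_at_full_capture:.0f} years")
# ~2e36 J, i.e. roughly a couple of centuries of the Sun's total output,
# ignoring all efficiency losses and any energy gained back from fusion.
```

Even under these generous simplifications, the answer is on the order of centuries of the Sun’s entire output for a single gas giant, which is one way of seeing why this difficulty level means “possible in theory, but clearly not something we’ll manage any time soon.”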
Nobody knows. Not to within an order of magnitude. Seriously. I’ve burned a lot of time taking all the arguments seriously, and none on either side are very complete.
This is an unfortunate position to be in. The sane thing to do is slow down if you don’t know how dangerous the road ahead is. But humans are short-sighted. I expect those in power to see this rather simple fact (experts disagree wildly) and realize they should slow down, but I fear that could happen too late.
This is one of my biggest topics of interest. I read everything on it. The arguments on both sides are strong. I have found absolutely no discussion or writing that comes even close to getting to the bottom of the disagreement, or to reconciling the different models that produce the arguments on each side. Arguments on both sides are very abstract.
Nobody on either side, nor in the middle, has gotten even close to the object level. That’s because it’s a really hard problem. It requires either solving alignment-in-general, for any sort of mind (the agent foundations people have mostly, and rightly I think, given up on getting that done any time soon), or else working out what the first AGIs will actually look like and solving alignment for those in particular. But we don’t have to align all minds, just the first one(s) significantly smarter than we are. And we don’t have to align them perfectly to human values, just get them to reliably follow instructions from a merely minimally smart and kind (set of) human(s).
Figuring out how hard alignment is is a solvable problem. It gets easier as we get closer to building the first AGIs because it scopes the problem. Of course, time to do so and use that information to good effect is also terrifyingly short if we wait for anything like certainty on what the first AGI will actually look like. That’s why I’m spending a lot of time predicting AGI architecture while working on the alignment problem.
I won’t even mention my actual guess on how hard alignment is, because like everyone else’s, it’s just a guess. I’ve spent almost as much time on this as anyone, and I don’t yet have much clue. (My technical work is based on my guesses, because what else can you do?)
And I’ll just pitch here that your answer is among the best current pieces at actually working through where the abstract arguments for doom meet the reality of what’s happening now, and therefore the most likely path to the first takeover-capable AGI.
Now, that’s for technical alignment. The additional problems of societal alignment (whose values is it aligned to, and how does that all shake out?) are a different ball of wax. The two intertwine; I personally suspect that technical alignment for LLM-based AGI is pretty doable, but that the lack of societal alignment makes the overall problem devilishly difficult in practice.
The uncertainty also means, from our current perspective: NO FATE. It’s time to work, not celebrate or despair.
I don’t expect that to happen until at least some experts are saying that the danger is imminent, rather than a few years away, and probably not until we get a moderately impressive near miss that supports this claim. Currently, basically everyone still agrees that models are not existentially dangerous yet.
Racing all the way to the edge of the precipice and then slamming the brakes on at the very last moment, like we’re playing Chicken, has a very obvious failure mode — nevertheless, that’s what I’m expecting society to attempt to do. Which unfortunately means those of us taking part in the public discourse need to not be the Boy Who Cried Wolf before we’re actually within clear sight of the edge, and to restrain ourselves to posting warnings about the probability of wolves ahead.
Absolutely! I have some opinions on that too, but that seems like an area where the people who work on governance problems probably have more leverage.
Yes, my question (and my answer) are about how hard a problem technical alignment is; and as I discuss in my answer, I’m assuming that, at the moment, the best path is to work primarily on how to align LLMs, for three reasons: because ASI will probably happen sooner if it happens via LLMs than via some other architecture; because even if our ASI isn’t LLM-based or LLM-like, it may well still contain an LLM as a subcomponent or I/O device (or at least something that was trained by distilling information and behavior from humans using SGD); and because it’s generally more productive to work on something that already exists than on something still mostly hypothetical. I’m glad there are people like Steven Byrnes working on other approaches to aligning other sorts of ASI (having a range of bets is good), but I think for now putting the bulk of our effort into aligning LLMs makes sense.
This should probably be a recurring question, à la the Open Threads that the LW moderators post. But to put it in a short sentence: alignment has gotten easier, but humanity has gotten more incompetent and is unwilling to pay large costs for safety.
The reason I say alignment has gotten easier is that we have slowly started to realize that the original goal needed to be revised in part by lowering the capability target.
One of the insights of AI control is that we (probably) don’t actually need to consider aligning superintelligences in the limit of technological development, or anywhere close to that, and that the first AIs that are both massively useful and pose a non-negligible risk of AI takeover can be controlled in a way that doesn’t depend on AI alignment working.
To be clear, it’s still quite a daunting challenge, and AI companies/governments have started to be more reckless in AI deployment/progress, so it’s still easy for misalignment to occur, especially if we get more unfavorable paradigms (neuralese actually working would be the big one here, but even more prosaic continual learning/long-term memory could be a big problem for AI alignment).
My median/modal expectation, conditional on AI being able to automate all of AI R&D, is that we implement half-baked control/alignment, things are very messy and lots of balls are dropped, but we ultimately survive the ordeal because cheap strategies like satiating AI preferences end up working; however, we incur a terrifying amount of risk (for example, taking on a 1-5%, or even 10-90%, risk of AI takeover) while attempting to solve AI alignment.
I have a long, detailed, opinionated answer, which I have published as a separate post (since one of my draft readers persuaded me that some readers skip Question posts, since they don’t expect to find long, extensively-researched answers).
You should probably also go read Evan Hubinger’s excellent post Alignment remains a hard, unsolved problem, for his recent take on this question.
Eliezer Yudkowsky and Nate Soares’ bestselling book If Anyone Builds It, Everyone Dies is also an attempt to answer this question, aimed primarily at a lay audience.