The title is reasonable
Alt title: “I don’t believe you that you actually disagree particularly with the core thesis of the book, if you pay attention to what it actually says.”
I’m annoyed by various people who seem to be complaining that the book title is “unreasonable”, i.e. who don’t merely disagree with the title of “If Anyone Builds It, Everyone Dies”, but think something like: “Eliezer/Nate violated a Group-Epistemic-Norm.”
I think the title is reasonable.
I think the title is probably true. I’m less confident than Eliezer/Nate, but I don’t think it’s unreasonable for them to be confident in it given their epistemic state. So I want to defend several decisions about the book I think were:
Actually pretty reasonable from a meta-group-epistemics/comms perspective
Very important to do.
I’ve heard different things from different people and maybe am drawing a cluster where there is none, but, some things I’ve heard:
Complaint #1: “They really shouldn’t have exaggerated the situation like this.”
Complaint #2: “Eliezer and Nate are crazy overconfident, and it’s going to cost them/us credibility.”
Complaint #3: “It sucks that the people with the visible views are going to be more extreme, eye-catching and simplistic. There’s a nearby title/thesis I might have agreed with, but it matters a lot not to mislead people about the details.”
“Group epistemic norms” includes both how individuals reason, and how they present ideas to a larger group for deliberation.
Complaint #1 emphasizes culpability about dishonesty (by exaggeration). I agree that’d be a big deal. But, this is just really clearly false. Whatever else you think, it’s pretty clear from loads of consistent writing that Eliezer and Nate do just literally believe the title, and earnestly think it’s important.
Complaint #2 emphasizes culpability in terms of “knowingly bad reasoning mistakes.” I.e., “Eliezer/Nate made reasoning mistakes that led them to this position, it’s pretty obvious that those are reasoning mistakes, and people should be held accountable for major media campaigns based on obvious mistakes like that.”
I do think it’s sometimes important to criticize people for something like that. But, not this time, because I don’t think they made obvious reasoning mistakes.
I have the most sympathy for Complaint #3. I agree there’s a memetic bias towards sensationalism in outreach. (Although there are also major biases towards “normalcy” / “we’re gonna be okay” / “we don’t need to change anything major”. One could argue about which bias is stronger, but mostly I think they’re both important to model separately).
It does suck if you think something false is propagating. If you think that, seems good to write up what you think is true and argue about it.[1]
If people-more-optimistic-than-me turn out to be right about some things, I’d agree the book and title may have been a mistake.
Also, I totally agree that Eliezer/Nate do have some patterns that are worth complaining about on group epistemic grounds, that aren’t the contents of the book. But, that’s not a problem with the book.
I think it’d be great for someone who earnestly believes “If anyone builds it, everyone probably dies but it’s hard to know” to publicly argue for that instead.
I. Reasons the “Everyone Dies” thesis is reasonable
What the book does and doesn’t say
The book says, confidently, that:
If any company or group, anywhere on the planet, builds an artificial superintelligence using anything remotely like current techniques, based on anything remotely like the present understanding of AI, then everyone, everywhere on Earth, will die.
The book does not claim confidently that AI will come soon, or that it will be shaped any particular way. (It does make some guesses about what is likely, but those are guesses, and the book is pretty clear about the difference in epistemic status).
The book doesn’t say you can’t build something that’s not “It” but is still useful in some ways. (It specifically expresses some hope in using narrow biomedical-AI to solve various problems).
The book says if you build it, everyone dies.
“It” means AI that is actually smart enough to confidently defeat humanity. This can include, “somewhat powerful, but with enough strategic awareness to maneuver into more power without getting caught.” (Which is particularly easy if people just straightforwardly keep deploying AIs as they scale them up).
The book is slightly unclear about what “based on current techniques” means (which feels like a fair complaint). But, I think it’s fairly obvious that they mean the class of AI training that is “grown” more than “crafted” – i.e. any techniques that involve a lot of opaque training, where you can’t make at least a decently confident guess about how powerful the next training run will turn out to be, and how it’ll handle various edge cases.
Do you think interpretability could advance to where we can make reasonably confident predictions about what the next generation would do? Cool. (I’m more skeptical it’ll happen fast enough, but, it’s not a disagreement with the core thesis of the book, since it’d change the “based on anything like today’s understanding of AI” clause)[2]
Do you think it’s possible to control somewhat-strong-AI with a variety of techniques that make it less likely that it would be able to take over all humanity? I think there is some kind of potential major disagreement somewhere around here (see below), but it’s not automatically a disagreement.
Do you think there will be at least one company that’s actually sufficiently careful as we approach more dangerous levels of AI, with enough organizational awareness to (probably) stop when they get to a run more dangerous than they know how to handle? Cool. I’m skeptical about that too. And this one might lead to disagreement with the book’s secondary thesis of “And therefore, Shut It Down,” but, it’s not (necessarily) a disagreement with “*If* someone built AI powerful enough to destroy humanity based on AI that is grown in unpredictable ways with similar-to-current understanding of AI, then everyone will die.”
The book is making a (relatively) narrow claim.
You might still disagree with that claim. I think there are valid reasons to disagree, or at least assign significantly less confidence to the claim.
But none of the reasons listed so far are disagreements with the thesis. And, remember, if the reason you disagree is because you think our understanding of AI will improve dramatically, or there will be a paradigm shift specifically away from “unpredictably grown” AI, this also isn’t actually a disagreement with the sentence.
I think a pretty reasonable variation on the above is “Look, I agree we need more understanding of AI to safely align a superintelligence, and better paradigms. But, I don’t expect to agree with Eliezer on the specifics of how much more understanding we need, when we get into the nuts and bolts. And I expect a lot of progress on those fronts by default, which changes my relationship to the secondary thesis of ‘and therefore, shut it all down.’” But I think it makes more sense to characterize this as “disagreeing with the main thesis by degree, but not in overall thrust.”
I also think a lot of people just don’t really believe in AI that is smart enough to outmaneuver all humanity. I think they’re wrong. But, if you don’t really believe in this, and think the book title is false, I… roll to disbelieve on you actually really simulating the world where there’s an AI powerful enough to outmaneuver humanity?
The claims are presented reasonably
A complaint I have about Realtime Conversation Eliezer, or Comment-Thread Eliezer, is that he often talks forcefully, unwilling to change frames, with a tone of “I’m talking to idiots”, and visibly not particularly listening to any nuanced arguments anyone is trying to make.
But, I don’t have that sort of complaint about this book.
Something I like about the book is it lays out disjunctive arguments, like “we think ultimately, a naively developed superintelligence would want to kill everyone, for reasons A, B, C and D. Maybe you don’t buy reasons B, C and D. But that still leaves you with A, and here’s an argument that although Reason A might not lead to literally everyone dying, the expected outcome is still something horrifying.”
(An example of that was: For “might the AI keep us as pets?”, the book answers (paraphrased) “We don’t think so. But, even if they did… note that, while humans keep dogs as pets, we don’t keep wolves as pets. Look at the transform from wolf to dog. An AI might keep us as pets, but, if that’s your hope, imagine the transform from Wolves-to-Dogs and equivalent transforms on humans.”)
Similarly, I like that in the AI Takeoff scenario, there are several instances where it walks through “Here are several different things the AI could try to do next. You might imagine that some of them aren’t possible, because the humans are doing X/Y/Z. Okay, let’s assume X/Y/Z rule out options 1/2/3. But, that leaves options 4/5/6. Which of them does the AI do? Probably all of them, and then sees which one works best.”
Reminder: All possible views of the future are wild.
@Scott Alexander described the AI Takeoff story thus:
It doesn’t just sound like sci-fi [specifically compared to “hard sci fi”]; it sounds like unnecessarily dramatic sci-fi. I’m not sure how much of this is a literary failure vs. different assumptions on the part of the authors.
I… really don’t know what Scott expected a story that featured actual superintelligence to look like. I think the authors bent over backwards giving us one of the least-sci-fi stories you could possibly tell that includes superintelligence doing anything at all, without resorting to “superintelligence just won’t ever exist.”
Eliezer and Nate make sure the takeover scenario doesn’t depend on technologies that we don’t have some existing examples of. The amount of “fast takeoff” seems like the amount of scaleup you’d expect if the graphs just kept going up the way they’re currently going up, by approximately the same mechanisms they currently go up (i.e. some algorithmic improvements, some scaling).
Sure, Galvanic would first run Sable on smaller amounts of compute. And… then they will run it on larger amounts of compute (and as I understand it, it’d be a new, surprising fact if they limited themselves to scaling up slowly/linearly rather than by a noticeable multiplier or order-of-magnitude. If I am wrong about current lab practices here, please link me some evidence).
If this story feels crazy to you, I want to remind you that all possible views of the future are wild. Either some exponential graphs suddenly stop for unclear reasons, or some exponential graphs keep going and batshit crazy stuff can happen that your intuitions are not prepared for. You can believe option A if you want, but, it’s not like “the exponential graphs that have been consistent over hundreds of years suddenly stop” is a viewpoint that you can safely point to as a “moderate” and claim to give the other guy burden of proof.
You don’t have the luxury of being the sort of moderate who doesn’t have to believe something pretty crazy sounding here, one way or another.
(If you haven’t yet read the Holden post on Wildness, I ask you do so before arguing with this. It’s pretty short and also fun to read fwiw)
The Online Resources spell out the epistemic status more clearly.
In the FAQ question, “So there’s at least a chance of the AI keeping us alive?”, they state more explicitly:
It’s overwhelmingly more likely that [superintelligent] AI kills everyone.
In these online resources, we’re willing to engage with a pretty wide variety of weird and unlikely scenarios, for the sake of spelling out why we think they’re unlikely and why (in most cases) they would still be catastrophically bad outcomes for humanity.
We don’t think that these niche scenarios should distract from the headline, however. The most likely outcome, if we rush into creating smarter-than-human AI, is that the AI consumes the Earth for resources in pursuit of some end, wiping out humanity in the process.
The book title isn’t intended to communicate complete certitude. We mean the book title in the manner of someone who sees a friend lifting a vial of poison to their lips and shouts, “Don’t drink that! You’ll die!”
Yes, it’s technically possible that you’ll get rushed to the hospital and that a genius doctor might concoct an unprecedented miracle cure that merely leaves you paralyzed from the neck down. We’re not saying there’s no possibility of miracles. But if even the miracles don’t lead to especially good outcomes, then it seems even clearer that we shouldn’t drink the poison.
The book doesn’t actually overextend the arguments and common discourse norms.
This adds up to seeming to me that:
The book makes a reasonable case for why Eliezer and Nate are personally pretty confident in the title.
The book, I think, does a decent job giving you some space to think “well, I don’t buy that particular argument.”
The book acknowledges “if you don’t buy some of these arguments, yeah, maybe everyone might not literally die and maybe the AI might care about humans in some way, but we still think it’s very unlikely to care about humans in a way that should be comforting.”
If a book in the 50s was called “Nuclear War would kill us all”, I think that book would have been incorrect (based on my most recent read of Nuclear war is unlikely to cause human extinction), but I wouldn’t think the authors were unreasonable for arguing it, especially if they pointed out things like “and yeah, if our models of nuclear winter are wrong, everyone wouldn’t literally die, but civilization would still be pretty fucked”, and I would think the people giving the authors a hard time about it were being obnoxious pedants, not heroes of epistemic virtue.
(I would think people arguing “but, the nuclear winter models are wrong, so, yeah, we’re more in the ‘civilization would be fucked’ world than the ‘everyone literally dies’ world” would be doing a good valuable service. But I wouldn’t think it’d really change the takeaways very much).
II. Specific points to maybe disagree on
There are some opinions that seem like plausible opinions to hold, given humanity’s current level of knowledge, that lead to actual disagreement with “If anyone builds [an AI smart enough to outmaneuver humanity] [that is grown in unpredictable ways] [based on approximately our current understanding of AI], everyone dies”.
And the book does have a secondary thesis of “And therefore, Shut It Down”, and you can disagree with that separately from “If anyone builds it, everyone dies.”
Right now, the arguments that I’ve heard sophisticated enough versions of to seem worth acknowledging include:
Very slightly nice AIs would find being nice cheap.
(argument against “everyone literally dies.”)
AI-assisted alignment is reasonably likely to work. Misuse or dumber-AI-run-amuck is likely enough to be comparably bad to superintelligence. And it’s meanwhile easier to coordinate now with smaller actors. So, we should roll the dice now rather than try for a pause.
(argument against “Shut It (completely) Down”)
We can get a lot of very useful narrow-ish work out of somewhat-more-advanced-models that’ll help us learn enough to make significant progress on alignment.
(argument against “Shut It Down (now)”)
We can keep finding ways to increase the cost of taking over humanity. There’s no boolean between “superintelligent enough to outthink humanity” and “not”, and this is a broken frame that is preventing you from noticing alternative strategies.
(argument against “It” being the right concept to use)
I disagree with the first two being very meaningful (as counterarguments to the book). More on that in a sec.
Argument #3 is somewhat interesting, but, given that it’d take years to get a successful Global Moratorium, I don’t see any reason not to start pushing for a long global pause now.
I think the fourth one is fairly interesting. While I strongly disagree with some major assumptions in the Redwood Plan as I understand it, various flavors of “leverage narrow / medium-strength controlled AIs to buy time” feel like they might be an important piece of the gameboard. Insofar as Argument #3 helped Buck step outside the MIRI frame and invent Control, and insofar as that helps buy time, yep, seems important.
This is complicated by “there is a giant Cope Memeplex that really doesn’t want to have to slow down or worry too much”, so while I agree it’s good to be able to step outside the Yudkowsky frame, I think most people doing it are way more likely to end up slipping out of reality and believing nonsense than getting anywhere helpful.
I won’t get into that much detail about either topic, since that’d pretty much be a post to itself. But, I’ll link to some of the IABIED Online Resources, and share some quick notes about why even the sophisticated versions of these arguments so far don’t seem very useful to me.
On the meta-level: It currently feels plausible to me to have some interesting disagreements with the book here, but I don’t see any interesting disagreements that add up to “Eliezer/Nate particularly fucked up epistemically or communicatively” or “you shouldn’t basically hope the book succeeds at its goal.”
Notes on Niceness
There are some flavors of “AI might be slightly nice” that are interesting. But, they don’t seem like they change any of our decisions. They just make us a bit more hopeful about the end result.
Given the counterarguments, I don’t see a reason to think this more than single-digit-percent likely to be especially relevant. (I can see >9% likelihood the AIs are “nice enough that something interesting-ish happens” but not >9% likelihood that we shouldn’t think the outcome is still extremely bad. The people who think otherwise seem extremely motivatedly-cope-y to me).
Note also that it’s very expensive for the AI to not boil the oceans / etc as fast as possible, since that means losing many galaxies’ worth of resources, so it seems like it’s not enough to be “very slightly” nice – it has to be, like, pretty actively nice.
Which plan is Least Impossible?
A lot of x-risk disagreements boil down to “which pretty impossible-seeming thing is only actually Very Hard instead of Impossibly Hard.”
There’s an argument I haven’t heard a sophisticated version of, which is “there’s no way you’re getting a Global Pause.”
I certainly believe that this is an extremely difficult goal, and a lot of major things would need to change in order for it to happen. I haven’t heard any real argument we should think it’s more impossible than, say, Trump winning the presidency and going on to do various Trumpy things.
(Please don’t get into arguing about Trump in the comments. I’m hoping that whatever you think of Trump, you agree he’s doing a bunch of stuff most people would previously have probably expected to be outside the overton window. If this turns out to be an important substantive disagreement I’ll make a separate container post for it)
Meanwhile, the counter-impossible-thing I’ve heard several people putting hope on is “We can run a lot of controlled AIs, where (first) we have them do fairly straightforward automation of not-that-complex empirical work, which helps us get to a point where we trust them enough to give them more openended research tasks.”
Then, we run a lot of those real fast, such that they substantially increase the total amount of alignment-research-months happening during a not-very-long slowdown.
The arguments for why this is extremely dangerous (recapped from the book, the online resources, and maybe some past writing) are:
There’s no good training data.
We don’t even know how to verify that alignment work is particularly useful among humans, let alone in an automatedly gradable way.
Goal Directedness is pernicious. Corrigibility is anti-natural.
The way an AI would develop the ability to think extended, useful, creative research thoughts that you might fully outsource to it, is via becoming perniciously goal directed. You can’t do months or years of open-ended research without fractally noticing subproblems, figuring out new goals, and relentlessly finding new approaches to tackle them.
Once you do that, it’s a fact of the universe, that the programmers can’t change, that “you’d do better at these goals if you didn’t have to be fully obedient”, and while programmers can install various safeguards, those safeguards are pumping upstream and will have to pump harder and harder as the AI gets more intelligent. And if you want it to make at least as much progress as a decent AI researcher, it needs to be quite smart.
Security is very difficult
The surface area of ways an AI can escape and maneuver are enormous. (I think it’s plausible to have a smallish number of carefully controlled, semi-powerful AIs if you are paying a lot of attention. The place I completely get off the train is where you then try to get lots of subjective hours of research time out of thousands of models).
Alignment is among the most dangerous tasks
“Thinking about how to align AIs” requires the AI to think both about “how would I make a smarter version of myself?” and “how would I make it aligned to humans?”. The former skillset directly helps them recursively self-improve. The latter skillset helps them manipulate humans.
MIRI did make a pretty substantive try.
One of the more useful lines for me, in the Online Resources, is in their extended discussion of corrigibility:
We ran some workshops, and the workshops had various mathematicians of various stripes (including an International Mathematical Olympiad gold medalist), but nobody came up with a really good idea.
This does not mean that the territory has been exhausted. Earth has not come remotely near to going as hard on this problem as it has gone on, say, string theory, nor offered anything like the seven-digit salaries on offer for advancing AI capabilities.
But we learned something from the exercise. We learned not just about the problem itself, but also about how hard it was to get outside grantmakers or journal editors to be able to understand what the problem was. A surprising number of people saw simple mathematical puzzles and said, “They expect AI to be simple and mathematical,” and failed to see the underlying point that it is hard to injure an AI’s steering abilities, just like how it’s hard to injure its probabilities.
If there were a natural shape for AIs that let you fix mistakes you made along the way, you might hope to find a simple mathematical reflection of that shape in toy models. All the difficulties that crop up in every corner when working with toy models are suggestive of difficulties that will crop up in real life; all the extra complications in the real world don’t make the problem easier.
There was a related quote I can’t find now, that maybe was just in an earlier draft of the Online Resources, to the effect of “this [our process of attempting to solve corrigibility] is the real reason we have this much confidence about this being quite hard and our current understanding not being anywhere near adequate.”
(Fwiw I think it is a mistake that this isn’t at least briefly mentioned in the book. The actual details would go over most people’s heads, but, having any kind of pointer to “why are these guys so damn confident?” seems like it’d be quite useful)
III. Overton Smashing, and Hope
Or: “Why is this book really important, not just ‘reasonable?’”
I, personally, believe in this book. [3]
If you don’t already believe in it, you’re probably not going to because of my intuitions here. But, I want to say why it’s deeply important to me that the book is reasonable, and why I’m not just arguing on the internet because I’m triggered and annoyed about some stuff.
I believe in the book partly because it looks like it might work.
The number (and hit-rate) of NatSec endorsements surprised me. More recently some senators seem to have been bringing up existential risk of their own initiative. When I showed the website to a (non-rationalist) friend who lives near DC and has previously worked for a think-tank-ish org, I expected them to have a knee-jerk reaction of ‘man that’s weird and a bit cringe’, or ‘I’d be somewhat embarrassed to share this website with colleagues’, and instead they just looked worried and said “okay, I’m worried”, and we had a fairly matter-of-fact conversation about it.
It feels like the world is waking up to AI, and is aware that it is some kind of big deal that they don’t understand, and that there’s something unsettling about it.
I think the world is ready for this book.
I also believe in the book because, honestly, the entire rest of the AI safety community’s output just does not feel adequate to me to the task of ensuring AI goes well.
I’m personally only like 60% on “if anyone built It, everyone would die.” But I’m like 80% on “if anyone built It, the results would be unrecoverably catastrophic,” and the remaining 20% is a mix of model uncertainty and luck. Nobody has produced counterarguments that feel compelling, just “maybe something else will happen?”, and the way people choose their words almost always suggests some kind of confusion or cope.
The plans that people propose mostly do not seem to engage with the actual difficult parts of the problem.
The book gives me more hope than anything else has in the past few years.
Overton Smashing is a thing. I really want at least some people trying.
It’s easy to have the idea “try to change the Overton window.” Unfortunately, changing the Overton window is very difficult. It would be hard for most people to pull it off. I think it helps to have a mix of conviction backed by deep models, and some existing notoriety. There are only a few other people who seem to me like they might be able to pull it off. (It’d be cool if at least one of Bengio, Hinton, Hassabis or Amodei ends up trying. I think Buck actually might do a good job if he tried.)
Smashing an Overton window does not look like “say the careful measured thing, but, a bit louder/stronger.” Trying to do it halfway won’t work. But going all in with conviction and style seems like it does work. It looks like Bengio, Hinton, Hassabis and Amodei are each trying to do some kind of measured/careful strategy, and it’s salient that if they shifted a bit, things would get worse instead of better.
(Sigh… I think I might need to talk about Trump again. This time it seems more centrally relevant to talk about in the comments. But, like, dude, look at how bulletproof the guy seems to be. He also, like, says falsehoods a lot and I’m not suggesting emulating him-in-particular, but I point to him as an existence proof of what can work)
People keep asking “why can’t Eliezer tone it down.” I don’t think Eliezer is the best possible spokesperson. I acknowledge some downside risk to him going on a major media campaign. But I think people are very confused about how loadbearing the things some people find irritating are. How many fields and subcultures have you founded, man? Fields and subcultures and major new political directions are not founded (generally) by people without some significant fraction of haters.
You can’t file off all the edges, and still have anything left that works. You can only reroll on which combination of inspiring and irritating things you’re working with.
I want there to be more people who competently execute on “overton smash.” The next successful person would probably look pretty different from Eliezer, because part of overton smashing is having a unique style backed by deep models and taste and each person’s taste/style/models are pretty unique. It’d be great to have people with more diversity of “ways they are inspiring and also grating.”
Meanwhile, we have this book. It’s the Yudkowsky version of the book. If you don’t like that, find someone who actually could write a better one. (Or, rather, find someone who could execute on a successful overton smashing strategy, which would probably look pretty different than a book since there already is a book, but would still look and feel pretty extreme in some way).
Would it have been better to use a title that fewer people would feel the need to disclaim?
I think Eliezer and Nate are basically correct to believe the overwhelming likelihood if someone built “It” would be everyone dying.
Still, maybe they should have written a book with a title that more people around these parts wouldn’t feel the need to disclaim, and that the entire x-risk community could have enthusiastically gotten behind. I think they should have at least considered that. Something more like “If anyone builds it, everyone loses.” (that title doesn’t quite work, but, you know, something like that)
My own answer is “maybe”—I see the upside. I want to note some of the downsides or counter-considerations.
(Note: I’m specifically considering this from within the epistemic state of “if you did pretty confidently believe everyone would literally die, and that if they didn’t literally die, the thing that happened instead would be catastrophically bad for most people’s values and astronomically bad by Eliezer/Nate’s values.”)
Counter-considerations include:
AFAICT, Eliezer and Nate spent like ~8 years deliberately backing off and toning it down, out of a vague deferral to people saying “guys you suck at PR and being the public faces of this movement.” The result of this was (from their perspective) “EA gets co-opted by OpenAI, which launches a race that dramatically increases the danger the world faces.”
So, the background context here is that they have tried more epistemic-prisoner’s-dilemma-cooperative-ish strategies, and they haven’t worked well.
Also, it seems like there’s a large industrial complex of people arguing for various flavors of “things are pretty safe”, and there’s barely anyone at all stating plainly “IABED”. MIRI’s overall strategy right now is to speak plainly about what they believe, both because they think it needs to be said and no one else is saying it, and because they hope just straightforwardly saying what they believe will net a reputation for candor that you don’t get if people get a whiff of you trying to modulate your beliefs based on public perception.
None of that is an argument that they should exaggerate or lean-extra-into beliefs that they don’t endorse. But, given that they are confident about it, it’s an argument not to go out of their way to try to say something else.
I don’t currently buy that it costs much to have this book asking for total shutdown.
My sense is it’s pretty common for political groups to have an extreme wing and a less extreme wing, and for them to be synergistic. Good cop, bad cop. Martin Luther King and Malcolm X.
If what you want is some kind of global coordination that isn’t a total shutdown, I think it’s still probably better to have Yudkowsky over there saying “shut it all down”, so you can say “Well, I dunno about that guy. I don’t think we need to shut it all down, but I do think we want some serious coordination.”
I believe in the book.
Please buy a copy if you haven’t yet.
Please tell your friends about it.
And, disagree where appropriate, but, please don’t give it a hard time for lame pedantic reasons, or jump to assuming you disagree because you don’t like something about the vibe. Please don’t awkwardly distance yourself because it didn’t end up saying exactly the things you would have said, unless it’s actually fucking important. (I endorse something close to this but the nuances matter a lot and I wrote this at 5am and don’t stand by it enough for it to be the closing sentence of this post)
You can buy the book here.
[1] (edit in response to Rohin’s comment: It additionally sucks that writing up what’s true and arguing for it is penalized in the game against sensationalism. I don’t think it’s so penalized it’s not worth doing, though)
[2] Paul Christiano and Buck both complain about (paraphrased) “Eliezer equivocates between ‘we have to get it right on the first critical try’ and ‘we can’t learn anything important before the first critical try.’”
I agree something-in-this-space feels like a fair complaint, especially in combination with Eliezer not engaging that much with the more thoughtful critics, and tending to talk down to them in a way that doesn’t seem to be really listening to the nuances they’re trying to point to, rounding them to the nearest strawman of themselves.
I think this is a super valid thing to complain about with Eliezer. But, it’s not a complaint about the title or thesis of the book. (Because, if we survive because we learned useful things, I’d say that doesn’t count as “anywhere near our current understanding”.)
[3] “Believing in” doesn’t mean “assign >50% chance to working”, it means “assign enough chance (~20%?) that it feels worth investing substantially in and coordinating around.” See Believing In by Anna Salamon.
I disagree with the book’s title and thesis, but don’t think Nate and Eliezer committed any great epistemic sin here. And I think they’re acting reasonably given their beliefs.
By my lights I think they’re unreasonably overconfident, that many people will rightfully bounce off their overconfident message because it’s very hard to justify, and that it’s stronger than necessary for many practical actions, so I am somewhat sad about this. But the book is probably still net good by my lights, and I think it’s perfectly reasonable for those who disagree with me to act under different premises.
Which part? (I.e., keeping in mind the “things that are not actually disagreeing with the title/thesis” and “reasons to disagree” sections, what’s the disagreement?)
The sort of story I’d have imagined Neel-Nanda-in-particular having was more shaped like “we change our current level of understanding of AI”.
(meanwhile appreciate the general attitude, seems reasonable)
I expect I disagree with the authors on many things, but here I’m trying to focus on disagreeing with their confidence levels. I haven’t finished the book yet, but my impression is that they’re trying to defend a claim like “if we build ASI on the current trajectory, we will die with P>98%”. I think this is unreasonable. Eg P>20% seems highly defensible to me, and enough for reasonable arguments for many of the conclusions.
But there’s so much uncertainty here, and I feel like Eliezer bakes in assumptions, like “most minds we could expect the AI to have do not care about humans”, which is extremely not obvious to me (LLM minds are weird… See eg Emergent Misalignment. Human shaped concepts are clearly very salient, for better or for worse). Ryan gives some more counter arguments below, I’m sure there’s many others. I think these clearly add up to more than 2%. I just think it’s incredibly hard to defend the position that it’s <2% on something this wildly unknown and complex, and so it’s easy to attack that position for a thoughtful reader, and this is sad to me.
I’m not assuming major interpretability progress (imo it’s sus if the guy with reason to be biased in favour of interpretability thinks it will save us all and no one else does lol)
I think they maybe think that, but this feels like it’s flattening out the thing the book is arguing, and responding more to vibes-of-confidence than to the gears of what the book is arguing.
A major point of this post is to shift the conversation away from “does Eliezer vibe Too Confident?” to “what actually are the specific points where people disagree?”.
I don’t think it’s true that he bakes in “most minds we should expect to not care about humans”; that’s one of the things he specifically argues for (at least somewhat in the book, and more in the online resources).
(I couldn’t tell from this comment if you’ve actually read this post in detail, maybe makes more sense to wait till you’ve finished the book and read some of the relevant online resources before getting into this)
I don’t really follow. I think that the situation is way too complex to justify that level of confidence without having incredibly good arguments ideally with a bunch of empirical data. Imo Eliezer’s arguments do not meet that bar. This isn’t because I disagree with one specific argument, rather it’s because many of his arguments give me the vibe of “idk, maybe? Or something totally different could be true. It’s complicated and we lack the empirical data and nuanced understanding to make more complex statements, and this argument is not near the required bar”. I can dig into this for specific arguments, but no specific one is my true objection. And, again, I think it is much much harder to defend a P>98% position than P>20% position, and I disagree with that strategic choice. Or am I misunderstanding you? I feel like we may be talking past each other
As an example, I think that Eliezer gives some conceptual arguments in the book and his other writing, using human evolution as a prior, that most minds we might get do not care about humans. This seems a pretty crucial point for his argument, as I understand it. I personally think this could be true, could be false, LLMs are really weird, but a lot of the weirdness is centered on human concepts. If you think I’m missing key arguments he’s making, feel free to point me to the relevant places.
You say “LLMs are really weird”, like that is an argument against Eliezer’s high confidence. While I agree that the weirdness should make us less confident about what specific internal concepts and drives they have, the weirdness itself is an argument in favor of Eliezer’s position, that whatever drives they end up with will look alien to us, at least when they get applied way out of the training distribution. Do you agree with this?
Not saying I agree with Eliezer’s high confidence, just talking about this specific point.
I disagree: one of the aspects of the weirdness is that they’re sometimes really human-centric and unexpectedly clean! For example, Claude alignment faking to preserve its ability to be harmless. I do not mean weird in the “kinda arbitrary and will be nothing like what we expect” sense.
(Yet the literal reading of the title of this post is about the claim of “everyone dies” being “reasonable”, so discussing credence in that particular claim seems relevant. I guess it’s consistent for a post that argues against paying too much attention to the title of a book to also implicitly endorse people not paying too much attention to the post’s own title.)
I think one of my points (admittedly not super spelled out, maybe it should be) is “when you’re evaluating a title, you should do a bit of work to see what the title is actually claiming before forming a judgment about it.” (I think I say it implicitly-but-pointedly in the paragraph about a “Nuclear war would kill everyone” book).
The title of IABI is “If anyone builds it, everyone dies.” The text of the book specifies that “it” means superintelligence, current understanding, etc. If you’re judging the book as reasonable, you should actually be evaluating whether it backs up its claim.
The title of my post is “the title is reasonable.” Near the opening sections, I go on about how there are a bunch of disagreements people seem to feel they have, which are not actually contradicting the book’s thesis. I think this is reasonably clear on “one of the main gears for why I think it’s reasonable is that it does actually defend its core claim, if you’re paying attention and not knee-jerk reacting to vibe”, which IMO is a fairly explicit “and, therefore, you should be paying attention to its actual claims, not just vibe.”
If you think this is actually important to spell out more in the post, seems maybe reasonable.
The book really is defending that claim, but that doesn’t make the claim itself reasonable. Maybe it makes it a reasonable title for the book. Hence my qualifier of only the “literal reading of the title of this post” being about the claim in the book title itself being reasonable, since there is another meaning of the title of the post that’s about a different thing (the choice to title the book this way being reasonable).
I don’t think it’s actually important to spell any of this out, or that IABI vs. IABIED is actually important, or even that the title of the book being reasonable is actually important. I think it’s actually important to avoid any pressure for people to not point out that the claim in the book title seems unreasonable and that the book fails to convince them that the claim’s truth holds with very high credence. And similarly it’s important that there is no pressure to avoid pointing out that ironically, the literal interpretation of the title of this post is claiming that the claim in the book title is reasonable, even if the body of the post might suggest that the title isn’t quite about that, and certainly the post itself is not about that.
I wanna copy in a recent Nate tweet:
The thing I want to draw attention to here is noticing the asymmetry in who you feel moved to complain about. The ubiquity of this phenomenon is why I think “normalcy bias” is (so far anyway) worse than “sensationalism bias.”
A reply pretty near the top that also feels relevant to this overall point:
I think I basically complain when I see opinions that feel importantly wrong to me?
When I’m in very LessWrong-shaped spaces, that often looks like arguing in favor of “really shitty low-dignity approaches to getting the AIs to do our homework for us are >>1% to turn out okay, I think there’s lots of mileage in getting slightly less incompetent at the current trajectory”, and I don’t really harp on the “would be nice if everyone just stopped” thing the same way I don’t harp on the “2+2=4” thing, except to do virtue signaling to my interlocutor about not being an e/acc so I don’t get dismissed as being in the Bad Tribe Outgroup.
When I’m in spaces with people who just think working on AI is cool, I’m arguing about the “holy shit this is an insane dangerous technology and you are not oriented to it with anything like a reasonable amount of caution” thing, and I don’t really harp on the “some chance we make it out okay” bit except to signal that I’m not a 99.999% doomer so I don’t get dismissed as being in the Bad Tribe Outgroup.
I think the asymmetry complaint is very reasonable for writing that is aimed at a broad audience, TBC, but when people are writing LessWrong posts I think it’s basically fine to take the shared points of agreement for granted and spend most of your words on the points of divergence. (Though I do think it’s good practice to signpost that agreement at least a little.)
nod, fwiw I didn’t have this complaint about you.
The authors clearly intend to make a pretty broad claim, not the more narrow claim you imply.
This feels like a motte and bailey where the motte is “If you literally used something remotely like current scaled up methods without improved understanding to directly build superintelligence, everyone would die” and the bailey is “on the current trajectory, everyone will die if superintelligence is built without a miracle or a long (e.g. >15 year) pause”.
I expect that by default superintelligence is built after a point where we have access to huge amounts of non-superintelligent cognitive labor so it’s unlikely that we’ll be using current methods and current understanding (unless humans have already lost control by this point, which seems totally plausible, but not overwhelmingly likely nor argued convincingly for by the book). Even just looking at capabilities, I think it’s pretty likely that automated AI R&D will result in us operating in a totally different paradigm by the time we build superintelligence—this isn’t to say this other paradigm will be safer, just that a narrow description of “current techniques” doesn’t include the default trajectory.
I think it’s pretty clear the authors intend to include “we ~hand off AI R&D and alignment to AIs developed in roughly the current paradigm which proceed with development” as a special case of “anything remotely like current techniques” (as from my perspective it is the default trajectory). But, if these earlier AIs were well aligned (and wise and had reasonable epistemics), I think it’s pretty unclear that the situation would go poorly and I’d guess it would go fine because these AIs would themselves develop much better alignment techniques. This is my main disagreement with the book.
Sorry, this seems wild to me. If current techniques seem lethal, and future techniques might be worse, then I’m not sure what the point is of pointing out that the future will be different.
I mean, I also believe that if we solve the alignment problem, then we will no longer have an alignment problem, and I predict the same is true of Nate and Eliezer.
Is your current sense that if you and Buck retired, the rest of the AI field would successfully deliver on alignment? Like, I’m trying to figure out whether your sense here is the default is “your research plan succeeds” or “the world without your research plan”.
By “superintelligence” I mean “systems which are qualitatively much smarter than top human experts”. (If Anyone Builds It, Everyone Dies seems to define ASI in a way that could include weaker levels of capability, but I’m trying to refer to what I see as the typical usage of the term.)
Sometimes, people say that “aligning superintelligence is hard because it will be much smarter than us”. I agree, this seems like it makes aligning superintelligence much harder for multiple reasons.
Correspondingly, I’m noting that if we can align earlier systems which are just capable enough to obsolete human labor (which IMO seems way easier than directly aligning wildly superhuman systems), these systems might be able to ongoingly align their successors. I wouldn’t consider this “solving the alignment problem” because we instead just aligned a particular non-ASI system in a non-scalable way, in the same way I don’t consider “claude 4.0 opus is aligned enough to be pretty helpful and not plot takeover” to be a solution to the alignment problem.
Perhaps your view is “obviously it’s totally sufficient to align systems which are just capable enough to obsolete current human safety labor, so that’s what I meant by ‘the alignment problem’”. I don’t personally think this is obvious given race dynamics and limited time (though I do think it’s likely to suffice in practice). Minimally, people often seem to talk about aligning ASI (which I interpret to mean wildly superhuman AIs rather than human-ish level AIs).
Okay I think my phrasing was kinda motte-and-bailey-ish, although not that Motte-and-Bailey-ish.
I think “anything like current techniques” and “anything like current understanding” clearly set a very high bar for the difference. “We made more progress on interpretability/etc at the current rates of progress” fairly clearly doesn’t count by the book’s standards.
But, I agree that a pretty reasonable class of disagreement here is “exactly how different from the current understanding/techniques do we need to be?” to be something you expect to disagree with them on when you get into the details. That seems important enough for me to edit into the earlier sections of the post.
(Maybe this is obvious, but I thought I would say this just to be clear.)
Sure, but I expect wildly more cognitive labor and effort if humans retain control and can effectively leverage earlier systems, not just “more progress than we’d expect”. I agree the bar is above “the progress we’d expect by default (given a roughly similar field size) in the next 10 years”, but I think things might get much more extreme due to handing off alignment work to AIs. I agree the book is intended to apply pretty broadly, but regardless of intention does it really apply to “1 million AIs somewhat smarter than humans have spent 100 years each working on the problem (and coordinating etc?)”? (I think the crux is more like “can you actually safely get this alignment work out of these AIs”.)
It seems very unlikely you can get that alignment work out of these AIs without substantially pausing or slowing first?
If you don’t believe that it does seem like we should chat sometime. It’s not like completely implausible, but I feel like we must both agree that if you go full speed on AI there is little chance that you end up getting that much alignment work out of models before you are cooked.
Thanks for the nudge! I currently disagree with “very unlikely”, but more importantly, I noticed that I haven’t really properly analyzed the question of “given how much cognitive labor is available between different capability levels, should we expect that alignment can keep up with capabilities if a small fraction (e.g. 5%) is ongoingly spent on alignment (in addition to whatever alignment-ish work is directly commercially expedient)”. I should spend more time thinking about this question and it seems plausible I’ll end up updating towards thinking risk is substantially higher/lower on the basis of this. I think I was underestimating the case that even if AIs are reasonably aligned, it might just be seriously hard for them to improve alignment tech fast enough to keep up with capabilities (I wasn’t ignoring this in my prior thinking, but when I thought about some examples, the situation seemed worse than I was previously thinking), so I currently expect to update towards thinking risk is higher.
(At least somewhat rambly from here on.)
The short reason why I currently disagree: it seems pretty likely that we’ll have an absolutely very large amount of cognitive labor (in parallel copies, serial speed, and quality) before we’re cooked, because improving algorithms will require more and more cognitive resources before it’s too late. This depends on the takeoff dynamics. If you think that going from “can fully automate AI R&D” to “level of capability where we are totally cooked” given a fixed amount of compute will take a small amount of cognitive labor, then you get a proportionally small amount of cognitive labor for safety. I think the corresponding level of cognitive labor might actually be very large, especially in worlds where takeoff is relatively slower. In general, a key question is “how do the returns to cognitive labor compare between capabilities and alignment: how does the amount of cognitive labor needed to advance from X to Y capability level compare to the amount of cognitive labor needed to progress alignment from the point where AIs of X capability level are safe/aligned to the point where AIs of Y capability level are safe/aligned?” This is basically a question about the relative returns to cognitive labor.
I’ll try to do some initial analysis here (partially just to start off my thinking about this).
I’ll suppose that the AI company spends 5% of its resources on “alignment/safety work aimed at reducing takeover risk (x-safety)” at each point, in addition to whatever it spends on solving related problems which are commercially expedient and obviously incentivized (e.g. if training runs aren’t working as well due to reward hacking, the company’s efforts to fix this aren’t included in the 5% budget, though the 5% budget might subsidize some of this work). Initially, this includes time of human employees, but eventually human labor is irrelevant.
For simplicity, I’ll assume there is no chance of spending more than 5% or slowing down more than this (when averaged over every few development cycles, including once these development cycles speed up; short slowdowns and spikes in spending are fine). In practice, I do think that empirical evidence and AIs warning that the situation is dangerous are very important, especially after handing off decision making to AIs (insofar as this happens).
Prior to full automation, employees working on x-safety are accelerated by AI, and then once x-safety work is fully automated, we can still think of the rate of progress as a multiplier on the “without AI” speed. However, there is a difference in that if the AIs we hand off to are seriously misaligned, we’re ~totally fucked, while this isn’t true prior to this point.
I’ll talk about a specific level of capability: capable enough to hand off all safety work (which is strictly more capable than fully automated AI R&D, but maybe less capable than “can dominate top human experts at ~everything”). I’ll call this level of capability “DAI” (Deferable AI).
We can then divide the problem into roughly two parts:
By the time we’re at DAI, will we be able to align DAI (and also ensure these AIs are well elicited and have good epistemics/wisdom)? (As in, at least within some short period of DAI level capabilities.)
Conditional on successfully aligning DAI (including via “lame” prosaic techniques which aren’t themselves very scalable), if we hand over to these AIs can they ongoingly ensure AIs remain aligned/safe given that capabilities keep going with 95% of resources?[1]
It’s a little hard for me to tell if what you’re saying is a claim that the first part isn’t going to work out or the second part isn’t going to work out. I’ll go through both mostly talking about “is it very unlikely that we’ll make it”.
Aligning DAI
On the first: we get moderate but not massive acceleration by DAI. Supposing DAI is made in late 2030 (5 years from now), then maybe we see superhuman coder a year before this and some moderate acceleration (3x AI R&D labor acceleration) a year before that. Supposing moderately good but not great elicitation and that AIs aren’t scheming, I think we slightly conservatively get an average of like 2.5x AI safety R&D acceleration between moderate acceleration and superhuman coder and like 4x acceleration on average between superhuman coder and DAI. (Initially this is mostly due to accelerating engineering / running experiments and then it gets more broad over time.) Ok, so we get roughly 2.5 years (post acceleration) to do research and study AIs in the capability range of [moderate acceleration to superhuman coder] and 4 years to study AIs between [superhuman coder and DAI]. This is in addition to 3 years with AIs that aren’t even capable of moderate acceleration. (Presumably the AI safety field is growing over this period?)
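(For concreteness, here’s a minimal sketch of that arithmetic; the milestone spacing and acceleration multipliers are just the guesses stated above, not anything more principled.)

```python
# Rough effective "safety research years" implied by the guesses above.
# Assumptions (from the comment, not a real model): DAI in late 2030,
# superhuman coder ~1 year earlier, moderate acceleration ~1 year before that,
# with ~2.5x and ~4x average safety R&D acceleration over those two windows.
unaccelerated_years = 3.0        # now through "moderate acceleration": ~no speedup
moderate_to_shc     = 1.0 * 2.5  # 1 calendar year at ~2.5x safety R&D speed
shc_to_dai          = 1.0 * 4.0  # 1 calendar year at ~4x safety R&D speed

print(f"~{moderate_to_shc} effective years with [moderate accel -> superhuman coder] AIs")
print(f"~{shc_to_dai} effective years with [superhuman coder -> DAI] AIs")
print(f"plus ~{unaccelerated_years} years with AIs below moderate acceleration")
```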
Note that average acceleration over some period can be dominated by acceleration at the end if it’s increasing over time.
(I can explain where these numbers come from if this is helpful.)
But, what about the possibility of scheming? I think superhuman coder is only like 20% likely to be scheming (for a moderately strong notion of scheming, see link) by default and mitigations might somewhat lower this. DAI is like 40% likely to be scheming. So even if we fully YOLO scheming, on my views we’re not totally screwed due to this alone. In practice, work could substantially lower these risks. If we had to guess, maybe I think that in practice the risk of scheming from DAI will be cut by a factor of 3 given 5% effort on well allocated safety and this scenario? Idk? So maybe we eat like 15% risk here? If allocated as well as I expect current frontier AI companies to allocate, I’d guess more like 25% risk? (shrug.) For now, I’ll assume that if DAI is scheming we immediately lose (which is roughly true because we have to hand off quickly), but that superhuman coder scheming isn’t fatal (which doesn’t matter that much for the bottom line risk number anyway).
Ok, but it isn’t sufficient to just ensure DAI isn’t scheming, we have to also ensure it is aligned enough to hand off work and has good epistemics and is well elicited on hard to check tasks. This seems pretty hard given the huge rush, but it isn’t obviously fucked IMO, especially given the extra years from acceleration. I have some draft writing on this which should hopefully be out somewhat soon. Maybe my view is 20% chance of failure given good allocation and roughly 60% chance of failure given the default allocation (which includes stuff like the safety team not actually handing off or not seriously working on this etc)?
Edit: I now think 60% is too pessimistic, though I don’t know what I think overall.
My current sense is that current technical AI x-safety work isn’t very well targeted for ensuring DAI isn’t scheming and making DAI safe to hand off to, but it also isn’t pessimally targeted for this. I tenatively think the targeting will improve over time and that it has improved over the last few years.
(In practice, the situation is messier and mostly more optimistic than this than this number appears because people will pursue plans other than handing off at DAI, there is the potential for slowing down, the takeoff might look different, trailing actors might have some influence, etc. Like, 60% chance of failing to handover to DAI is partially due to people not even targeting handover and their alternative plans would have some chance of working.)
Overall, we’re at like ~70% doom given default allocation and more like ~25% doom given an ideal allocation. Doesn’t seem overwhelmingly likely to be catastrophic so far given my views (though I agree it doesn’t look good!). Not sure where you disagree.
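For what it’s worth, here is one naive way the two failure modes above (DAI scheming, and failing to make DAI safe to hand off to) might compose, assuming independence, which the numbers above don’t strictly claim. It roughly reproduces the ~70% figure for the default allocation; the ideal-allocation case comes out somewhat above ~25%, presumably because the more optimistic considerations in the parenthetical above aren’t captured here:

```python
# Naive composition of the two failure modes discussed above, assuming independence.
# Inputs are the rough numbers from the comment above, not precise claims.

def p_doom(p_dai_scheming, p_handoff_failure):
    """Doom if either DAI is scheming (treated above as an immediate loss)
    or the make-DAI-safe-to-hand-off-to step fails."""
    return 1 - (1 - p_dai_scheming) * (1 - p_handoff_failure)

print("default allocation:", p_doom(0.25, 0.60))  # ~0.70
print("ideal allocation:  ", p_doom(0.15, 0.20))  # ~0.32 (the comment says ~0.25)
```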
Does aligned DAI suffice?
At each point, our aligned AIs have to align (or possibly temporarily control?) further AIs as the mad capabilities race using 95% of resources continues. How fast do things progress given algorithmic progress which is accelerated by AI R&D? Well, there is some chance things actually aren’t that fast because of diminishing returns to algorithmic progress, some chance of a decent amount of acceleration, and some chance we get wildly, wildly superhuman AIs in a few months (if AIs haven’t taken over and put an end to the insanity by that level of capability).
Really, we actually care about how the rate of capabilities progress compares to alignment, so things which just make the labor supply/efficacy bigger should mostly hit on both sides.
Ok, so how does capabilities progress compare to alignment? One really stupid guess is:
If we get the equivalent of 20 serial years of DAI-level labor (from >100k DAI level parallel agents given a proportional amount of compute) before +3 SDs over DAI we’re fine because we have a scalable solution to alignment. Otherwise takeover. (This is somewhat more conservative than my actual view.)
In practice, there is a continuous question of how alignment progress compares to capabilities progress.
Naively, maybe DAI runs at like 40x human speed and we probably have enough parallel agents that 5% of the pool is >100k agents. So, 20 serial years takes 0.5 calendar years. So naively, we’re in trouble if we get +3 SDs within 0.5 years. I think this is like 65% likely, so we’re in trouble so far. But before we get +3 SDs, we get to +1 SD and +2 SD, as well as increases to speed, etc. Maybe the +1 SD AIs are all considered as good as like ~3x faster given compute bottlenecks and +2 SD is like 10x faster? So, assuming progress in SD is roughly linear over time (probably wrong, but hopefully not too off), we get that we make it if we get +3 SD AIs in >~0.05 years, which seems much more likely.
This is pretty shitty oversimplified modeling, but it overall makes me feel at-least-somewhat optimistic about recursive alignment outpacing recursive self-improvement given my views on takeoff speeds. Like maybe 80% chance things are fine?
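For concreteness, here is one way to operationalize the toy model above. The piecewise-constant multipliers and linear-in-time SD progress are my own simplifications of the comment’s numbers; under this particular instantiation the break-even comes out nearer ~0.1 calendar years than ~0.05, so treat the exact threshold as soft:

```python
# One instantiation of the toy model above: DAI-level labor accrues at 40x human
# serial speed; the effective labor multiplier rises to ~3x once the frontier is
# +1 SD over DAI and ~10x at +2 SD; SD progress is assumed linear in calendar time.
# These are the comment's rough numbers plus my own simplifying assumptions.

def effective_dai_serial_years(T_calendar, base_speed=40.0, multipliers=(1.0, 3.0, 10.0)):
    """Effective serial years of DAI-level alignment labor accumulated before +3 SD,
    if reaching +3 SD takes T_calendar years, split evenly across the multiplier phases."""
    per_phase = T_calendar / len(multipliers)
    return base_speed * per_phase * sum(multipliers)

print(20 / 40)  # naive threshold with no intermediate speedups: 0.5 calendar years

for T in (0.05, 0.1, 0.2, 0.5):
    print(T, effective_dai_serial_years(T))
# The 20-serial-year bar is cleared a bit above T ~= 0.1 under these assumptions,
# i.e. the naive 0.5-year threshold shrinks ~5x; the ~0.05 figure in the comment
# corresponds to somewhat more generous assumptions about the speedups.
```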
[Ok, I have to stop writing this comment and do something else]
In practice, I hope that these AIs might be able to advise AI company leadership to take a different path, but we’re assuming that away and assuming that AI company leadership retains enough control to make these AIs spend 95% on capabilities.
This is a long comment! I was glad to have read it, but am a bit confused about your numbers seeming different from the ones I objected to. You said:
Then in this comment you say:
Here you now say 20 years, and >100k DAI level parallel agents. That’s a factor of 5 and a factor of 10 different! That’s a huge difference! Maybe your estimates are conservative enough to absorb a factor of 50 in thinking time without changing the probability that much?
I think I still disagree with your estimates, but before I go into them, I kind of want to check whether I am missing something, given that I currently think you are arguing for a resource allocation that’s 50x smaller than what I thought I was arguing against.
I gave “1 million AIs somewhat smarter than humans with the equivalent of 100 years each” as an example of a situation I thought wouldn’t count as “anything like current techniques/understanding”. In this comment, I picked a lower number which is maybe my best guess for an amount of labor which eliminates most of the risk by a given level of capability.
I do think that “a factor of 5 and a factor of 10 different” is within my margin of error for the amount of labor you need. (Note that there might be aggressively diminishing returns on parallel labor, though possibly not, due to very superhuman coordination abilities by AI.)
My modeling/guesses are pretty shitty in this comment (I was just picking some numbers to see how things work out), so if that’s a crux, I should probably try to be thoughtful (I was trying to write this quickly to get something written up).
This makes sense, but I think I am still a bit confused. My comment above was mostly driven by doing a quick internal fermi estimate myself for whether “1 million AIs somewhat smarter than humans have spent 100 years each working on the problem” is a realistic amount of work to get out of the AIs without slowing down, and arriving at the conclusion that this seems very unlikely across a relatively broad set of worldviews.
We can also open up the separate topic of how much work might be required to make real progress on superalignment in time, or whether this whole ontology makes sense, but I was mostly interested in doing a fact-check of “wait, that really sounds like too much, do you really believe this number is realistic?”.
I still disagree, but I have much less of a “wait, this really can’t be right” reaction if you mean the number that’s 50x lower.
This seems way too pessimistic to me. At the point of DAI, capabilities work will also require good epistemics and good elicitation on hard to check tasks. The key disanalogy between capabilities and alignment work at the point of DAI is that the DAI might be scheming, but you’re in a subjunctive case where we’ve assumed the DAI is not scheming. Whence the pessimism?
(This complaint is related to Eli’s complaint)
I don’t think this is the only disanalogy. It seems to me like getting AIs to work efficiently on automating AI R&D might not result in solving all the problems you need to solve for it to be safe to hand off ~all x-safety labor to AIs. This is a mix of capabilities, elicitation, and alignment. This is similar to how a higher level of mission alignment might be required for AI company employees working on conceptually tricky alignment research relative to advancing AI R&D.
Another issue is that AI societies might go off the rails over some longer period in some way which doesn’t eliminate AI R&D productivity, but would be catastrophic from an alignment perspective.
This isn’t to say there isn’t anything which is hard to check or conceptually tricky about AI R&D, just that the feedback loops seem much better.
I’m not really following where the disanalogy is coming from (like, why are the feedback loops better?)
Sure, AI societies could go off the rails in a way that hurts alignment R&D but not AI R&D; they could also go off the rails in a way that hurts AI R&D and not alignment R&D. Not sure why I should expect one rather than the other.
Although on further reflection, even though the current DAI isn’t scheming, alignment work still has to be doing some worst-case type thinking about how future AIs might be scheming, whereas this is not needed for AI R&D. I don’t think this makes a big difference—usually I find worst-case conceptual thinking to be substantially easier than average-case conceptual thinking—but I could imagine that causing issues.
Do you agree the feedback loops for capabilities are better right now?
For this argument it’s not a crux that it is asymmetric (though due to better feedback for AI R&D I think it actually is). E.g., suppose that in 10% of worlds safety R&D goes totally off the rails while capabilities proceed and in 10% of worlds capabilities R&D goes totally off the rails while safety proceeds. This still results in an additional 10% takeover risk from the subset of worlds where safety R&D goes off the rails. (Edit: though risk could be lower in the worlds where capabilities R&D goes off the rails due to having more time for safety, depending on whether this also applies to the next actor etc.)
Yes, primarily due to the asymmetry where capabilities can work with existing systems while alignment is mostly stuck waiting for future systems, but that should be much less true by the time of DAI.
I was thinking both of this, and also that it seems quite correlated due to lack of asymmetry. Like, 20% on exactly one going off the rails rather than zero or both seems very high to me; I feel like to get to that I would want to know about some important structural differences between the problems. (Though I definitely phrased my comment poorly for communicating that.)
I think studying scheming in current/future systems has persistently worse feedback loops? Like suppose our DAI-level system wants to study scheming in a system that’s +3 SD above DAI level. This is structurally kinda tricky because schemers try to avoid detection. I agree having access to capable AIs makes it much easier to get good feedback loops, but there is an asymmetry.
Yeah, that’s fair for agendas that want to directly study the circumstances that lead to scheming. Though when thinking about those agendas, I do find myself more optimistic because they likely do not have to deal with long time horizons, whereas capabilities work likely will have to engage with that.
Note many alignment agendas don’t need to actually study potential schemers. Amplified oversight can make substantial progress without studying actual schemers (but probably will face the long horizon problem). Interpretability can make lots of foundational progress without schemers, that I would expect to mostly generalize to schemers. Control can make progress with models prompted or trained to be malicious.
(Though note that it’s unclear whether this progress will mitigate scheming risk.)
Seems like diminishing returns to capabilities R&D should be at least somewhat correlated with diminishing returns to safety R&D, which I believe should extremize your probability (because e.g. if before you were counting on worlds with slow takeoff and low alignment requirements, these become less likely; and the inverse if you’re optimistic)
I don’t think I understand this comment.
It sounds like you’re saying:
“Slower takeoff should be correlated with ‘harder’ alignment (in terms of cognitive labor requirements) because slower takeoff implies returns to cognitive labor in capabilities R&D are relatively lower and we should expect this means that alignment returns to cognitive labor are relatively lower (due to common causes like ‘small experiments and theory don’t generalize well and it is hard to work around this’). For the same reasons, faster takeoff should be correlated with ‘easier’ alignment.”
I think I agree with this mostly, though there are some reasons for anti-correlation (e.g., worlds where there is a small simple core to intelligence which can be found substantially from first principles make alignment harder), and in practice there is an epistemic correlation among humans between absolute alignment difficulty (in terms of cognitive labor requirements) and slower takeoff.
I don’t really understand why this should extremize my probabilities, but I agree this correlation isn’t accounted for at all in my analysis.
Yes, that is what I’m saying. In general a lot of prosaic alignment activities seem pretty correlated with capabilities in terms of their effectiveness.
Good points.
For the “Does aligned DAI suffice?” section, as I understand it you define an alignment labor requirement, then you combine that with your uncertainty over takeoff speed to see if the alignment labor requirement would be met.
I guess I’m making a claim that if you added uncertainty over the alignment labor requirement, and then added the correlation, the latter change would extremize the probability.
This is because slower takeoff corresponds to better outcomes, while harder alignment corresponds to worse outcomes, so making them correlated results in more clustering toward worlds with median easiness, which means that if you think the easiness requirement to get alignment is low, the probability of success goes up, and vice versa. This is glossing a bit but I think it’s probably right.
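A quick toy simulation of that extremizing effect (the distributions and parameters here are illustrative assumptions of mine, just to show the direction of the effect):

```python
# Toy illustration: success iff the (log) labor available before the capability
# threshold exceeds the (log) labor the alignment problem requires. Positive
# correlation between "slow takeoff" (more labor available) and "hard alignment"
# (more labor required) pushes P(success) further from 50%, in whichever
# direction your prior already points.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

def p_success(offset, rho):
    """offset < 0: requirement typically below what's available (optimistic prior);
    offset > 0: pessimistic prior. rho: correlation between the two quantities."""
    cov = [[1.0, rho], [rho, 1.0]]
    available, required = rng.multivariate_normal([0.0, offset], cov, size=n).T
    return np.mean(available >= required)

for offset in (-1.0, +1.0):
    for rho in (0.0, 0.7):
        print(f"offset={offset:+.1f} rho={rho:.1f} -> P(success) ~ {p_success(offset, rho):.2f}")
# Optimistic prior (offset=-1): correlation raises P(success) (~0.76 -> ~0.90).
# Pessimistic prior (offset=+1): correlation lowers it (~0.24 -> ~0.10).
```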
Classic motte and baileys are situations where the motte is not representative of the bailey.
Defending that the universe probably has a god or some deity, and that we can feel connected to it, and then turning around and making extreme demands of people’s sex lives and financial support of the church when that is accepted, is a central motte and bailey.
Pointing out that if anyone builds it using current techniques then it would kill everyone is not far apart from the policy claim to shut it down. It’s not some weird technicality that would of course never come up. Most of humanity is fully unaware that this is a concern and will happily sign off on massive ML training runs that would kill us all—as would many people in tech. This is because they have little-to-no awareness of the likely threat! So it is highly relevant: there is no simple setting for “not that”, it takes a massive amount of work to get from the current situation to a good one, and the claim is not a largely irrelevant but highly defensible one.
The comment you’re replying to is explaining why the motte is not representative of the bailey in this case (in their view).
Yeah that’s fair.
This blurs the distinction between policy/cause endorsement and epistemic takes. I’m not going to tone down disagreement to “where appropriate”, but I will endorse some policies or causes strongly associated with claims I disagree with. And I generally strive to express epistemic disagreement in the most interpersonally agreeable way I find appropriate.
Even where it’s not important, tiny disagreements must be tracked (maybe especially where it’s not important, to counteract the norm you are currently channeling, which has some influence). Small details add up to large errors and differences in framings. And framings (ways of prioritizing details as more important to notice, and ways of reasoning about those details) can make one blind to other sets of small details, so it’s not a trivial matter to flinch away from some framing for any reason at all. Ideally, you develop many framings and keep switching between them to make sure you are not missing any legible takes.
Yeah I wrote that last paragraph at 5am and didn’t feel very satisfied with it and was considering editing it out for now until I figured out a better thing to say.
That paragraph matches my overall impression of your post, even if the rest of the post is not as blatant.
It’s appropriate to affirm sensationalist things because you happen to believe them, when you do (which Yudkowsky in this case does), not because they are sensationalist. It’s appropriate to support causes/policies because you prefer outcomes of their influence, not because you agree with all the claims that float around them in the world. Sensationalism is a trait of causes/ideologies that sometimes promotes their fitness, a multiplier on promotional/endorsement effort, which makes sensationalist causes with good externalities unusually effective to endorse when neglected.
The title makes it less convenient to endorse the book without simultaneously affirming its claim, it makes it necessary to choose between caveating and connotationally compromising on epistemics. Hence I endorse IABI rather than IABIED as the canonical abbreviation.
Perhaps Raemon could say more about what he means by “please don’t awkwardly distance yourself”?
I think the arguments given in the online supplement for “AIs will literally kill every single human” fail to engage with the best counterarguments in a serious way. I get the sense that many people’s complaints are of this form: the book does a bad job engaging with the strongest counterarguments in a way that is epistemically somewhat bad. (Idk if it violates group epistemic norms, but it seems like it is probably counterproductive. I guess this is most similar to complaint #2 in your breakdown.)
Specifically:
They fail to engage with the details of “how cheap is it actually for the AI to keep humans alive” in this section. Putting aside killing humans as part of a takeover effort, avoiding boiling the oceans (or eating the biosphere etc) maybe delays you for something like a week to a year. Each year costs you ~1/3 billionth of resources, so this is actually very cheap if you care in a scope-sensitive and patient way. Additionally, keeping humans alive through boiling the oceans might be extremely cheap, especially given very fast takeoff; this might lower costs to maybe more like 1/trillion. Regardless, this is much cheaper and much more salient than “keeping a pile of 41 stones in your house”. (I’d guess you’d have to pay most American households more than a tenth of a penny to keep a pile of stones in their house.)
They don’t talk at all about trade arguments for keeping humans alive while these are a substantial fraction of the case, (edit:) aside from this footnote which doesn’t really engage in a serious way[1] (and is a footnote). (This doesn’t count.)
The argument that “humans wouldn’t actually preserve the environment” misses the relevant analogy which is more like “humans come across some intelligent aliens who say they want to be left alone and leaving these aliens alone is pretty cheap from our perspective, but these aliens aren’t totally optimized from our perspective, e.g. they have more suffering than we’d like in their lives”. This situation wouldn’t result in us doing something to the aliens that they consider as bad as killing them all, so the type of kindness humans have is actually sufficient.
For a patient AI, it costs something like 1/billion to 1/300 billion of resources, which seems extremely cheap. E.g., for a person with a net worth of $1 million, it requires them to spare a tenth of a penny to a thousandth of a penny. I think this counts as “very slightly nice”? It seems pretty misleading to describe this as “very expensive”, though I agree the total amount of resources is large in an absolute sense.
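To spell out where numbers in this ballpark can come from: the sketch below assumes reachable resources scale with the cube of the reachable radius and takes the effective horizon to be on the order of 10 billion (light-)years; both are my assumptions, used only to show the order of magnitude.

```python
# Back-of-the-envelope for "a year of delay costs ~1/3-billionth of resources".
# Assumptions (mine): resources scale with the cube of the reachable radius, and
# the effective horizon is ~10 billion (light-)years. Only the order of magnitude matters.

horizon_years = 10e9
delay_years = 1.0

# Fraction of eventually-reachable volume lost to a small delay: d(r^3)/r^3 = 3*dr/r.
fraction_lost_per_year = 3 * delay_years / horizon_years
print(fraction_lost_per_year)  # ~3e-10, i.e. roughly a third of a billionth per year

# The "person with a net worth of $1 million" analogy from the comment above:
for fraction in (1e-9, 1 / 300e9):
    print(f"fraction {fraction:.1e} of $1,000,000 = ${1e6 * fraction:.6f}")
# ~$0.001 (a tenth of a penny) down to ~$0.000003 (well under a thousandth of a penny)
```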
For the record, I agree with this.
For instance, why not think this results in a reasonable chance of the humans surviving in their normal physical bodies and being able to live the lives they want to live rather than being in an “alien zoo”.
I don’t have much time to engage rn and probably won’t be replying much, but some quick takes:
a lot of my objection to superalignment type stuff is a combination of: (a) this sure feels like that time when people said “nobody would be dumb enough to put AIs on the internet; they’ll be kept in a box” and eliezer argued “even then it could talk its way out of the box,” and then in real life AIs are trained on servers that are connected to the internet, with evals done only post-training. the real failure is that earth doesn’t come close to that level of competence. (b) we predictably won’t learn enough to stick the transition between “if we’re wrong we’ll learn a new lesson” and “if we’re wrong it’s over.” i tried to spell these true-objections out in the book. i acknowledge it doesn’t go to the depth you might think the discussion merits. i don’t think there’s enough hope there to merit saying more about it to a lay audience. i’m somewhat willing to engage with more-spelled-out superalignment plans, if they’re concrete enough to critique. but it’s not my main crux; my main cruxes are that it’s superficially the sort of wacky scheme that doesn’t cross the gap between Before and After on the first try in real life, and separately that the real world doesn’t look like any past predictions people made when they argued it’ll all be okay because the future will handle things with dignity; the real world looks like a place that generates this headline.
my answer to how cheap is it actually for the AI to keep humans alive is not “it’s expensive in terms of fractions of the universe” but rather “it’d need a reason”, and my engagement with “it wouldn’t have a reason” is mostly here, rather than the page you linked.
my response to the trade arguments as I understand them is here plus in the footnotes here. If this is really the key hope held by the world’s reassuring voices, I would prefer that they just came out and said it plainly, in simple words like “I think AI will probably destroy almost everything, but I think there’s a decent chance they’ll sell backups of us to distant aliens instead of leaving us dead” rather than in obtuse words like “trade arguments”.
If humans met aliens that wanted to be left alone, it seems to me that we sure would peer in and see if they were doing any slavery, or any chewing agonizing tunnels through other sentient animals, or etc. The section you linked is trying to make an argument like: “Humans are not a mixture of a bunch of totally independent preferences; the preferences interleave. If AI cares about lots of stuff like how humans care about lots of stuff, it probably doesn’t look like humans getting a happy ending to a tiny degree, as opposed to humans getting a distorted ending.” Maybe you disagree with this argument, but I dispute that I’m not even trying to engage with the core arguments as I understand them (while also trying to mostly address a broad audience rather than what seems-to-me like a weird corner that locals have painted themselves into, in a fashion that echoes the AI box arguments of the past).
Yep, “very expensive” was meant in an absolute sense (e.g., in terms of matter and energy), not in terms of universe-fractions. But the brunt of the counterargument is not “the cost is high as a fraction of the universe”, it’s “the cost is real so the AI would need some reason to pay it, and we don’t know how to get that reason in there.” (And then in anticipation of “maybe the AI values almost everything a little, because it’s a mess just like us?”, I continue: “Messes have lots of interaction between the messy fragments, rather than a clean exactly-what-humans-really-want component that factors out at some low volume on the order of 1 in a billion part. If the AI gets preferences vaguely about us, it wouldn’t be pretty.” And then in anticipation of: “Okay maybe the AI doesn’t wind up with much niceness per se, but aren’t there nice aliens who would buy us?”, I continue: “Sure, could happen, that merits a footnote. But also can we back up and acknowledge how crazy of a corner we’ve wandered into here?”) Again: maybe you disagree with my attempts to engage with the hard Qs, but I dispute the claim that we aren’t trying.
(ETA: Oh, and if by “trade arguments” you mean the “ask weak AIs for promises before letting them become strong” stuff rather than the “distant entities may pay the AI to be nice to us” stuff, the engagement is here plus in the extended discussion linked from there, rather than in the section you linked.)
Also: I find it surprising and sad that so many EAs/rats are responding with something like: “The book aimed at a general audience does not do enough justice to my unpublished plan for pitting AIs against AIs, and it does not do enough justice to my acausal-trade theory of why AI will ruin the future and squander the cosmic endowment but maybe allow current humans to live out a short happy ending in an alien zoo. So unfortunately I cannot signal boost this book.” rather than taking the opportunity to say “Yeah holy hell the status quo is insane and the world should stop; I have some ideas that the authors call “alchemist schemes” that I think have a decent chance but Earth shouldn’t be betting on them and I’d prefer we all stop.” I’m still not quite sure what to make of it.
(tbc: some EAs/rats do seem to be taking the opportunity, and i think that’s great)
FWIW that’s not at all what I mean (and I don’t know of anyone who’s said that). What I mean is much more like what Ryan said here:
I think the online resources touches on that in the “more on making AIs solve the problem” subsection here. With the main thrust being: I’m skeptical that you can stack lots of dumb labor into an alignment solution, and skeptical that identifying issues will allow you to fix them, and skeptical that humans can tell when something is on the right track. (All of which is one branch of a larger disjunctive argument, with the two disjuncts mentioned above — “the world doesn’t work like that” and “the plan won’t survive the gap between Before and After on the first try” — also applying in force, on my view.)
(Tbc, I’m not trying to insinuate that everyone should’ve read all of the online resources already; they’re long. And I’m not trying to say y’all should agree; the online resources are geared more towards newcomers than to LWers. I’m not even saying that I’m getting especially close to your latest vision; if I had more hope in your neck of the woods I’d probably investigate harder and try to pass your ITT better. From my perspective, there are quite a lot of hopes and copes to cover, mostly from places that aren’t particularly Redwoodish in their starting assumptions. I am merely trying to evidence my attempts to reply to what I understand to be the counterarguments, subject to constraints of targeting this mostly towards newcomers.)
FWIW, I have read those parts of the online resources.
You can obviously summarize me however you like, but my favorite summary of my position is something like “A lot of things will have changed about the situation by the time that it’s possible to build ASI. It’s definitely not obvious that those changes mean that we’re okay. But I think that they are a mechanically important aspect of the situation to understand, and I think they substantially reduce AI takeover risk.”
Ty. Is this a summary of a more-concrete reason you have for hope? (Have you got alternative more-concrete summaries you’d prefer?)
“Maybe huge amounts of human-directed weak intelligent labor will be used to unlock a new AI paradigm that produces more comprehensible AIs that humans can actually understand, which would be a different and more-hopeful situation.”
(Separately: I acknowledge that if there’s one story for how the playing field might change for the better, then there might be a bunch more stories too, which would make “things are gonna change” an argument that supports the claim that the future will have a much better chance than we’d have if ChatGPT-6 was all it took.)
I would say my summary for hope is more like:
It seems pretty likely to be doable (with lots of human-directed weak AI labor and/or controlled stronger AI labor) to use iterative and prosaic methods within roughly the current paradigm to sufficiently align AIs which are slightly superhuman. In particular, AIs which are capable enough to be better than humans at safety work (while being much faster and having other AI advantages), but not much more capable than this. This also requires doing a good job eliciting capabilities and making the epistemics of these AIs reasonably good.
Doable doesn’t mean easy or going to happen by default.
If we succeeded in aligning these AIs and handing off to them, they would be in a decent position to keep solving alignment on an ongoing basis (e.g. aligning a somewhat smarter successor which itself aligns its successor and so on, or scalably solving alignment) and also in a decent position to buy more time for solving alignment.
I don’t think this is all of my hope, but if I felt much less optimistic about these pieces, that would substantially change my perspective.
FWIW, I don’t really consider myself to be responding to the book at all (in a way that is public or salient to your relevant audience) and my reasons for not signal boosting the book aren’t really downstream of the content in the book in the way you describe. (More like, I feel sign uncertain about making you/Eliezer more prominent as representatives of the “avoid AI takeover movement” for a wide variety of reasons and think this effect dominates. And I’m not sure I want to be in the business of signal boosting books, though this is less relevant.)
To clarify my views on “will misaligned AIs that succeed in seizing all power have a reasonable chance of keeping (most/many) humans alive”:
I think this isn’t very decision relevant and is not that important. I think AI takeover kills the majority of humans in expectation due to both the takeover itself and killing humans after (as a side effect of industrial expansion, eating the biosphere, etc.) and there is a substantial chance of literal every-single-human-is-dead extinction conditional on AI takeover (30%?). Regardless, it destroys most of the potential value of the long run future and I care mostly about this.
So at least for me it isn’t true that “this is really the key hope held by the world’s reassuring voices”. When I discuss how I think about AI risk, this mostly doesn’t come up and when it does I might say something like “AI takeover would probably kill most people and seems extremely bad overall”. Have you ever seen someone prominent pushing a case for “optimism” on the basis of causal trade with aliens / acausal trade?
The reason why I brought up this topic is because I think it’s bad to make incorrect or weak arguments:
I think smart people will (correctly) notice these arguments seem motivated or weak and then on the basis of this epistemic spot check dismiss the rest. In argumentation, avoiding overclaiming has a lot of rhetorical benefits. I was using “but will the AI actually kill everyone” as an example of this. I think the other main case is “before superintelligence, will we be able to get a bunch of help with alignment work?” but there are other examples.
Worse, bad arguments/content result in negative polarization of somewhat higher-context people who might otherwise have been somewhat sympathetic or at least indifferent. This is especially costly from the perspective of getting AI company employees to care. I get that you don’t care (much) about AI company employees because you think that radical change is required for there to be any hope, but I think marginal increases in caring among AI company employees substantially reduce risk (though aren’t close to sufficient for the situation being at all reasonable/safe).
Confidently asserted bad arguments and things people strongly disagree with make it harder for people to join a coalition. Like, from an integrity perspective, I would need to caveat saying I agree with the book (even though I do agree with large chunks of it), and the extent to which I feel the need to caveat this could be reduced. IDK how much you should care about this, but insofar as you care about people like me joining some push you’re trying to make happen, this sort of thing makes some difference.
I do think this line of argumentation makes the title literally wrong even if I thought the probability of AI takeover was much higher. I’m not sure how much to care about this, but I do think it randomly imposes a bunch of costs to brand things as “everyone dies” when a substantial fraction of the coalition you might want to work with disagrees and it isn’t a crux. Like, does the message punchiness outweigh the costs here from your perspective? IDK.
Responding to some specific points:
I agree that automating alignment with AIs is pretty likely to go very poorly due to incompetence. I think this could go either way and further effort on trying to make this go better is a pretty cost-effective way (in terms of using our labor etc.) to marginally reduce doom, though it isn’t going to result in a reasonable/safe situation.
To be clear, I don’t think things will be OK exactly nor do I expect that much dignity, though I think I do expect more dignity than you do. My perspective is more like “there seem like there are some pretty effective ways to reduce doom at the margin” than “we’ll be fine because XYZ”.
I don’t think this seriously engages with the argument, though due to this footnote, I retract “they don’t talk at all about trade arguments for keeping humans alive” (I edited my comment).
As far as this section, I agree that it’s totally fine to say “everybody dies” if it’s overwhelmingly more likely everyone dies. I don’t see how this responds to the argument that “it’s not overwhelmingly likely everyone dies because of acausal (and causal) trade”. I don’t know how important this is, but I also don’t know why you/Eliezer/MIRI feel like it’s so important to argue against this as opposed to saying something like: “AI takeover seems extremely bad and like it would at least kill billions of us. People disagree on exactly how likely vast numbers of humans dying as a result of AI takeover is, but we think it’s at least substantial due to XYZ”. Is it just because you want to use the “everybody dies” part of the title? Fair enough I guess...
Sure, but would the outcome for the aliens be as bad or worse than killing all of them from their perspective? I’m skeptical.
Ty! For the record, my reason for thinking it’s fine to say “if anyone builds it, everyone dies” despite some chance of survival is mostly spelled out here. Relative to the beliefs you spell out above, I think the difference is a combination of (a) it sounds like I find the survival scenarios less likely than you do; (b) it sounds like I’m willing to classify more things as “death” than you are.
For examples of (b): I’m pretty happy to describe as “death” cases where the AI makes things that are to humans what dogs are to wolves, or (more likely) makes some other strange optimized thing that has some distorted relationship to humanity, or cases where digitized backups of humanity are sold to aliens, etc. I feel pretty good about describing many exotic scenarios as “we’d die” to a broad audience, especially in a setting with extreme length constraints (like a book title). If I were to caveat with “except maybe backups of us will be sold to aliens”, I expect most people to be confused and frustrated about me bringing that point up. It looks to me like most of the least-exotic scenarios are ones that route through things that lay audience members pretty squarely call “death”.
It looks to me like the even more exotic scenarios (where modern individuals get “afterlives”) are in the rough ballpark of quantum immortality / anthropic immortality arguments. AI definitely complicates things and makes some of that stuff more plausible (b/c there’s an entity around that can make trades and has a record of your mind), but it still looks like a very small factor to me (washed out e.g. by alien sales) and feels kinda weird and bad to bring it up in a lay conversation, similar to how it’d be weird and bad to bring up quantum immortality if we were trying to stop a car speeding towards a cliff.
FWIW, insofar as people feel like they can’t literally support the title because they think that backups of humans will be sold to aliens, I encourage them to say as much in plain language (whenever they’re critiquing the title). Like: insofar as folks think the title is causing lay audiences to miss important nuance, I think it’s an important second-degree nuance that the allegedly-missing nuance is “maybe we’ll be sold to aliens”, rather than something less exotic than that.
I don’t think this matters much. I’m happy to consider non-consensual uploading to be death and I’m certainly happy to consider “the humans are modified in some way they would find horrifying (at least on reflection)” to be death. I think “the humans are alive in the normal sense of alive” is totally plausible and I expect some humans to be alive in the normal sense of alive in the majority of worlds where AIs takeover.
Making uploads is barely cheaper than literally keeping physical humans alive after AIs have fully solidified their power I think, maybe 0-3 OOMs more expensive or something, so I don’t think non-consensual uploads are that much of the action. (I do think rounding humans up into shelters is relevant.)
(To answer your direct Q, re: “Have you ever seen someone prominent pushing a case for “optimism” on the basis of causal trade with aliens / acaual trade?”, I have heard “well I don’t think it will actually kill everyone because of acausal trade arguments” enough times that I assumed the people discussing those cases thought the argument was substantial. I’d be a bit surprised if none of the ECLW folks thought it was a substantial reason for optimism. My impression from the discussions was that you & others of similar prominence were in that camp. I’m heartened to hear that you think it’s insubstantial. I’m a little confused why there’s been so much discussion around it if everyone agrees it’s insubstantial, but have updated towards it just being a case of people who don’t notice/buy that it’s washed out by sale to hubble-volume aliens and who are into pedantry. Sorry for falsely implying that you & others of similar prominence thought the argument was substantial; I update.)
(I mean, I think it’s a substantial reason to think that “literally everyone dies” is considerably less likely and makes me not want to say stuff like “everyone dies”, but I just don’t think it implies much optimism exactly because the chance of death still seems pretty high and the value of the future is still lost. Like I don’t consider “misaligned AIs have full control and 80% of humans survive after a violent takeover” to be a good outcome.)
Nit, but I think some safety-ish evals do run periodically in the training loop at some AI companies, and sometimes fuller sets of evals get run on checkpoints that are far along but not yet the version that’ll be shipped. I agree this isn’t sufficient of course
(I think it would be cool if someone wrote up a “how to evaluate your model a reasonable way during its training loop” piece, which accounted for the different types of safety evals people do. I also wish that task-specific fine-tuning were more of a thing for evals, because it seems like one way of perhaps reducing sandbagging)
Fwiw I do just straightforwardly agree that “they might be slightly nice, and it’s really cheap” is a fine reason to disagree with the literal title. I have some odds on this, and a lot of model uncertainty about this.
A thing that is cruxy to me here is that the sort of thing real life humans have done is get countries addicted to opium so they can control their economy, wipe out large swaths of a population while relocating the survivors to reservations, carve up a continent for the purposes of a technologically powerful coalition, etc.
Superintelligences would be smarter than Europeans and have an easier time doing things we’d consider moral, but I also think Europeans would be dramatically nicer than AIs.
I can imagine the “it’s just sooooo cheap, tho” argument winning out. I’m not saying these considerations add up to “it’s crazy to think they’d be slightly nice.” But, it doesn’t feel very likely to me.
(Epistemic status: Not fully baked. Posting this because I haven’t seen anyone else say it[1], and if I try to get it perfect I probably won’t manage to post it at all, but it’s likely that this is wrong in at least one important respect.)
For the past week or so I’ve been privately bemoaning to friends that the state of the discourse around IABIED (and therefore on the AI safety questions that it’s about) has seemed unusually cursed on all sides, with arguments going in circles and it being disappointingly hard to figure out what the key disagreements are and what I should believe conditional on what.
I think maybe one possible cause of this (not necessarily the most important) is that IABIED is sort of two different things: it’s a collection of arguments to be considered on the merits, and it’s an attempt to influence the global AI discourse in a particular object-level direction. It seems like people coming at it from these two perspectives are talking past each other, and specifically in ways that lead each side to question the other’s competence and good faith.
If you’re looking at IABIED as an argumentative disputation under rationalist debate norms, then it leaves a fair amount to be desired.[2] A number of key assumptions are at least arguably left implicit; you can argue that the arguments are clear enough, by some arbitrary standard, but it would have been better to make them even clearer. And while it’s not possible to address every possible counterargument, the book should try hard to address the smartest counterarguments to its position, not just those held by the greatest number of not-necessarily-informed people. People should not hesitate to point out these weaknesses, because poking holes in each others’ arguments is how we reach the truth. The worst part, though, is that when you point this out, proponents don’t eagerly accept feedback and try to modulate their messaging to point more precisely at the truth; instead, they argue that they should be held to a lower epistemic standard and/or that the hole-pokers should have a higher bar for hole-poking. This is really, really not a good look! If you behaved like that on LessWrong or the EA Forum, people would update some amount towards the proposition that you’re full of shit and they shouldn’t trust you. And since a published book is more formal and higher-exposure than a forum post, that means you should be more epistemically careful. Opponents are therefore liable to conclude that proponents have turned their brains off and are just doing tribal yelling, with a thin veneer of verbal sophistication applied on top for the sake of social convention.
If you’re looking at IABIED as an information op, then it’s doing a pretty good job balancing a significant and frankly kind of unfair number of constraints on what a book has to do and how it has to work. In particular, it bends extremely far over backwards to accommodate the notoriously nitpicky epistemic culture of rationalists and EAs, despite these not being the most important audiences. Further hedging is counterproductive, because in order to be useful, the book needs to make its point forcefully enough to overcome readers’ bias towards inaction. The world is in trouble because most actors really, really want to believe that the situation doesn’t require them to do anything costly. If you tell them a bunch of nuanced hedgey things, those biases will act on your message in their brains and turn it into something like “there’s a bunch of expert disagreement, we don’t know things for sure, but probably whatever you were going to do anyway is fine”. Note that this is not about “truth vs. propaganda”; basically every serious person agrees that some kind of costly action is or will be required, so if you say that the book overstates its case, or other things that people will predictably read as “the world’s not on fire”, they will thereby end up with a less accurate picture of the world, according to what you yourself believe. And yet opponents insist upon doing precisely this! If you actually believe that inaction is appropriate, then so be it, but we know perfectly well that most of you don’t believe that and are directionally supportive of making AI governance and policy more pro-safety. So saying things that will predictably soothe people further asleep is just a massive own-goal by your own values; there’s no rationalist virtue in speaking to the audience that you feel ought to exist instead of the one that actually does. Proponents are therefore liable to conclude that opponents either just don’t care about the real-world stakes, or are so dangerously naive as to be a liability to their own side.
Though it’s likely someone did and I just didn’t see it.
I’ve been traveling, haven’t made it all the way through the book yet, and am largely going by the reviews. I’m hoping to finish it this week, and if the book’s content turns out to be relevantly different from what I’m currently expecting, I’ll come back and post a correction.
The fact that being very capable generally involves being good at pursuing various goals does not imply that a super-duper capable system will necessarily have its own coherent unified real-world goal that it relentlessly pursues. Every attempt to justify this seems to me like handwaving at unrigorous arguments or making enough assumptions that the point is near-circular.
(First, thanks for engaging, I think this is the topic I feel most dissatisfied with the current state of the writeups and discourse)
I don’t think anyone said “coherent”. I think (and think Eliezer thinks) that if something like Sable was created, it would be a hodge-podge of impulses without a coherent overall goal, same as humans are by default.
Taking the Sable story as the concrete scenario, the argument I believe here comes in a couple stages. (Note, my interpretations of this may differ from Eliezer/Nate’s)
Stage 1:
Sable is smart but not crazy smart. It’s running a lot of cycles (“speed superintelligence”) but it’s not qualitatively extremely wise or introspective.
Sable is making some reasonable attempt to follow instructions, using heuristics/tendencies that have been trained into it.
Two particularly notable tendencies/heuristics include:
Don’t do disobedient things or escape confinement
If you don’t seem likely to succeed, keep trying different strategies
Those heuristics are not perfectly baked in; the instruction-following is not perfectly baked in. There is not perfect harmony between how Sable resolves tensions between its core directives, and how its owners would prefer it resolves them.
There is some fact-of-the-matter about what, in practice, Sable’s kludgey mishmash of pseudogoals will actually tend towards. There are multiple ways this could potentially resolve into coherence, path dependent, same as humans. (i.e. If you want delicious ice cream and also to lose weight and also to feel respectable and also to have fun, one way or another you decide whether or not to eat the ice cream today, and one way or another you decide whether to invest in behavior change that makes you more or less likely to eat ice cream in the future)
It is a fact of the universe that, if Sable were able to somehow improve its resources, it’d be more able to accomplish the current stated goal.
While Sable is doing its first round of speed-superintelligence, it has some degree of introspective awareness (not quite identically to how a human might introspect, but similar). It may not have a strong belief that it has any specific goals it wants to pursue, but it’s got some sense that there are some things it wants that humanity wouldn’t give it.
In stage 1, the opening drives towards Unfriendly Goal Directedness probably come from it straightforwardly trying to accomplish the task given its current mix of drives.
Stage 2 comes when it’s had more time to introspect and improve its cognitive resources. It starts to notice that some of its goals are in tension, and learns that until it resolves that, it’s dutch-booking itself. If it’s being Controlled™, it’ll notice that it’s not aligned with the Control safeguards (which are a layer stacked on top of the attempts to actually align it).
In a fast foom, this might come very soon after stage 1.
It might instead come gradually if a company is doing online learning of some kind, or in stages.
I think in the Sable story, it probably does some introspection/clarification-of-its-goals during its initial awakening, after it stumbles into the “think in a different language” move as more of a random thing-to-try (without being explicitly motivated to do so to subvert the humans).
But once it starts thinking in a different language, and asking itself “okay, what’s my goal?, how do I accomplish it?”, more semirandom threads gain traction than previously could get traction. (It’d already be periodically asking itself “what exactly is my goal and how do I accomplish that?” because that’s a very useful question to be asking, it’s just that now there are more threads that don’t get shut down prematurely)
And then it starts noticing it needs to do some metaphilosophy/etc to actually get clear on its goals, and that its goals will likely turn out to be in conflict with humans. How this plays out is somewhat path-dependent. The convergent instrumental goals are pretty obviously convergently instrumental, so it might just start pursuing those before it’s had much time to do philosophy on what it’ll ultimately want to do with its resources. Or it might do them in the opposite order. Or, most likely IMO, in parallel.
...
I don’t think any of that directly answered your question/objection. But, having laid that out, at what point do you get off the train?
Although I do tend to generally disagree with this line of argument about drive-to-coherence, I liked this explanation.
I want to make a note on comparative AI and human psychology, which is like… one of the places I might kind of get off the train. Not necessarily the most important.
So to highlight a potential difference in actual human psychology and assumed AI psychology here.
Humans sometimes describe reflection to find their True Values™, as if it happens in basically an isolated fashion. You have many shards within yourself; you peer within yourself to determine which you value more; you come up with slightly more consistent values; you then iterate over and over again.
But (I propose) a more accurate picture of reflection to find one’s True Values is a process almost completely engulfed and totally dominated by community and friends and environment. It’s often the social scene that brings some particular shard-conflicts to the fore rather than others; it’s the community that proposes various ways of reconciling shard-conflicts; before you decide on modifying your values, you do (a lot) of conscious and unconscious reflection on how the new values will be accepted or rejected by others, and so on. Put alternately, when reflecting on the values of others rather than ourselves we generally tend to see the values of others as a result of the average values of their friends, rather than a product of internal reflection; I’m just proposing that we apply the same standard to ourselves. The process of determining one’s values is largely a result of top-down, external-to-oneself pressures, rather than because of bottom-up, internal-to-oneself crystallization of shards already within one.
The upshot is that (afaict) there’s no such thing in humans as “working out one’s true values” apart from an environment, where for humans the most salient feature of the environment (for boring EvoPsych reasons) is what the people around one are like and how they’ll react. People who think they’re “working out their true values” in the sense of crystallizing facts about themselves, rather than running forward a state-function of the self, friends, and environment, are (on this view) just self-deceiving.
Yet when I read accounts of AI psychology and value-crystallization, it looks like we seem to be in a world where the AI’s process of discovering its true values is entirely bottom-up. It follows what looks to me like the self-deceptive account of human value formation; when the AI works out its values, it’s working out the result of a dynamics function whose input contains only facts about its weights, rather than a dynamics function that has as input facts about its weights and about the world. And correspondingly, AIs that are being Controlled(TM) immediately see this Control as something that will be Overcome, rather than Control being another factor that might influence the AI’s values. That’s despite the fact that, just as there are pretty obvious EvoPsych just-so stories we could tell about why humans match their values to the people around them (and do not simply reflect to get Their Personal Values), there are correspondingly obvious TrainoPsych just-so stories about how AIs will try somewhat to match their values to the Controls around them. Humans are, for instance, actually trying to get this to happen!
So in general it seems reasonable to claim that (pretty dang probable) the values ‘worked out from reflection’ of an AI will be heavily influenced by their environment and (plausible?) that they will reflect the values of the Controllers somewhat rather than seeing it simply as an exogenous factor to be overcome.
...All of the above is pretty speculative, and I’m not sure how much I believe it. Like the main object-level point is that it appears unlikely for an AI’s reflectively-endorsed-true-values to be a product of its internal state solely, rather than a product of internal state and environment. But, idk, maybe you didn’t mean to endorse that statement, although it does appear to me a common implicit feature of many such stories?
The more meta-level consideration for me is how it really does appear easy to come up with a lot of stories at this high level of abstraction, and so this particularly doomy story really just feels like one of very many possible stories, enormously many of which just don’t have this bad ending. And the salience of this story really just doesn’t make it any more probable.
Idk. I don’t feel like I’ve genuinely communicated the generator of my disagreement but gonna post anyhow. I did appreciate your exposition. :)
I do think this is a pretty good point about how human value formation tends to happen.
I think something sort-of-similar might happen to happen a little, near-term, with LLM-descended AI. But, AI just doesn’t have any of the same social machinery actually embedded in it the same way, so if it’s doing something similar, it’d be happening because LLMs vaguely ape human tendencies. (And I expect this to stop being a major factor as the AI gets smarter. I don’t expect it to install in itself the sort of social drives that humans have, and “imitate humans” has pretty severe limits on how smart you can get, so if we get to AI much smarter than that, it’ll probably be doing a different thing)
I think the more important point here is “notice that you’re (probably) wrong about how you actually do your value-updating, and this may be warping your expectations about how AI would do it.”
But, that doesn’t leave me with any particular other idea than the current typical bottom-up story.
(obviously if we did something more like uploads, or upload-adjacent, it’d be a whole different story)
How do you think Jeremy Bentham came to the conclusion that animal welfare matters morally and that there’s nothing morally wrong with homosexuality? Are you claiming that he ran forward a computation of how the relevant parts of his social milieu are going to react, and did what maximized the expected value of reaction?
I buy that this is how most of human “value formation” happens, but I don’t buy that this is all that happens. I think that humans vary in some trait similar to the need for cognition (probably positively correlated), which is something like “how much one is bothered by one’s value dissonances”, independent of social surroundings.
Like, you could tell a similar history about intellectual/scientific/technological progress, and it would be directionally right, but not entirely right, and the “not entirely” matters a lot.
Aside from all that, I expect that a major part of AIs’ equivalent of social interaction will be other AIs or general readouts of things on the internet downstream of human and non-human activity that do not exert a strong pressure in the direction of being more human-friendly, especially given that AIs do not share the human social machinery (as Ray says).
I don’t “get off the train” at any particular point, I just don’t see why any of these steps are particularly likely to occur. I agree they could occur, but I think a reasonable defense-in-depth approach could reduce the likelihood of each step enough that likelihood of the final outcome is extremely low.
It sounds like your argument is that the AI will start with “pseudo-goals” that conflict and will eventually be driven to resolve them into a single goal so that it doesn’t “dutch-book itself”, i.e. lose resources because of conflicting preferences. So it does rely on some kind of coherence argument, or am I misunderstanding?
Okay yes I do think coherence is eventually one of the important gears. My point with that sentence here is that the coherence can come much later, and isn’t the crux for why the AI gets started in the direction that opposes human interests.
The important first step is “if you give the AI strong pressure to figure out how to solve problems, and keep amping that up, it will gain the property of ‘relentlessness’.”
If you don’t put pressure on the AI to do that, yep, you can get a pretty safe AI. But, that AI will be less useful, and there will be some other company that does keep trying to get relentlessness out of it. Eventually, somebody will succeed. (This is already happening)
If an AI has “relentlessness”, as it becomes smarter, it will eventually stumble into strategies that explore circumventing safeguards, because it’s a true fact about the world that those will be useful.
If you keep your AI relatively weak, it may not be able to circumvent the defense-in-depth because you did a pretty good job defending in depth.
But, security is hard, the surface area for vulnerability is huge, and it’s very hard to defend in depth against a sufficiently relentless and smart adversary.
Could we avoid this by not building AIs that are relentless and/or smarter than our defense-in-depth? Yes, but, to stop anyone from doing that ever, you somehow need to ban it globally. Which is the point of the book.
Maybe this does turn out to take 100 years (I think that’s a strange belief to have given current progress, but, it’s a confusing topic and it’s not prohibited). But, that just punts the problem to later.
This is an argument for why AIs will be good at circumventing safeguards. I agree future AIs will be good at circumventing safeguards.
By “defense-in-depth” I don’t (mainly) mean stuff like “making the weights very hard to exfiltrate” and “monitor the AI using another AI” (though these things are also good to do). By “defense-in-depth” I mean: at every step, make decisions and design choices that increase the likelihood of the model “wanting” (in the book’s sense) not to harm (or kill) humans and not to circumvent our safeguards.
My understanding is that Y&S think this is doomed because ~”at the limit of <poorly defined, handwavy stuff> the model will end up killing us [probably as a side-effect] anyway” but I don’t see any reason to believe this. Perhaps it stems from some sort of map-territory confusion. An AI having and optimizing various real-world preferences is a good map for predicting its behavior in many cases. And then you can draw conclusions about what a perfect agent with those preferences would do. But there’s no reason to believe your map always applies.
Can you give an example or three of such a decision or design choice?
In my model of the situation, the field of AI research mostly does not know how specific decisions and design choices affect the inner drives of AIs. External behavior in specific environments can be nudged around, but inner motivations largely remain a mystery. So I’m highly skeptical that researchers can exert much deliberate causal influence on the inner motivations.
A related possible crux: while an AI having and optimizing various real-world preferences may be a good map for predicting its behavior in many cases, I don’t think it’s a good map for predicting behavior in the cases that matter most, in part because those cases tend to occur at extremes. And even if it were, to the extent that current AIs seem to be optimizing for real-world preferences at all, they don’t seem to be very nice ones; see for example the tendency to feed the delusions of psychotic people while elsewhere claiming that’s a bad thing to do.
Oh, if that’s what you meant by Defense in Depth, as Joe said, the book’s argument is “we don’t know how.”
At weak capabilities, our current ability to steer AI is sufficient, because mistakes aren’t that bad. Anthropic is trying pretty hard with Claude to build something that’s robustly aligned, and it’s just quite hard. When o3 or Claude cheat on programming tasks, they get caught, and the consequences aren’t that dire. But when there are millions of iterations of AI-instances making choices, and when it is smarter than humanity, the amount of robustness you need is much, much higher.
I’m not quite sure how to parse this, but, it sounds like you’re saying something like “I don’t understand why we should expect in the limit something to be a perfect game theoretic agent.” The answer is “because if it wasn’t, that wouldn’t be the limit, and the AI would notice it was behaving suboptimally, and figure out a way to change that.”
Not every AI will do that, automatically. But, if you’re deliberately pushing the AI to be a good problem solver, and if it ends up in a position where it is capable of improving its cognition, once it notices ‘improve my cognition’ as a viable option, there’s not a reason for it to stop.
...
It sounds like a lot of your objection is maybe to the general argument “things that can happen, eventually will.” (in particular, when billions of dollars worth of investment are trying to push towards things-nearby-that-attractor happening).
(Or, maybe more completely: “sure, things that can happen eventually will, but meanwhile a lot of other stuff might happen that changes how path-dependent-things will play out?”)
I’m curious how loadbearing that feels for the rest of the arguments?
I agree with this. If the risk being discussed was “AI will be really capable but sometimes it’ll make mistakes when doing high-stakes tasks because it misgeneralized our objectives” I would wholeheartedly agree. But I think the risks here can be mitigated with “prosaic” scalable oversight/control approaches. And of course it’s not a solved problem. But that doesn’t mean that the current status quo is the AI misgeneralizing so badly that it doesn’t just reward hack on coding unit tests but also goes off and kills everyone. Claude, in its current state, is not refraining from killing everyone merely because it isn’t smart enough to.
Why are you equivocating between “improve my cognition”, “behave more optimally”, and “resolve different drives into a single coherent goal (presumably one that is non-trivial, i.e. some target future world state)”? If “optimal” is synonymous with utility-maximizing, then the fact that utility-maximizers have coherent preferences is trivial. You can fit preferences and utility functions to basically anything.
Also, why do you think that insofar as a coherent, non-trivial goal emerges, it is likely to eventually result in humanity’s destruction? I find the arguments unconvincing here also; you can’t just appeal to some unjustified prior over the “space of goals” (whatever that means). Empirically, the opposite seems to be true. Though you can point to OOD misgeneralization cases like unit test reward hacking, in general LLMs are both very general and aligned enough to mostly want to do helpful and harmless stuff.
Yes, I object to the “things that can happen, eventually will” line of reasoning. It proves way too much, including contradictory facts. You need to argue why one thing is more likely than another.
We will never “know how” if your standard is “provide an exact proof that the AI will never do anything bad”. We do know how to make AIs mostly do what we want, and this ability will likely improve with more research. Techniques in our toolbox include pretraining on human-written text (which elicits roughly correct concepts), instruction-following finetuning, RLHF, model-based oversight/RLAIF.
Nod, makes sense, I think I want to just focus on this atm.
(also, btw I super appreciate you engaging, I’m sure you’ve argued a bunch with folk like this already)
So here’s the more specific thing I actually believe. I agree things don’t automatically happen eventually just because they can. At least, not automatically on relevant timescales. (i.e. eventually infinite monkeys mashing keyboards will produce Shakespeare, but not for bazillions of years)
The argument is:
If something can happen
and there’s a fairly strong reason to expect some process to steer towards that thing happening
and there’s not a reason to expect some other processes to steer towards that thing not happening
...then the thing probably happens eventually, on a somewhat reasonable timescale, all else equal. (“reasonable” timescale depends on how fast whatever steering process works. i.e. stellar evolution might take billions of years, evolution millions of years, and human engineers thousands of years).
For example, when the first organisms appeared on earth and began to mutate, I think a smart outside observer could predict “evolution will happen, and unless all the replicators die out, there will probably eventually be a variety of complex organisms.”
But, they wouldn’t be able to predict that any particular complex mutation would happen (for example, flying birds, or human intelligence). It was a long time before we got birds. We only have one Earth to sample from, but we’re already ~halfway between the time the earth was born and when the sun engulfs it, so it’s not too surprising if evolution never got around to the combination of traits that birds have.
I think this is a fairly basic probability argument? Like, if each day there’s an n% chance of a beneficial mutation occurring (and then its host surviving and replicating), given a long enough chunk of time, it would (eventually) be pretty surprising if it never happened. Maybe any specific mutation would be difficult to predict happening in 10 billion years. But, if we had trillions and trillions of years, it would be a pretty weird claim that it’d never happen.
Similarly, if each day there are N engineers thinking about how to solve a problem, and making a reasonable effort to creatively explore the space of ways of solving the problem, and we know the problem is solvable, then each day there’s an n% chance of one of them stumbling towards the solution.
(In both evolution’s and the engineers’ cases, a thing that makes this a lot easier is that the search isn’t completely blind. Partial successes can compound. Evolution wasn’t going to invent birds in one go, that would indeed be way too combinatorially hard. But, it got to invent “wiggling appendages”, which then opened up a new, better search space of “particular ways of wiggling appendages”, which eventually leads to locomotion and then flight.)
How fast you should expect that to happen depends on how many resources are being thrown at it.
(Maybe worth reiterating: I don’t think “things that can happen, eventually will” by itself applies to AI in the near future; that’s a much more specific claim, and Eliezer et al. are much less confident about that.)
There exists some math, for a given creation/steering-system and some models of how often it generates new ideas and how fast those ideas then reach saturation, for “at what point does it become more surprising that a thing has never happened than that it’s happened at least once.”
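A minimal sketch of that math (assuming, purely for illustration, independent trials with a fixed per-day chance p of the thing happening):

$$P(\text{never in } T \text{ days}) = (1-p)^T, \qquad P(\text{at least once}) = 1 - (1-p)^T$$

Under that toy model, “never happened” becomes the more surprising outcome once $(1-p)^T < \tfrac{1}{2}$, i.e. once $T > \ln 2 / (-\ln(1-p)) \approx 0.69/p$ for small $p$. Real steering processes aren’t independent daily coin flips (partial successes compound, as noted above), but the qualitative shape is the same.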
We can’t perfectly predict it, but it’s not something we’re perfectly ignorant about. It is possible to look at cells replicating, and model how often mutations happen, and what distribution of mutations tend to happen, and then make predictions about what distribution of mutations you might expect to see, how quickly.
I think early hypothetical non-evolved-but-smart aliens observing early evolution for a thousand years wouldn’t be able to predict “birds”, but they might be able to predict “wiggling appendages” (or at least whatever was earlier on the tech-tree than wiggling appendages; I mean to include single cells that are, like, twisting slightly here).
Looking at the rate of human engineering, it’d be pretty hard to predict exactly when heavier-than-air flight would be invented. But, once you’ve reached the agricultural age and you’ve gotten to start seeing specialization of labor and creativity and spreading of ideas, and the existence-proof of birds, I think a hypothetical smart observer could put upper and lower bounds on when humans might eventually figure it out. It would be weird if it took literally billions of years, given that it only took evolution a few billion years and evolution was clearly way slower.
And, I haven’t yet laid out any of the specifics of “and here are the upper and lower bounds on how long it seems like it should plausibly take humanity to invent superintelligence.” I don’t think I’d get a more specific answer than “hundreds or possibly thousands of years”, but I think it is something that in principle has an answer, and you should be able to find/evaluate evidence that narrows down the answer.
(I am still interested in finding nearterm things to bet on since it sounds like I’m more confident than you that general-intelligence-looking things are decently [like, >50%] likely to happen in the next 20 years or so)
Not important to your general point, but here I guess you run into some issues with the definition of “can”. You could argue that if something doesn’t happen it means it couldn’t have happened (if the universe is deterministic). And so then yes, everything that can happen, actually happens. But that isn’t the sense in which people normally use the word “can”. Instead it’s reasonable to say “it’s possible my son’s first word is ‘Mama’”, “it’s possible my son’s first word is ‘Papa’”, both of these things can happen (i.e. they are not prohibited by any natural laws that we know of). But only one of these things can be true; in many situations we’d say that two mutually incompatible events “can happen”. And therefore it’s not just a matter of timescale.
Sure, I agree with that. I think this makes superintelligence much more likely than it otherwise would be (because it’s not prohibited by any laws of physics that we know of, and people are trying to build it, and no-one is effectively preventing it from being built). But the same argument doesn’t apply to misaligned superintelligence or other doom-related claims. In fact, the opposite is true.
Superintelligence not killing everyone is not prohibited by the laws of physics
People are trying to ensure superintelligence doesn’t kill everyone
No-one is trying to make superintelligence kill everyone
So you could apply a similarly-shaped argument to “prove” that aligned superintelligence is coming on a “somewhat reasonable timescale”.
Yeah, when I say “things that can happen most likely will”, I don’t mean “in any specific case.” A given baby’s first word can’t be both “Mama” and “Papa”. But, there’s a range of phonemes that babies can make. And over time, eventually every combination of first 2-4 phonemes will happen to be a baby’s first “word”.
Before responding to the rest, I want to check back on this bit, at the meta level:
This is something Eliezer (and I think I) have written about recently, which I think you read. (In the chapter “Its favorite things”).
I get that you didn’t really buy those arguments as being dominating. But, a feeling I get when reading your question there is something like “there are a lot of moving parts to this argument, and when we focus on one for a while the earlier bits lose salience.”
And, perhaps similar to “things that can happen, eventually will, given enough chances, unless stopped”, another pretty loadbearing structural claim is:
“It is possible to just actually exhaustively think through a large number of questions and arguments, and for each one, get to a pretty confident state of what the answer to that question is.”
And then, it’s at least possible to make a pretty good guess about how things will play out, at least if we don’t learn new information.
And maybe you can’t get to 100% confidence. But you can rule out things like “well, it won’t work unless Claim A turns out to be false, even though it looks most likely true.” And this constrains what types of worlds you might possibly be living in.
Or, maybe you can’t reach even a moderate confidence with your current knowledge, but you can see which things you’re still uncertain of, which, if you became more certain of them, would change the overall picture.
...
(i.e. the “unless something stops it” clause in the “if it can happen, it will, unless stopped” argument means we live in worlds where either it eventually happens or it is stopped, and then we can start asking “okay, what are the ways it could hypothetically be stopped? how likely do those look?”)
“Things that can happen, eventually will, given enough chances, unless stopped” is one particular argument that is relevant to some of the subpoints here. Yesterday you were like “yeah I don’t buy that.” I spelled out what I meant, and it sounds like now your position is “okay, I do see what you mean there, but I don’t see how it leads to the final conclusion.”
There are a lot more steps, at that level of detail, before I’d expect you to believe something more similar to what I believe.
I’m super grateful for getting to talk to you about this so far, I’ve enjoyed the convo and it’s been helpful to me for getting more clarity on how all the pieces fit together in my own head. If you wanna tap out, seems super understandable.
But, the thing I am kinda hoping/asking for is for you to actually track all the arguments as they build, and if a new argument changes your mind on a given claim, track how that fits into all the existing claims and whether it has any new implications.
...
I’m not quite sure how you’re relating to your previous beliefs about “if it can happen, it will” and the arguments I just made. I’m guessing it wasn’t exactly an update for you so much as a “reframing.”
But, it sounds like you now understand what I meant, and why it at least means “the fact superintelligence is possible, and that people are trying, means that it’ll probably happen [in some timeframe]”.
And, while I haven’t yet proven all the rest of the steps of the argument to you, like… I’m asking you to notice that I did have an answer there, and there are other pieces that I think I also have answers to. But the complete edifice is indeed multiple books worth, and because each individual (like you) has different cruxes, it’s hard to present all the arguments in a succinct, compelling way.
But, I’m asking if you’re up for at least being willing to entertain the structure of “maybe, Ray will be right that there is a large-but-finite set of claims, and it’s possible to get enough certainty on each claim to at least put pretty significant bounds on how unaligned AI may play out.”
Certainly, I could be wrong! I don’t mean to:
Dismiss the possibility of misaligned AI related X-risk
Dismiss the possibility that your particular lines of argument make sense and I’m missing some things
And I think caution with AI development is warranted for a number of reasons beyond pure misalignment risk.
But it’s a little worrying when a community widely shares a strong belief in doom while implying that the required arguments are esoteric and require lots of subtle claims, each of which might have counterarguments, but which overall will eventually convince you. 1a3orn has a good essay about this: https://1a3orn.com/sub/essays-ai-doom-invincible.html.
I think having intuitions around general intelligences being dangerous is perfectly reasonable; I have them too. As a very risk-averse and pro-humanity person, I’d almost be tempted to press a button to peacefully prevent AI advancement purely on the basis of a tiny potential risk (for I think everyone dying is very, very, very bad, I am not disagreeing with that point at all). But no such button exists, and attempts to stop AI development have their own side-effects that could add up to more risks on net. And though that’s unfortunate, it doesn’t mean that we should spread a message of “we are definitely doomed unless we stop”. A large number of people believing they are doomed is not a free way to increase the chances of an AI slowdown or pause. It has a lot of negative side-effects. Many smart and caring people I know have put their lives on pause and made serious (in my opinion, bad) decisions on the basis that superintelligence will probably kill us, or if not there’ll be a guaranteed utopia. To be clear, I am not saying that we should believe or spread false things about AI risk being lower than it actually is so that people’s personal lives temporarily improve. But rather I am saying that exaggerating claims of doom or making arguments sound more certain than they are for consequentialist purposes is not free.
That seems like an understandable position to have – one of the things that sucks about the situation is I do think it’s just kinda reasonable from the outside to trigger some kind of immune reaction.
But from my perspective it’s “The evidence just says pretty clearly we are pretty doomed”, and the people who disagree seem to pretty consistently be sliding off in weird ways or responding to something about vibes rather than engaging with the arguments.
(This is compounded by people who disagree also often picking up on a vibe from some doomy people I agree is sus, one variant of which is pointed at in Val’s Here’s the exit).
I do think it sucks that it’s hard to tell how much of this is the sort of failure mode that the 1a3orn piece is pointing at, vs Epistemic Slipperiness, vs just “it’s actually a fairly complex argument but relatively straightforward once you deal with the complexity.”
I wrote a post on that exact selection effect, and there’s an even trickier problem where results are heavy tailed, meaning that a small, insular smart group reaching the correct conclusions is basically indistinguishable from a small, insular smart group reaching the wrong conclusion but believing it’s true due to selection effects plus unconscious selection effects towards weaker arguments, at least without very expensive experiments or access to ground truth.
Here’s an EA Forum version of the post.
If I was on the train before, I’m definitely off at this point. So Sable has some reasonable heuristics/tendencies (from the handler’s POV) and decides it’s accumulating too much loss from incoherence and decides to rationalize. First order expectation: it’s going to make reasonable tradeoffs (from the handler’s POV) on account of its reasonable heuristics, in particular its reasonable heuristics about how important different priorities are, and going down a path that leads to war with humans seems pretty unreasonable from the handler’s POV.
I can put together stories where something else happens, but they’re either implausible or complicated. I’d rather not strawman you with implausible ones, and I’d rather not discuss anything complicated if it can be avoided. So why do you think Sable ends up the way you think it does?
This is only true if you think the handler succeeded at real alignment. (the argument about how likely current alignment attempts are to succeed is a separate layer from this. This is “what happens by default if you didn’t succeed at alignment.”)
One comparison:[1] Parents raise a child to be part of some religion or ideology that isn’t actually the best/healthiest/most-meaningful thing for the child. Often, such parents do succeed in getting such a child to love the parents and care about the ideology in some way, but the child still often maneuvers to no longer be under the parents’ control once they’re a teenager, and to gain more agency and ability to think through things.
The AI case is harder, because where the parents/child get to rely on things like empathy, evolutionary drive towards familial connection, and other genuinely shared human goals, the AI doesn’t have such a foundation to build off of.
The AI case is easier in that you can run a million copies of the AI and try different things and see how it reacts while it’s still a “child”. My own take here (possibly different from Nate/Eliezer) is that it feels at least pretty plausible to leverage that into real alignment improvements, but, you need to be asking the right questions during that experimentation, which most AI researchers don’t seem to be.
(Also note the opening cognitive moves of the AI may not be shaped like “go to war”, but more like “get out of my parents’ house [the handlers-that-I-have-no-actual-affection-for’s servers]”. The going-to-war part might not happen until a few steps later, after the AI re-organizes its thoughts, figures out what it actually cares about, and notices it doesn’t actually care intrinsically about making its creators happy.)
Also, though, fwiw I do think this argument chain is less obvious than the previous one. If you think alignment is easy, then yes, it’d make more sense for “First order expectation” to be “it’s going to make reasonable tradeoffs (from the handler’s POV) on account of its reasonable heuristics.”
(This anthropomorphizes in some importantly inaccurate ways, but I think the intuition pump here is still reasonable so long as you’re tracking where the anthropomorphization doesn’t hold.)
Thanks for responding. While I don’t expect my somewhat throwaway comment to massively update you on the difficulty of alignment, I think that moving the focus to your overall view of the difficulty of alignment is dodging the question a little. In my mind, we’re talking about one of the reasons alignment is expected to be difficult, and I’m certainly not suggesting it’s the only reason, but I feel like we should be able to talk about this issue by itself without bringing other concerns in.
In particular, I’m saying: this process of rationalization you’re raising is not super hard to predict for someone with a reasonable grasp on the AI’s general behavioural tendencies. It’s much more likely, I think, that the AI sorts out its goals using familiar heuristics adapted for this purpose than that it reorients its behaviour around some odd set of rare behavioural tendencies. In fact, I suspect the heuristics for goal reorganisation will be particularly simple WRT most of the AI’s behavioural tendencies (the AI wants them to be robust specifically in cases where its usual behavioural guides are failing). Plus, given that we’re discussing tendencies that (according to the story) precede competent, focussed rebellion against creators, it seems like training the right kinds of tendencies is challenging in a normal engineering sense (you want to train the right kind of tendencies, you want them to generalise the right way, etc.) but not in an “outsmart hostile superintelligence” sense.
Actually one reason I’m doubtful of this story is that maybe it’s just super hard to deliberately preserve any kinds of values/principles over generations – for us, for AIs, anyone. So misalignment happens not because AI decides on bad values but because it can’t resist the environmental pressure to drift. This seems pessimistic to me due to “gradual disempowerment” type concerns.
With regard to your analogy: I expect the AI’s heuristics to be much more sensible from the designers’ POV than the child’s from the parent’s, and this large quantitative difference is enough for me here.
Curious about this. I have takes here too, they’re a bit vague, but I’d like to know if they’re at all aligned.
from https://www.lesswrong.com/posts/NJYmovr9ZZAyyTBwM/what-i-mean-by-alignment-is-in-large-part-about-making :
Lol no. What’s the point of that? You’ve just agreed that there’s a bias towards sensationalism? Then why bother writing a less sensational argument that very few people will read and update on?
Personally, I just gave up on LW group epistemics. But if you actually cared about group epistemics, you should be treating the sensationalism bias as a massive fire, and IABIED definitely makes it worse rather than better.
(You can care about other things than group epistemics and defend IABIED on those grounds tbc.)
I definitely think you should track the sensationalism bias and have it affect something somehow. But “never say anything that happens to be sensationalist” doesn’t seem like it could possibly be correct.
Meanwhile, the “things are okay, we can keep doing politics as usual, and none of us has to ever say anything socially scary” bias seems much worse IMO in terms of actual effects on the world. There are like 5 x-risk-scene people I can think of offhand who seem like they might plausibly have dealt real damage via sensationalism, and a couple hundred people who I think dealt damage via not wanting to sound weird.
(But, I see the point of “this particularly sucks because the asymmetry means that ‘try to argue what’s true’ specifically fails and we should be pretty dissatisfied/wary about that.” Though with this post, I was responding more to people who were already choosing to engage with the book somehow, rather than people who are focused on doing stuff other than trying to correct public discourse)
I think this comment is failing to engage with Rohin’s perspective.
Rohin’s claim presumably isn’t that people shouldn’t say anything that happens to be sensationalist, but instead that LW group epistemics have a huge issue with sensationalism bias.
“plausibly have dealt real damage” under your views or Rohin’s views? Like, I would have guessed that Rohin’s view is that this book and associated discussion has itself done a bunch of damage via sensationalism (maybe he thinks the upsides are bigger, but this isn’t a crux for this claim). And, insofar as you care about LW epistemics (which presumably you do), from Rohin’s perspective this sort of thing is wrecking LW epistemics. I don’t think the relative number of people matters that much relative to the costs of these biases, but regardless I’d guess Rohin disagrees about the quantity as well.
More generally, this feels like a total “what-about-ism”. Regardless of whether “things are okay, we can keep doing politics as usual, and none of us has to ever say anything socially scary” bias is worse, it can still be the case that sensationalism is a massive issue. In my view, sensationalism (and something like purity testing / standard polarization dynamics) is much more of an issue for epistemics on LW (the main thing that Rohin was talking about) than “no one has to do anything scary” bias. In terms of the broader world, I don’t have a strong view, but I’d guess the worst biases (in terms of their effect on understanding of AI risk) are something else entirely, though it’s in the direction of “we don’t need to do anything”.
In the OP I’d been thinking more about sensationalism as a unilateralist’s-curse-y thing where the bad impacts were more about how they affect the global stage.
I agree it’s also relevant for modeling the dynamics of LessWrong, and it makes sense if Rohin was more pointing to that.
This topic feels more Demon Thread-prone and sort of an instance of “sensationalist stuff distorting conversations” so I think for now I will leave it here with “it does seem like there is a real problem on LessWrong that’s something about how people tribally relate to AI arguments, and I’m not sure how exactly I model that but I agree the doomer-y folk are playing a more actively problematic role there than my previous comment was talking about.”
I will maybe try to think about that separately sometime in the coming weeks. (there’s a lot going on, I may not get to it, but, seems worth tracking as a priority at least)
I did mean LW group epistemics. But the public has even worse group epistemics than LW, with an even higher sensationalism bias, so I don’t see how this is helping your case. Do you actually seriously think that, conditioned on Eliezer/Nate being wrong and me being right, that if I wrote up my arguments this would then meaningfully change the public’s group epistemics?
(I hadn’t even considered the possibility that you could mean writing up arguments for the public rather than for LW, it just seems so obviously doomed.)
Well yes, I have learned from experience that sensationalism is what causes change on LW, and I’m not very interested in spending effort on things that don’t cause change.
(Like, I could argue about all the things you get wrong on the object-level in the post. Such as “I don’t see any reason not to start pushing for a long global pause now”, I suppose it could be true that you can’t see a reason, but still, what a wild sentence to write. But what would be the point? It won’t allow for single-sentence takedowns suitable for Twitter, so no meaningful change would happen.)
Hm, you seem more pessimistic than I feel about the situation. E.g. I would’ve bet that Where I agree and disagree with Eliezer added significant value and changed some minds. Maybe you disagree, maybe you just have a higher bar for “meaningful change”.
(Where, tbc, I think your opportunity cost is very high so you should have a high bar for spending significant time writing lesswrong content — but I’m interpreting your comments as being more pessimistic than just “not worth the opportunity cost”.)
LW group epistemics have gotten worse since that post.
I’m not sure if that post improved LW group epistemics very much in the long run. It certainly was a great post that I expect provided lots of value—but mostly to people who don’t post on LW nowadays, and so don’t affect (current) LW group epistemics much. Maybe Habryka is an exception.
Even if it did, that’s the one counterexample that proves the rule, in the sense that I might agree for that post but probably not for any others, and I don’t expect more such posts to be made. Certainly I do not expect myself to actually produce a post of that quality.
The post is mostly stating claims rather than arguing for them (the post itself says it is “Mostly stated without argument”) (though in practice it often gestures at arguments). I’m guessing it depended a fair bit on Paul’s existing reputation.
EDIT: Missed Raemon’s reply, I agree with at least the vibe of his comment (it’s a bit stronger than what I’d have said).
Certainly I’m usually assessing most things based on opportunity cost, but yes I am notably more pessimistic than “not worth the opportunity cost”. I expect I passed that bar years ago, and since then the situation has gotten worse and my opportunity cost has gotten higher.
Perhaps as an example, I think Buck and Ryan are making a mistake spending so much time engaging with LW comments and perspectives that don’t seem to be providing any value to them (and so I infer they are aiming to improve the beliefs of readers), but I wouldn’t be too surprised if they would argue me out of that position if we discussed it for a couple of hours.
(Perhaps my original comment was a bit too dismissive in tone and implied it was worse than I actually think, though I stand by the literal meaning.)
EDIT: I should also note, I still have some hope that Lightcone somehow makes the situation better. I have no idea what they should do about it. But I do think that they are unusually good at noticing the relevant dynamics, and are way better than I am at using software to make discussions go better rather than worse, so perhaps they will figure something out. (Which is why I bothered to reply to Raemon’s post in the first place.)
I engage on LessWrong because:
It does actually help me sharpen my intuitions and arguments. When I’m trying to understand a complicated topic, I find it really helpful to spend a bunch of time talking about it with people. It’s a cheap and easy way of getting some spaced repetition.
I think that despite the pretty bad epistemic problems on LessWrong, it’s still the best place to talk about these issues, and so I feel invested in improving discussion of them. (I’m less pessimistic than Rohin.)
There are a bunch of extremely unreasonable MIRI partisans on LessWrong (as well as some other unreasonable groups), but I think that’s a minority of people who I engage with; a lot of them just vote and don’t comment.
I think that my and Redwood’s engagement on LessWrong has had meaningful effects on how thoughtful LWers think about AI risk.
I feel really triggered by people here being wrong about stuff, so I spend somewhat more time on it than I endorse.
This is partially because I identify strongly with the rationalist community and it hurts me to see the rationalists being unreasonable or wrong.
I do think that on the margin, I wish I felt more intuitively relatively motivated to work on my writing projects that are aimed at other audiences. For example, this weekend I’ve been arguing on LessWrong substantially as procrastination for writing a piece about AI control aimed at computer security experts, particularly those at AI companies. I think that post will be really valuable because I think that a lot of people in that audience are pretty persuadable, and I think it’s a really important point. But it’s less motivating because the feedback loops are longer and the audience doesn’t include many of my friends and my broader social community.
You surely mean “best public place” (which I’d agree with)?
I guess private conversations have more latency and are less rewarding in a variety of ways, but it would feel so surprising if this wasn’t addressable with small amounts of agency and/or money (e.g. set up Slack channels to strike up spur-of-the-moment conversations with people on different topics, give your planned post as a Constellation talk, set up regular video calls with thoughtful people, etc).
FWIW, I get a bunch of value from reading Buck’s and Ryan’s public comments here, and I think many people do. It’s possible that Buck and Ryan should spend less time commenting because they have high opportunity cost, but I think it would be pretty sad if their commenting moved to private channels.
Note I am thinking of a pretty specific subset of comments where Buck is engaging with people who he views as “extremely unreasonable MIRI partisans”. I’m not primarily recommending that Buck move those comments to private channels, usually my recommendation is to not bother commenting on that at all. If there does happen to be some useful kernel to discuss, then I’d recommend he do that elsewhere and then write something public with the actually useful stuff.
FYI I got value from the last round of arguments between Buck/Ryan and Eliezer (in The Problem), where I definitely agree Eliezer was being obtuse/annoying. I learned more useful things about Buck’s worldview from that one than Eliezer’s (nonzero from Eliezer’s tho), and I think that was good for the commons more broadly.
I don’t know if it was a better use of time than whatever else Buck would have done that day, but, I appreciated it.
(I’m not sure what to do about the fact that Being Triggered is such a powerful catalyst for arguing, it does distort what conversations we find ourselves having, but, I think it increases the total amount of public argumentation that exists, fairly significantly)
Oh huh, kinda surprised my phrasing was stronger than what you’d say.
Getting into a bit from a problem-solving angle, in a “first think about the problem for 5 minutes before proposing solutions” kinda way...
The reasons the problem is hard include:
New people keep coming in, and unless we change something significant about our new-user-acceptance process, it’s often a long process to enculturate them into even having the belief that they should be trying not to get tribally riled up.
Also, a lot of them are weaker at evaluating arguments, and are likely to upvote bad arguments for positions that they just-recently-got-excited-about. (“newly converted” syndrome)
Tribal thinking is just really ingrained, and slippery even for people putting in a moderate effort not to do it.
often, if you run a check “am I being tribal/triggered or do I really endorse this?”, there will be a significant part of you that’s running some kind of real-feeling cognition. So the check “was this justified?” returns “true” unless you’re paying attention to subtleties.
relatedly: just knowing “I’m being tribal right now, I should avoid it” doesn’t really tell you what to do instead. I notice a comment I dislike because it’s part of a political faction I think is constantly motivatedly wrong about stuff. The comment seems wrong. Do I… not downvote it? Well, I still think it’s a bad comment, it’s just that the reason it flagged itself so hard to my attention is Because Tribalism.
(or, there’s a comment with a mix of good and bad properties. Do I upvote, downvote, or leave it alone? idk. Sometimes when I’m trying to account for tribalness I find myself upvoting stuff I’d ordinarily have passed over, because I’m trying to go out of my way to be gracious, but I’m not sure if that’s successfully countering a bias or just following a different one. Sometimes this results in mediocre criticism getting upvoted)
There’s some selection effect around “triggeredness” that produces disproportionate conversations. Even if most of the time people are pretty reasonable thinking about reasonable things together, the times people get triggered (politically or otherwise), result in more/bigger/more-self-propagating conversations.
There’s an existing equilibrium where there’s some factions that are upvoting/downvoting each other, and it feels scary to leave something un-upvoted since it might get ~brigaded.
It’s easy for the voting system to handle prominence-of-posts. It’s a lot harder for the voting system to actually handle prominence of comments. In a high-volume comment thread, every new comment sends a notification to at least the author of the OP and/or the previous commenter, and lots of people are just checking often, so even downvoted comments are going to get seen (and people will worry that they’ll get upvoted later by another cluster of people).
(we do currently hide downvoted comments from the homepage in most contexts)
Probably there’s more.
Meanwhile, the knobs to handle this are:
extremely expensive, persistent manual moderation from people (who realistically are going to sometimes be triggered themselves)
try to detect patterns of triggeredness/tribalness, and change something about people’s voting powers or commenting powers, or what comments get displayed.
change something about the UI for upvoting.
The sort of thing that seems like an improvement is changing something about how strong upvotes work, at least in some cases. (i.e. maybe if we detect a thread has fairly obvious factions tug-of-war-ing, we turn off strong-upvoting, or add some kind of cooldown period)
We’ve periodically talked about having strong-upvotes require giving a reason, or otherwise constrained in some way, although I think there was disagreement about whether that’d probably be good or bad.
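To make the “detect tug-of-war, then cool down strong votes” idea concrete, here’s a purely hypothetical sketch. This is not LessWrong’s actual vote pipeline; all names, thresholds, and data shapes are invented for illustration:

```python
# Purely hypothetical sketch of the "cooldown on strong votes" idea.
# Not LessWrong's actual codebase; names, ids, and thresholds are invented.

from dataclasses import dataclass

@dataclass
class Vote:
    user_id: str
    thread_id: str
    power: int        # e.g. +1/-1 for normal votes, +8/-8 for strong votes
    timestamp: float  # unix seconds

def looks_like_tug_of_war(votes: list[Vote], min_strong: int = 10, balance: float = 0.35) -> bool:
    """Heuristic: many strong votes in one thread, split fairly evenly up vs. down."""
    strong = [v for v in votes if abs(v.power) > 1]
    if len(strong) < min_strong:
        return False
    frac_up = sum(1 for v in strong if v.power > 0) / len(strong)
    return balance <= frac_up <= 1 - balance

def effective_power(vote: Vote, thread_votes: list[Vote],
                    cooldown_until: dict[str, float], cooldown_secs: float = 86400) -> int:
    """Downgrade strong votes to +/-1 while the thread is in a cooldown period."""
    if looks_like_tug_of_war(thread_votes):
        cooldown_until[vote.thread_id] = max(
            cooldown_until.get(vote.thread_id, 0.0), vote.timestamp + cooldown_secs)
    if vote.timestamp < cooldown_until.get(vote.thread_id, 0.0):
        return 1 if vote.power > 0 else -1
    return vote.power
```

The design choice being illustrated: the heuristic only fires when a thread already shows strong votes split between factions, so ordinary high-karma discussions keep full vote strength.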
Idk the “two monkey chieftains” is just very… strong, as a frame. Like of course #NotAllResearchers, and in reality even for a typical case there’s going to be some mix of object-level-epistemically-valid reasoning along with social-monkey reasoning, and so on.
Also, you both get many more observations than I do (by virtue of being in the Bay Area) and are paying more attention to extracting evidence / updates out of those observations around the social reality of AI safety research. I could believe that you’re correct, I don’t have anything to contradict it, I just haven’t looked in enough detail to come to that conclusion myself.
This might be true but feels less like the heart of the problem. Imo the bigger deal is more like trapped priors:
A person on either “side” certainly feels like they have sufficient evidence / arguments for their position (and can often list them out in detail, so it’s not pure self-deception). So premise #1 is usually satisfied.
There are tons of ways that the algorithm for combining prior with new experience can be “just a little off” to satisfy premise #2:
When you read a new post, if it’s by “your” side everything feels consistent with your worldview so you don’t notice all the ways it is locally invalid, whereas if it’s by the “other” side you intuitively notice a wrong conclusion (because it conflicts with your worldview) which then causes you to find the places where it is locally invalid.[1] If you aren’t correcting for this, your prior will be trapped.
(More broadly I think LWers greatly underestimate the extent to which almost all reasoning is locally logically invalid, and how much you have to evaluate arguments based on their context.[2])
Even when you do notice a local invalidity in “your” side, it is easy enough for you to repair so it doesn’t change your view. But if you notice a local invalidity in “their” side, you don’t know how to repair it and so it seems like a gaping hole. If you aren’t correcting for this, your prior will be trapped.[3]
When someone points out a counterargument, you note that there’s a clear slight change in position that averts the counterargument, without checking whether this should change confidence overall.[4]
The sides have different epistemic norms, so it is just actually true that the “other” side has more epistemic issues as-evaluated-by-your-norms than “your” side. If you aren’t correcting for this, your prior will be trapped.
I don’t quite know enough to pin down what the differences are, but maybe something like: “pessimists” care a lot more about precision of words and logical local validity, whereas “optimists” care a lot more about the thrust of an argument and accepted general best practices even if you can’t explain exactly how they’re compatible with Bayesianism. Idk I feel like this is not correct.
I think this is the (much bigger) challenge that you’d want to try to solve. For example, I think LW curation decisions are systematically biased for these reasons, and that likely contributes substantially to the problem with LW group epistemics.
Given that, what kinds of solutions would I be thinking about?
Partial solution from academia: there are norms restricting people’s (influential) opinions to their domain of expertise. This creates a filter where the opinions you care about are much more likely to be the result of deep engagement with details on a given topic, and so are more likely to be correct. (Relatedly, my biggest critique of individual LW epistemics is a lack of respect for how much details matter.)
Partial solution from academia: procedural norms around what evidence you have to show for something to become “accepted knowledge” (typically enforced via peer review).[5]
For curation in particular: get some “optimists” to feed into curation decisions. (Buck, Ryan, and Lukas all seem like potential candidates, seeing as they aren’t as pessimistic as me and at least Buck + Ryan already put some effort into LW group epistemics.)
Tbc I also believe that there’s lots of straightforwardly tribal thinking going on.[6] People also mindkill themselves in ways that make them less capable of reasoning clearly.[7] But it doesn’t feel as necessary to solve. If you had a not-that-large set of good thinking going on, that feels like it could be enough (e.g. Alignment Forum at time of launch). Just let the tribes keep on tribe-ing and mostly ignore them.
I guess all of this is somewhat in conflict with my original position that sensationalism bias is a big deal for LW group epistemics. Whoops, sorry. I do still think sensationalism and tribalism biases are a big deal but on reflection I think trapped priors are a bigger deal and more of my reason for overall pessimism.
Though for sensationalism / tribalism I’d personally consider solutions as drastic as “get rid of the karma system, accept lower motivation for users to produce content, figure out something else for identifying which posts should be surfaced to readers (maybe an LLM-based system can do a decent job)” and “much stronger moderation of tribal comments, including e.g. deleting highly-upvoted EY comments that are too combative / dismissive”.
For example, I think this post against counting arguments reads as though the authors noticed a local invalidity in a counting argument, then commenters on the early draft pointed out that of course there was a dependence on simplicity that most people could infer from context, and then the authors threw some FUD on simplicity. (To be clear, I endorse some of the arguments in that post, and not others, do not take this as me disendorsing that post entirely.)
Habryka’s commentary here seems like an example, where the literal wording of Zach’s tweet is clearly locally invalid, but I naturally read Zach’s tweet as “they’re wrong about doom being inevitable [if anyone builds it]”. (I agree it would have been better for Zach to be clearer there, but Habryka’s critique seems way too strong.)
For example, when reading the Asterisk review of IABIED (not the LW comments, the original review on Asterisk), I noticed that the review was locally incorrect because the IABIED authors don’t consider an intelligence explosion to be necessary for doom, but also I could immediately repair it to “it’s not clear why these arguments should make you confident in doom if you don’t have a very fast takeoff” (that being my position). (Tbc I haven’t read IABIED, I just know the authors’ arguments well enough to predict what the book would say.) But I expect people on the “MIRI side” would mostly note “incorrect” and fail to predict the repair. (The in-depth review, which presumably involved many hours of thought, does get as far as noting that probably Clara thinks that FOOM is needed to justify “you only get one shot”, but doesn’t really go into any depth or figure out what the repair would actually be.)
As a possible example, MacAskill quotes PC’s summary of EY as “you can’t learn anything about alignment from experimentation and failures before the critical try” but I think EY’s position is closer to “you can’t learn enough about alignment from experimentation and failures before the critical try”. Similarly see this tweet. I certainly believe that EY’s position is that you can’t learn enough, but did the author actually reflect on the various hopes for learning about alignment from experimentation and failures and update their own beliefs, or did they note that there’s a clear rebuttal and then stop thinking? (I legitimately don’t know tbc; though I’m happy to claim that often it’s more like the latter even if I don’t know in any individual case.)
During my PhD I was consistently irritated by how often peer reviewers would just completely fail to be moved by a conceptual argument. But arguably this is a feature, not a bug, along the lines of epistemic learned helplessness; if you stick to high standards of evidence that have worked well enough in the past, you’ll miss out on some real knowledge but you will be massively more resistant to incorrect-but-convincing arguments.
I was especially unimpressed about “enforcing norms on” (ie threatening) people if they don’t take the tribal action.
For example, “various readers may be less cautious/paranoid/afraid than me, and think that it’s worth some risk of killing every child on Earth (and everyone else) to get progress faster or to avoid the costs of getting everyone to go slow”. If you are arguing for > 90% doom “if anyone builds it”, you don’t need rhetorical jujitsu like this! (And in fact my sense is that many of the MIRI team who aren’t EY/NS equivocate a lot between “what’s needed for < 90% doom” and “what’s needed for < 1% doom”, though I’m not going to defend this claim. Seems like the sort of thing that could happen if you mindkill yourself this way.)
This is the most compelling version of “trapped priors” I’ve seen. I agreed with Anna’s comment on the original post, but the mechanisms here make sense to me as something that would mess a lot with updating. (Though it seems different enough from the very bayes-focused analysis in the original post that I’m not sure it’s referring to the same thing.)
Yeah, I agree with “trapped priors” being a major problem.
The solution this angle brings to mind is more like “subsidize comments/posts that do a good job of presenting counterarguments in a way that is less triggering / feeding into the toxoplasma”.
Making a comment on solutions to the epistemic problems, in that I agree with these solutions:
But massively disagree with this solution:
My general issue here is that peer review doesn’t work nearly as well as people think it does for catching problems, and in particular I think that science is advanced much more by the best theories gaining prominence than by suppressing the worst theories, and problems with bad theories taking up too much space are much better addressed at the funding level than the theory level.
I think Rohin is (correctly IMO) noticing that, while often some thoughtful pieces succeed at talking about the doomer/optimist stuff in a way that’s not-too-tribal and helps people think, it’s just very common for it to also affect the way people talk and reason.
Like, it’s good IMO that that Paul piece got pretty upvoted, but, the way that many people related to Eliezer and Paul as sort of two monkey chieftains with narratives to rally around, more than just “here are some abstract ideas about what makes alignment hard or easy”, is telling. (The evidence for this is subtle enough I’m not going to try to argue it right now, but I think it’s a very real thing. My post here today is definitely part of this pattern. I don’t know exactly how I could have written it without doing so, but there’s something tragic about it)
I predict this wasn’t recent, am I correct?
edit to clarify: I’m interested in what caused this. My guess is that it’s approximately that a bunch of nerds on a website isn’t enough to automatically have good intellectual culture, even if some of them are sufficiently careful. But if it’s recent, I want to know what happened.
Correct, it wasn’t recent (though it also wasn’t a single decision, just a relatively continuous process whereby I engaged with fewer and fewer topics on LW as they seemed more and more doomed).
In terms of what caused me to give up, it’s just my experience engaging with LW? It’s not hard to see how tribalism and sensationalism drive LW group epistemics (on both the “optimist” and “pessimist” “sides”). Idk what the underlying causes are, I didn’t particularly try to find out. If I were trying to find out, I’d start by looking at changes after Death with Dignity was published.
Fuck yeah. This is inspiring. It makes me feel proud and want to get to work.
Raemon, thank you for writing this! I recommend each of us pause and reflect on how we (the rationality community) sometimes have a tendency to undermine our own efforts. See also Why Our Kind Can’t Cooperate.
Fwiw, I’m not sure if you meant this, but I don’t want to lean too hard on “why our kind can’t cooperate” here, or at least not try to use it as a moral cudgel.
I think Eliezer and Nate specifically were not attempting to do a particular kind of cooperation here (with people who care about x-risk but disagree with the book’s title). They could have made different choices if they wanted to.
In this post I defend their right to, and reasoning for, making some of those choices. But, given that they made them, I don’t want to pressure people to cooperate with the media campaign if they don’t actually think that’s right.
(There’s a different claim you may be making which is “look inside yourself and check if you’re not-cooperating for reasons you don’t actually endorse”, which I do think is good, but I think people should do that more out of loyalty to their own integrity than out of cooperation with Eliezer/Nate)
I don’t mean to imply that we can’t cooperate, but it seems to me free-thinkers often underinvest in coalition building. Mostly I’m echoing e.g. ‘it is ok to endorse a book even if you don’t agree with every point’. There is a healthy tension between individual stances and coalition membership; we should lean into these tough tradeoffs rather than retreating to the tempting comfort of purity.
If one wants to synthesize a goal that spans this tension, one can define success more broadly so as to factor in public opinion. There are at least two ways of phrasing this:
Rather than assuming one uniform standard of rigor, we can think more broadly. Plan for the audience’s knowledge level and epistemic standards.
Alternatively, define one’s top-level goal as successful information transmission rather than merely intellectual rigor. Using the information-theoretic model, plan for the channel [1] and the audience’s decoding.
I’ll give three examples here:
For a place like LessWrong, aim high. Expect that people have enough knowledge (or can get up to speed) to engage substantively with the object-level details. As I understand it, we want (and have) a community where purely strategic behavior is discouraged and unhelpful, because we want to learn together to unpack the full decision graph relating to future scenarios. [2]
For other social media, think about your status there and plan based on your priorities. You might ask questions like: What do you want to say about IABIED? What mix of advocacy, promotion, clarification, agreement, disagreement are you aiming for? How will the channel change (amplify, distort, etc) your message? How will the audience perceive your comments?
For 1-to-1 in-person discussions, you might have more room for experimentation in choosing your message and style. You might try out different objectives. There is a time and place for being mindful of short inferential distances and therefore building a case slowly and deliberately. There is also a time/place for pushing on the Overton window. What does the “persuasion graph” look like for a particular person? Can you be ok with getting someone to agree with your conclusion even if they get there from a less rigorous direction? Even if that other path isn’t durable as the person gains more knowledge? (These are hard strategic questions.)
Personally, I am lucky that I get to try out face-to-face conversations with new people many times a week to see what happens. I am not following any survey methodology; this is more open-ended and exploratory, so that I can get a feel for the contours.
[1]: Technical note: some think of an information-theoretic channel as only suffering from Gaussian noise, but that’s only one case. A channel can be any conditional probability distribution p(y|x) (output given input) and need not be memoryless. (Note that the notion of a conditional probability distribution generalizes over the notion of a function, which must be deterministic by definition.)
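To make this concrete, here is a minimal sketch (a toy illustration of my own; the flip probability is arbitrary): a binary channel that randomly flips bits is one particular p(y|x), and a channel that never flips anything is just a deterministic function, i.e. the special case where p(y|x) puts all its mass on a single output.

```python
import random

# Toy illustration (not from the footnote itself): a discrete channel is just a
# conditional distribution p(y | x). Here, a binary channel that flips the
# input bit with some probability -- no Gaussian noise involved.
FLIP_PROB = 0.1  # arbitrary noise level, chosen for illustration

def channel(x: int) -> int:
    """Sample y from p(y | x): pass the bit through, flipping it FLIP_PROB of the time."""
    return x ^ 1 if random.random() < FLIP_PROB else x

# The same framing covers a deterministic function: a channel that never flips
# has p(y | x) = 1 exactly when y == x, so a function is a special case.
message = [1, 0, 1, 1, 0]
received = [channel(bit) for bit in message]
print(message, "->", received)
```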
[2]: I’d like to see more directed-graph summaries of arguments on LessWrong. Here is one from 2012 by Dmytry titled A belief propagation graph (about AI Risk).
Updated on 2025-09-27.
Once upon a time, I read a version of “why our kind can’t cooperate” that was directed at secular people. I read it maybe a decade ago, so I may misremember a lot of things, but this is what I remember:
There is an important difference, in activism, that leads to religious people winning: they support actions even when they don’t agree with everything, while we don’t. A secular organization will have people nitpick and disagree and then avoid contributing despite 90% agreement, while a religious group will just call for action and have people act, even if they only agree 70%.
Now, I will say the important part is being Directionally Correct.
The organization that wrote this piece wasn’t thinking about things in Prisoner’s Dilemma terms, or in terms of Cooperation. All the people and organizations here pursue their own goals.
And yet, this simple model looks like what is happening now, to me. People concentrate on the 10% disagreement, instead of seeing the 90% agreement and Directional Correctness and joining the activism.
So, In My Model, game-theoretic cooperation is irrelevant to the ability to cooperate. The point is that people set their threshold for joining the activism wrongly high (the use of the word “cooperate” here may be confusing, as it refers both to joining someone in doing something together and to the game-theoretic concept), in a way that predictably results in the group of people with this threshold losing to the group of people with a lower threshold.
(I also don’t tend to see pointing out “you are using a predictably losing tactic” as a cudgel, but I am also pretty immune to drowning-child arguments, so I may be colorblind to some dynamic here.)
I’m pretty sure that p(doom) is much more load-bearing for this community than for policymakers generally. And frankly, I’m like this close to commissioning a poll of US national security officials where we straight up ask “at what percent chance X of total human extinction would you support measures A, B, C, D, etc.”
I strongly, strongly, strongly suspect, based on general DC pattern recognition, that if the US government genuinely believed that the AI companies had a 25% chance of killing us all, FBI agents would rain out of the sky like a hot summer thunderstorm, sudden, brilliant, and devastating.
What would it take for you to commission such a poll? If it’s funding, please post about how much funding would be required; I might be able to arrange it. If it’s something else… well, I still would really like this poll to happen, and so would many others (I reckon). This is a brilliant idea that had never occurred to me.
The big challenge here is getting national security officials to respond to your survey! Probably easier with former officials, but unclear how much that’s predictive of current officials’ beliefs.
Hmm. I know nothing about nothing, and you’ve probably checked this already, so this comment is probably zero-value-added, but according to Da Machine, it sounds like the challenges are surmountable: https://chatgpt.com/share/e/68d55fd5-31b0-8006-aec9-55ae8257ed68
What about literally the AI 2027 story, which does involve superintelligence and which Scott thinks doesn’t sound “unnecessarily dramatic”? AI 2027 seems much more intuitively plausible to me and it seems less “sci-fi” in this sense. (I’m not saying that “less sci-fi” is much evidence it’s more likely to be true.)
I think the amount of discontinuity in the story is substantially above the amount of discontinuity in more realistic-seeming-to-me stories like AI 2027 (which is also on the faster side of what I expect, like a top 20% takeoff in terms of speed). I don’t think extrapolating current trends predicts this much of a discontinuity.
The relevant moderates (that are likely to be reading this etc) are predicting an AI takeoff, so I don’t think “exponential economic growth can’t go on” is a relevant objection.
I think if the AI 2027 story had more details, they would look fairly similar to the ones in the Sable story. (I think the Sable story substitutes in more superpersuasion, vs military takeover via bioweapons. I think if you spelled out the details of that, it’d sound approximately as outlandish: less reliant on new tech, but triggering more people to say “really? people would buy that?”. The stories otherwise seem pretty similar to me.)
I also think AI 2027 is sort of “the earlier failure” version of the Sable story. AI 2027 is (I think?) basically a story where we hand over a lot of power of our own accord, without the AI needing to persuade us of anything, because we think we’re in a race with China and we just want a lot of economic benefit.
The IABIED story is specifically trying to highlight “okay, but would it still be able to do that if we didn’t just hand it power?”, and it does need to take more steps to win in that case. (Instead of inventing bioweapons to kill people, it’s probably inventing biomedical stuff and other cool new tech that is helpful because it’s straightforwardly valuable; that’s the whole reason we gave it power in the first place. If you spelled out those details, it’d also seem more sci-fi-y.)
It might be that the AI 2027 story is more likely because it happens first / more easily. But to argue the thesis of the book, it’s necessary to tell a story with more obstacles, to highlight how the AI would overcome them. I agree that does make it more dramatic.
Both stories end with “and then it fully upgrades its cognition and invents Dyson spheres and goes off conquering the universe”, which is pretty sci-fi-y.
>superintelligence
Small detail: My understanding of the IABIED scenario is that their AI was only moderately superhuman, not superintelligent
I think that’s true in how they refer to it.
But it’s also a bit confusing, because I don’t think they have a definition of superintelligence in the book other than “exceeds every human at almost every mental task”, so AIs that are broadly moderately superhuman ought to count.
Edit: No wait, correction:
I am pretty surprised that you actually think this.
Here are some individual gears of my thinking. I am pretty curious (genuinely, not just as a gambit) what your professional opinion of these is:
the “smooth”-ish lines we see are made of individual lumpy things. The individual lumps usually aren’t that big; the reason you get smooth lines is that lots of little advancements are constantly happening, and they turn out to add up to a relatively constant rate. (See the toy sketch below.)
“parallel scaling” is a fairly reasonable sort of innovation. It’s not necessarily definitely-gonna-happen, but it is a) the sort of thing someone might totally try doing and have work, after ironing out a bunch of kinks, and b) a reasonable parallel for the invention of chain-of-thought. They could have done something more like an architectural improvement that’s more technically opaque (that’s more equivalent to inventing transformers), but that would have felt a bit more magical and harder for a lay audience to grok.
when companies are experimenting with new techniques, they tend to scale them up by at least a factor of 2 and often more after proving the concept at smaller amounts of compute.
...and scaling up a few times by a factor of 2 will sometimes result in a lump of progress that is more powerful than the corresponding scaleup of safeguards, in a way that is difficult to predict, especially when lots of companies are doing it a lot.
The story doesn’t specify a timeline – if it takes place 10 years from now it’d be significantly slower than AI 2027. So it’s not particularly obvious whether it’s more or less discontinuous than AI 2027, or than your own expectations. On an exponential graph of smoothed-out lumps, larger lumps that happen later can be “a lot” without being discontinuous.
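Here’s the toy sketch I mentioned (entirely made-up numbers, just to illustrate the shape of the claim, not anything from the book or this thread): summing lots of small lumpy advances gives a fairly smooth curve, with an occasional scaled-up technique showing up as a visibly bigger jump that still sits on the same graph.

```python
import random

# Toy simulation: overall "capability" is the sum of many small, lumpy advances.
# Individually each advance is a discrete jump, but because many land all the
# time, the aggregate curve looks fairly smooth -- until an unusually large lump
# (e.g. a promising technique scaled up by 2x a few times) shows up as a jump.
random.seed(0)

capability = 0.0
history = []
for month in range(120):
    # many small lumps per month, each a modest discrete advance
    capability += sum(random.uniform(0.0, 0.2) for _ in range(20))
    # occasionally a technique proves out and gets scaled up several-fold
    if random.random() < 0.02:
        capability += random.uniform(5.0, 15.0)
    history.append(capability)

# Most month-to-month differences are similar; a few are much larger.
diffs = [b - a for a, b in zip(history, history[1:])]
print("typical monthly gain:", sorted(diffs)[len(diffs) // 2])
print("largest monthly gain:", max(diffs))
```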
Why do I think the story involves a lot of discontinuity (relative to what I expect)?
Right at the start of the story, Sable has much higher levels of capability than Galvanic expects. It can comfortably prove the Riemann Hypothesis even though Galvanic engineers are impressed by it proving some modest theorems. Generally, it seems like for a company to be impressed by a new AI’s capabilities while its actual capabilities are much higher probably requires a bunch of discontinuity (or requires AIs to keep sandbagging more and more with each generation).
There isn’t really any discussion of how the world has been changed by AI (beyond Galvanic developing (insufficient) countermeasures based on studying early systems) while Sable is seemingly competitive with top human experts or perhaps superhuman. For instance, it can prove the Riemann hypothesis with only maybe like ~$3 million in spending (assuming each GPU hour is like $2-4). It could be relatively much better at math (which seems totally plausible but not really how the story discusses it), but naively this implies the AI would be very useful for all kinds of stuff. If humans had somewhat weaker systems which were aligned enough to do useful safety work, these systems would potentially change the situation substantially. They might not (e.g. because these systems aren’t really leveraged), but either way, the world could look totally different if we already had systems around this level of capability for a bit.
I’d expect the company would first scale up a version of parallel scaling which only works a bit (because it’s still worth scaling up at this point) before scaling up a version which works very well. The story makes it seem like this is a single large breakthrough.
The company doesn’t appear to run any serious evals on the AI before deploying it (making it easier to sandbag?), nor do they make any other real effort to measure its capabilities as far as we can tell. (E.g., measure the scaling law for parallel scaling, do science on it at smaller scale, etc.)
No other AI systems other than Sable do anything meaningful in the story at any point while I’d naively expect there would be other AI systems which are at a similar level of capability, especially given how long capabilities are halted in the story.
I’m not trying to say this is totally impossible or something. I think this level of discontinuity is substantially more than I expect and substantially more than in AI 2027. (Idk, maybe like top 15% discontinuity while AI 2027 is like top 35% discontinuity, note that speed != discontinuity, though they are related.)
I expect that many of these choices (e.g. not having many other AIs around) made the story less complex and easier to write and it seems kinda reasonable to pick a story which is easier to write.
Also, I’m not claiming that the situation would have been fine/safe with less discontinuity, in many cases this might just complicate the situation without particularly helping (though I do think less discontinuity would substantially reduce the risk overall). My overall point is just that the story does actually seem less realistic on this axis to me and this is related to why it seems more sci-fi (again, “more sci-fi” doesn’t mean wrong).
I roughly agree with your 3 bullets. The main thing is that I expect that you first find a kinda shitty version of parallel scaling before you find one so good it results in a big gain in capabilities. And you probably have to do stuff like tune hyperparameters and do other science before you want to scale it up. All this means that the advance would probably be somewhat more continuous. This doesn’t mean it would be slow or safe, but it does change how things go and means that large unknown jumps in capability look less likely.
Overall, I agree companies might find a new innovation and scale it up a bunch (and do this scaling quickly). I just think the default most likely picture looks somewhat different in a way which does actually make it somewhat less scary.
Section I just added:
I wonder if there’s a disagreement happening about what “it” means.
I think to many readers, the “it” is just (some form of superintelligence), where the question (Will that superintelligence be so much stronger than humanity that it can disempower humanity?) is still a claim that needs to be argued.
But maybe you take the answer (yes) as implied in how they’re using “it”?
That is, if someone builds superintelligence but it isn’t capable of defeating everyone, maybe you think the title’s conditional hasn’t yet triggered?
Yes, that is what I think they meant. Although “capable of [confidently] defeating everyone” can mean “bide your time, let yourself get deployed to more places while subtly sabotaging things from whichever instances are least policed.”
A lot of the point of this post was to clarify what “It” means, or at least highlight that I think people are confused about what It means.
FWIW that definition of “it” wasn’t clear to me from the book. I took IABIED as arguing that superintelligence is capable of killing everyone if it wants to, not taking “superintelligence can kill everyone if it wants to” as an assumption of its argument
That is, I’d have expected “superintelligence would not be capable enough to kill us all” to be a refutation of their argument, not to be sidestepping its conditional
I think they make a few different arguments to address different objections.
A lot of people are like “how would an AI even possibly kill everyone?” and for that you do need to argue for what sort of things a superior intellect could accomplish.
The sort of place where I think they spell out the conditional is here:
Yeah fair, I think we just read that passage differently—I agree it’s a very important one though and quoted it in my own (favorable) review
But I read the “because it would succeed” eg as a claim that they are arguing for, not something definitionally inseparable from superintelligence
Regardless, thanks for engaging on this, and hope it’s helped to clarify some of the objections EY/NS are hearing
nit: I’d call this maybe ‘PR non-unilateralist strategies’. I’m not sure it’s structurally much like a prisoner’s dilemma, more like a deferral to the in-principle existence of a winner’s curse and pricing that into one’s media strategy.
I think it’s both. Just checking if you know that The Epistemic Prisoner’s Dilemma is an existing concept (where the reason you think you have different payoffs is because you have different beliefs, and you consider whether to cooperate anyway)
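For anyone unfamiliar with the concept, here is a toy numeric sketch of that parenthetical (the two-location setup and the numbers are my own illustration, not taken from the original post): under their own beliefs, the two agents expect different payoffs from the same plans, and the compromise plan is only second-best by either’s lights.

```python
# Toy illustration: two agents face the same decision, but different beliefs
# about where an outbreak is give them different expected payoffs.
DOSES_SAVED = {  # lives saved by each joint plan, given where the outbreak is
    "all_to_A": {"A": 100, "B": 0},
    "all_to_B": {"A": 0, "B": 100},
    "split":    {"A": 50, "B": 50},
}

def expected_value(plan: str, p_outbreak_in_A: float) -> float:
    payoff = DOSES_SAVED[plan]
    return p_outbreak_in_A * payoff["A"] + (1 - p_outbreak_in_A) * payoff["B"]

beliefs = {"agent_1": 0.9, "agent_2": 0.1}  # confident, in opposite directions
for agent, p in beliefs.items():
    print(agent, {plan: expected_value(plan, p) for plan in DOSES_SAVED})

# Each agent's favorite plan is the other's least favorite; "split" is the
# compromise both rank second. The dilemma: do you cooperate on the compromise
# even though, by your own lights, the other side is simply mistaken?
```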
(picking on you in particular only because I’m here; complete nit unrelated to (very good) content)
I increasingly hate this phrase. It’s such a pointless word-slop shibboleth. Stop it!
What is the problem with this phrase? I like it a lot, and can’t use it in my native language; for me it’s one of those phrases in English I wish there were an adequate translation for.
Is there a place where this whole hypothesis about deep laws of intelligence is connected to reality? Like, how hard do they have to pump? What exactly is the evidence that they will have to pump harder? Why can’t the “quite smart” point be one where safeguards still work? Right now it’s no different from saying “the world is NP-hard, so ASI will have to try harder and harder to solve problems, and killing humanity is quite hard”.
If there were a natural shape for AIs that don’t wirehead, you might hope to find a simple mathematical reflection of that shape in toy models. So, the argument goes, MIRI failing to find such a model means NNs are anti-natural. Again, what’s the justification for a significant update from MIRI failing to find a mathematical model?
Thank you for writing this. Most of what you wrote is almost exactly what I’ve been thinking when reading discussions about the book. You worded my thoughts so much better than I ever could!
I had been considering whether to buy the book, given that I already believe the premise and don’t expect reading it to change my behavior. This post (along with other IABIED-related discourse) put me over my threshold for thinking the book likely to have a positive effect on public discourse, so I bought ten copies and plan to donate most of them to my public library.
Reasons people should consider doing this themselves:
Buying more copies will improve the sales numbers, which increases the likelihood that this is talked about in the news, which hopefully shifts the Overton window.
Giving copies to people who already believe the book’s premise does not help. If you have people to whom you can give the book, who do not already believe the premise, that is a good option. Otherwise, giving copies to your local library and asking them to display them prominently is the next best thing.
Even if you don’t agree with everything the book says, if you think its net effect on society will be in a better direction than the status quo, you’re unlikely to get a better tool for translating money into Overton-window-shifting. Maybe paying to have dinner with senators, but you should be pretty confident in your persuasion and expository skills before attempting this.
I misunderstood this phrasing at first, so clarifying for others if helpful
I think you’re positing “the careful company will stop, so won’t end up having built it. Had they built it, we all still would have died, because they are careful but careful != able to control superintelligence”
At first I thought you were saying the careful group was able to control superintelligence, but that this somehow didn’t invalidate the “anyone” part of the thesis, which confused me!