Nontrivial pillars of IABIED

Epistemic status: Jotting down some thoughts about the underlying Yudkowskian “doomer” philosophy, not a book review, intended for the LessWrong audience only.

I believe that one of the primary goals of the Sequences was to impart certain principles which illuminate the danger from unaligned AI. Most of these principles are widely accepted here (humans are not peak intelligence, some form of the orthogonality thesis, etc.). However, not everyone agrees with IABIED. I think there are a few additional nontrivial pillars of the worldview which can be taken as cruxes. This is an idiosyncratic list of the ones that occur to me. I think that all of them are correct, but if my model of AI risk breaks, it is plausibly because one of these pillars shattered. Things look pretty bad on my model, which means that, conditioning on AI going well (without major changes to the status quo), my model breaks somewhere. Intuitively, I would be surprised but not totally shocked if my model broke, and in this post I am trying to find the source of that anticipated anti-surprisal.

General Intelligence Transfers Hard. A general system tends to outperform narrow systems even within their narrow domain by thinking outside the box, leaning on relevant analogies to a wider knowledge base, integrating recursive self-improvements (including rationality techniques and eventually self-modification), and probably via several other routes I have not thought of. This is related to the claim that training on many narrow tasks tends to spawn general (mesa-)optimizers, which seems to be pretty strongly vindicated by the rise of foundation models. It is also one reason that you can’t just stop the rogue AI’s nanobot swarm with a nanotechnology-specialist AI. This pillar opposes the vibe that the entire economy will accelerate in parallel, and supports the localized foom. The weakness of this pillar (meaning, my primary source of uncertainty about it) is that hardcoding the right answer for a particular task might make better use of limited resources for finding an effective local solution than general reasoning does: my intuition about non-uniform computation (say, with circuits) is that there are shockingly good but nearly incompressible/”inexplicable” solutions to many particular problems, which take advantage of “accidental” or contingent facts. Also, the more plausible outcome of most types of very hard optimization seems to be finding such messy solutions, which may prevent capabilities from generalizing out of distribution (in fact, for the same sort of reason that alignment will not generalize out of distribution by default).

Reality is Full of Side-channels. This is quite explicit in IABIED: a sufficiently smart adversary can “flip the game board” and do something that you thought was impossible, so it is very hard to box a superintelligence. This has the vibe of an attacker advantage, though I don’t think that is strictly central or required. It is often described as “security mindset.” This is another reason you can’t use a specialized system to stop the nanobots: even if you got that defense working, the superintelligence would just win in a slightly different way, perhaps by hacking the defender system and killing you with your own super-antibodies. Another way of phrasing this is that Yudkowsky seems to believe “keyholder power” is not very strong: you cannot robustly convert your current power into a durable privileged position. A pithier statement might be “privilege is fragile.”

The Gods are Weak. In particular, we can beat natural selection at ~everything in the next few years by building big enough computers. Think the human brain is mysterious? We can beat it by throwing together an equivalent number of crude simulated neurons and dumping in a bunch of heuristically selected data. Think bacteria are cool? A superintelligence can invent Drexlerian nanotechnology in a week and eat the biosphere. Think nation-state cryptography is secure? Please, broken before the end of this sentence (see also “Reality is Full of Side-channels”). A priori, this pillar still seems highly questionable to me. However, it does seem to be holding up surprisingly well empirically?? The biggest weakness in (my certainty about) this pillar is that when it comes to engineering flexible, robust physical stuff, the kind of stuff that grows rather than decays, evolution still seems to have us beat across the board. I can’t rule out the possibility that there are incredibly many tricks needed to make a long-running mind that adapts to a messy world. However, the natural counterargument is that the minimal training/learning rules probably are not that complicated. Usually, one argues that evolution by natural selection is a very simple optimizer. I am not sure how true this is: natural selection doesn’t run in the ether, it runs on physics (over vast reaches of space and time), and it seems hard to get the space/time/algorithmic complexity arguments about this confidently right.[1] So, one hidden sub-pillar here is a kind of computational substrate independence.

  1. ^

    See also the physical Church-Turing thesis / feasibility thesis: https://en.wikipedia.org/wiki/Church%E2%80%93Turing_thesis#Variations