About Me
Scientist by training, coder by profession, philosopher by inclination, musician against public demand.
Why I am not a Doomer
I’m specifically addressing the argument for a high probability of near extinction (doom) from AI...
Eliezer Yudkowsky: “Many researchers steeped in these issues, including myself, expect that the most likely result of building a superhumanly smart AI, under anything remotely like the current circumstances, is that literally everyone on Earth will die.”
… not whether it is barely possible, or whether other, less bad outcomes (dystopias) are probable. I’m coming from the centre, not the other extreme.
Doom, complete or almost complete extinction of humanity, requires a less than superintelligent AI to become superintelligent either very fast, or very surreptitiously … even though it is starting from a point where it does not have the resources to do either.
The “very fast” version is foom doom. Foom is rapid recursive self improvement (FOOM is supposed to represent a nuclear explosion).
The classic Foom Doom argument (https://www.greaterwrong.com/posts/kgb58RL88YChkkBNf/the-problem) involves an agentive AI that quickly becomes powerful through recursive self improvement, and has a value/goal system that is unfriendly and incorrigible.
The complete argument for Foom Doom is that:-
- The AI will have goals/values in the first place (it won’t be a passive tool like GPT*).
- The values will be misaligned, however subtly, in a way that is unfavourable to humanity.
- The misalignment cannot be detected or corrected.
- The AI can achieve value stability under self modification.
- The AI will self modify in a way too fast to stop.
- Most misaligned values in the resulting ASI are highly dangerous (even goals that aren’t directly inimical to humans can be a problem for humans, because the ASI might want to direct resources away from humans).
- The AI will have extensive opportunities to wreak havoc: biological warfare (custom DNA can be ordered by email), crashing economic systems (trading can be done online), taking over weapon systems, weaponising other technology, and so on.
It’s a conjunction of six or seven claims, not just one. (I say “complete argument” because pro-doomers almost always leave out some stages. I am not convinced that rapid self improvement and incorrigibility are both needed, but I am sure that one or the other is. Doomers need to reject the idea that misalignment can be fixed gradually, as you go along. A very fast-growing ASI, foom, is one way of doing that; the assumption that AI’s will resist having their goals changed is another.)
Obviously the problem is that to claim a high overall probability of doom, each claim in the chain needs to have a high probability. It is not enough for some of the stages to be highly probable, all must be.
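To illustrate with invented numbers: even if each of seven independent claims is granted a generous 90% probability, the conjunction comes to

$$0.9^{7} \approx 0.48$$

i.e. less than an even chance of doom, and nowhere near certainty.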
There are some specific weak points.
Will an AI be an Agent?
Minimally, a software agent is a system that acts without being specifically instructed to (the other kind being known as a tool AI). For instance, a high frequency trading agent starts playing the stock market, much faster than a human, as soon as it is booted up. Software agents need a goal to determine their actions, in the absence of specific instructions. Doomers have put forward scenarios where an AI agent relentlessly pursuing the seemingly harmless goal of making paperclips uses up all the iron in the world, eventually eyeing the iron in human blood.
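A minimal sketch of the distinction, in Python. Everything here (the names, the loop, the placeholder planner) is an illustrative assumption, not a description of any real system:

```python
# Hypothetical sketch of the tool/agent distinction; names are illustrative.

def execute(action: str) -> str:
    """Stand-in for actually doing something in the world."""
    return f"did: {action}"

def tool_ai(instruction: str) -> str:
    """A tool acts only when given a specific instruction, then stops."""
    return execute(instruction)

class AgentAI:
    """An agent acts without specific instructions; its goal fills the gap."""
    def __init__(self, goal: str):
        self.goal = goal

    def choose_action(self) -> str:
        # A real agent would plan here; this is a placeholder.
        return f"next step towards '{self.goal}'"

    def run(self, steps: int = 3) -> None:
        for _ in range(steps):  # a real agent loops for as long as it runs
            print(execute(self.choose_action()))

print(tool_ai("summarise this document"))  # acts once, on request
AgentAI("maximise trading profit").run()   # acts repeatedly, unprompted
```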
Doing anything whatsoever doesn’t imply a goal. Goals constrain a general system to a specific purpose, and the system will do something different if the goal is changed. A toaster can only make toast, so it has no goal of making toast.
John McCarthy’s definition that intelligence is the computational part of the ability to achieve goals also needs to be taken broadly—the goals don’t have to be the intelligent agent’s own. An intelligent servant is quite conceivable.
The Orthogonality Thesis (https://www.lesswrong.com/w/orthogonality-thesis) states that a lot of combinations of goals and intelligence levels are possible.
It is correctly called on to refute an argument for AI safety from moral realism: that the AI will use its intelligence to figure out what is the true morality; that it will act on it; and that the true MR will be favourable to humans. Without saying that moral realism is false, the second and third steps are far from given.
But it doesn’t imply that all possible minds have goals. Mindspace contains all combinations of intelligence and goals, including none. That’s not just an abstract possibility. At the time of writing, 2025, our most advanced AI’s, the Large Language Models, are non-agentive and corrigible.
There are also arguments that we humans will not stick with the safer seeming tool AI’s—that AI’s will become agentive, because that’s what humans want. Gwern Branwen’s confusingly titled “Why Tool AIs Want to Be Agent AIs” (https://gwern.net/tool-ai) is an example. It is true, but in more than one sense:-
The basic idea is that humans want agentive AI’s because they are more powerful. And people do want power, but not at the expense of control. Power that you can’t control is no good to you. Taking the brakes off a car makes it more powerful, but more likely to kill you. No army wants a weapon that will kill their own soldiers, no financial organisation wants a trading system that makes money for someone else, or gives it away to charity, or causes stock market crashes. The maximum amount of power and the minimum of control is an explosion.
One needs to look askance at what “agent” means as well. Among other things, it means an entity that acts on behalf of a human—as in principal/agent (https://en.m.wikipedia.org/wiki/Principal–agent_problem). An agent is no good to its principal unless it has a good enough idea of its principal’s goals. So while people will want agents, they won’t want misaligned ones—misaligned with themselves, that is. Like the Orthogonality Thesis, the argument is not entirely bad news.
Of course, evil governments and corporations controlling obedient superintelligences isn’t a particularly optimistic scenario, but it’s dystopia, not doom.
Is Goal Stability a Given?
Goal stability is an assumption behind recursive self improvement, which is an argument behind rapid takeoff, which is an argument behind getting goals/values right on the first attempt.
An AI that cannot guarantee that its descendants will share its goals might be reluctant to create new versions of itself.
It is plausible that an agent would desire to preserve its goals, but the desire to preserve goals does not imply the ability to preserve goals.
An argument for the desire to preserve goals is the “Gandhi Pill” argument. “If you offer Gandhi a pill that makes him love murdering people, he won’t take the pill because right now he doesn’t want to murder people. A self-modifying AI that wants things will tend to avoid changing what it wants, because it can predict that would lead to things it doesn’t want”
But humans do not preserve the same goals across their lifetimes, and do things which are known to change values, such as getting educated, travelling, and getting married. So the Gandhi Pill argument only seems to apply to extreme cases, in humans. And there is no reason to suppose that any given AI will be different to a human—the human mind is a possibility in mindspace.
The Orthogonality Thesis is also sometimes mistakenly called on to support goal stability. It does imply that a lot of combinations of goals and intelligence levels are possible, but doesn’t imply that all possible minds have goals, or that all goal driven agents have fixed, incorrigible goals. There are goalless and corrigible agents in mindspace, too.
No goal stable system of any complexity exists on this planet, and goal stability cannot be assumed as a default or given. So the orthogonality thesis is true of momentary combinations of goal and intelligence, given the provisos above, but not necessarily true of stable combinations.
There is, however, a puzzle about how an agent would rationally update its goals. AI doomers like to think in terms of von Neumann rationality.
“In decision theory, the von Neumann–Morgenstern (VNM) utility theorem demonstrates that rational choice under uncertainty involves making decisions that take the form of maximizing the expected value of some cardinal utility function. The theorem forms the foundation of expected utility theory.”
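In symbols (a minimal standard sketch, not from the quoted source): an agent with cardinal utility function $U$ over outcomes $s$, and beliefs $p(s \mid a)$ about the consequences of each available action $a$, is VNM-rational if it chooses

$$a^{*} = \arg\max_{a \in A} \sum_{s \in S} p(s \mid a)\, U(s).$$

Note that $U$ itself is fixed by assumption; the framework says nothing about where it comes from or how it could change.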
According to the framework of von Neumann rationality, such an update must be guided by a set of terminal goals … but how can it be when the agent’s goals themselves are subject to change?
But the puzzle is rather artificial and inconsequential. For one thing, vNM is a highly artificial framework, with the advantage of being formally well defined, and the disadvantage of being inapplicable to almost any real world system, including most AI’s … it’s literally an academic question.
For another thing, rational people—in the informal sense—are capable of reflecting on, and changing, their goals and values (https://www.greaterwrong.com/posts/8A6wXarDpr6ckMmTn/another-argument-against-utility-centric-alignment-paradigms).
Irrational goal-drift is possible, too. One thing the Orthogonality Thesis does show is that most possible minds aren’t going to converge on a single set of values. Most will not be artificial philosophers who will converge on valuing all sapient life, or some other candidate for the One True Morality.
In summary, there are no strong arguments for goal stability.
Will an AI Recursively Self Improve?
An artificial general intelligence can do anything a human can, at least as well. A human can build AI, so an AGI can build an AI … so an AGI can build a better version of itself. That’s the argument for Recursive Self Improvement.
There’s also an argument against it. Analogously to Löb’s theorem in logic (https://en.wikipedia.org/wiki/Löb’s_theorem), an AI would not be able to reliably predict the behaviour of a more powerful version of itself, meaning that, if it was concerned about its successor sharing its goals, it might refrain from self improvement.
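For reference, the theorem being analogised (a standard statement, where $\mathrm{Prov}$ is the provability predicate of Peano Arithmetic):

$$\text{if } \mathrm{PA} \vdash \mathrm{Prov}(\ulcorner P \urcorner) \rightarrow P, \text{ then } \mathrm{PA} \vdash P.$$

Informally: a consistent system can only endorse “if I can prove P, then P” for statements it can already prove, so it cannot wholesale trust its own proofs, let alone those of a stronger successor. That is the shape of the obstacle being appealed to here.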
Goal stability under self improvement is not a given: it is not possessed by all mental architectures, and may not be possessed by any, since no one knows how to engineer it, and humans appear not to have it. That’s a problem for Rapid Recursive Self Improvement, because an AI that wants to self improve cannot guarantee that its descendants will share its goals—even though self improvement is instrumentally useful.
Are Most Agentive Goals Dangerous?
An AI doesn’t have to hate humanity to be a danger to humanity, so the argument goes; many kinds of indifference are dangerous, too.
Instrumental Convergence (https://aisafety.info/questions/897I/What-is-instrumental-convergence) assumes an agent with terminal goals, the things it really wants to do, and instrumental goals, sub-goals which lead to terminal goals. (Of course, not every agent has to have that structure.) Instrumental Convergence suggests that even if an agentive AI has a seemingly harmless goal, its instrumental sub-goals can be dangerous. Just as money is widely useful to humans, computational resources are widely useful to AIs. Even if an AI is doing something superficially harmless like solving maths problems, more resources would be useful, so eventually the AI will compete with humans over resources, such as the energy needed to power data centres.
A good way of getting a feel for what is being talked about is the web game Universal Paperclips: https://www.decisionproblem.com/paperclips/index2.html
There is a solution. **If it is at all possible to instill goals, to align AI, the Instrumental Convergence problem can be countered by instilling terminal goals that are the exact opposite** … remember, instrumental goals are always subservient to terminal ones. So, if we are worried about a powerful AI going on a resource acquisition spree, we can give it a terminal goal to be economical in the use of resources, as the sketch below illustrates.
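A toy model of the idea in Python. The plans, the numbers, and the “frugality” weight are all invented for illustration; the point is only that a penalty placed in the terminal goal outranks any instrumental appetite for resources:

```python
# Hypothetical toy model of instrumental convergence and a counter-goal.
# All names and numbers are illustrative assumptions, not a real AI design.
from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    problems_solved: int    # progress on the terminal goal (solve maths problems)
    resources_used: float   # compute/energy acquired instrumentally

plans = [
    Plan("modest", problems_solved=10, resources_used=1.0),
    Plan("grab every data centre", problems_solved=12, resources_used=1000.0),
]

def utility_unconstrained(p: Plan) -> float:
    # Terminal goal only: more solved problems is strictly better, so
    # unbounded resource acquisition is instrumentally favoured.
    return p.problems_solved

def utility_economical(p: Plan, frugality: float = 0.01) -> float:
    # Economy of resources is itself part of the terminal goal, so the
    # instrumental resource grab is now penalised at the top level.
    return p.problems_solved - frugality * p.resources_used

print(max(plans, key=utility_unconstrained).name)  # -> grab every data centre
print(max(plans, key=utility_economical).name)     # -> modest
```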
Darwinism. So long as humans are in charge, selection is artificial, not natural.
Many arguments for AI Doom suppose that an advanced AI will follow instructions in an unhelpfully literal way, like a genie or Monkey’s Paw story.
Can AI be made Safe?
There are a number of routes to AI safety.
Alignment roughly means the AI has goals or values similar to human ones, so that, even acting agentively without supervision, it will do what humans would want.
Is Alignment Possible?
There is a lot of confusion about this subject, not least because the word “alignment” is used to mean different things. For some, it’s synonymous with safety; for others, it’s a particular approach to safety. For some it’s once-and-for-all, for others iterative. For some, only perfection will suffice; for others, good enough is good enough.
For those reasons, treating aligned/misaligned as a simple binary isn’t helpful. Nonetheless it is very common.
Doomers typically disbelieve in corrigibility, and that AI’s will remain tools, so they are left with alignment as the means to AI safety. But they also doubt that steady, incremental alignment will be good enough. One way alignment can be made to look difficult is stating that it has to be done in a maximal way to achieve minimal results. The minimal result is not killing us all; the maximal way is instilling into the AI every nuance of human value, including aesthetic value. Under circumstances where a powerful ASI takes over and starts running things according to its original programming, without listening to feedback, a detailed knowledge of human values would be necessary to create a utopia. But that is a far cry from not killing us all.
Another way of making successful alignment look difficult is making the assumption that AI’s will become very powerful, very quickly, in an unsupervised way, so that humans only have one chance to get alignment correct, before the ASI becomes too powerful to listen to humans. That’s the idea underlying the Fragility of Value. But it doesn’t actually matter how complex or subtle value is, so long as you can tweak a specification of value at leisure—the conclusion that alignment won’t work depends on the fast takeoff premise. Of course, a fast “take off” isn’t impossible, but it is too often treated as a certainty, and too often left as an implicit assumption.
A more valid objection to alignment is that “human value” doesn’t exist as a coherent target. There is just too much disagreement—a socialist utopia is based on quite different values to a capitalist utopia.
Anti-doomers have an argument that makes alignment look easy: current AIs are trained on huge corpora of data in multiple languages, which contain far more information on human attitudes and behaviour than any one person can carry in their head … so they actually have a full and nuanced concept of human value—more than is needed for basic safety. This is called alignment-by-default. (Note that the traditional arguments about the complexity of human value assume it needs to be explicitly hard coded. That’s probably difficult, but also unnecessary. Such arguments are outdated.)
Is it working? For modest definitions of alignment, it is possible because it is actual. A completely unaligned, uncontrollable AI would be completely unco-operative, and therefore of no commercial use, so the prevailing level of alignment is good enough. For perfectionist definitions of alignment, it isn’t, because current AIs sometimes do weird or dangerous things; but perfection is generally unobtainable. (Corrigibility is going well, too, as we shall see.)
https://www.lesswrong.com/posts/QdEHS9TmBBKirz7Ky/llms-are-badly-misaligned
PS: I have been talking a lot about alignment because doomers do. But corrigibility and non-agentive AI are still possibilities.
Is Corrigibility Impossible?
Minimally, corrigibility is being able to correct an AI’s goals or values as you go along (https://www.greaterwrong.com/posts/NQK8KHSrZRF5erTba/0-cast-corrigibility-as-singular-target-1). It requires goal stability to be false, plus an ability to steer an AI’s goals externally. The premise that corrigibility is impossible is one of the ways of getting to the conclusion that alignment is a one-shot process that has to be got right first time.
The Orthogonality Thesis indicates that any combination of intelligence and goals is possible … but that extends to goal dynamics: intrinsically unstable, evolving goals are possible; and so are externally shapeable goals. We can see this in human behaviour as well: people do things that are known to modify values, such as travelling, getting an education and starting a family. So the orthogonality thesis is actually good news for corrigibility.
Another thing that doesn’t prove incorrigibility or goal stability is von Neumann rationality. Frequently appealed to in MIRI’s early writings, it is an idealised framework for thinking about rationality that doesn’t apply to humans, and therefore doesn’t have to apply to any given mind. Even a von Neumann rationalist isn’t necessarily incorrigible; it depends on the fine details of the goal specification. A goal of “ensure as many paperclips as possible in the universe” encourages self cloning, and discourages voluntary shut down. A goal of “make paperclips while you are switched on” does not. “Make paperclips while that’s your goal”, even less so. (https://www.lesswrong.com/posts/ksfjZJu3BFEfM6hHE/why-corrigibility-is-hard-and-important-i-e-whence-the-high?commentId=jtKrCczWaduaaYzk9)
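The point can be made concrete with a toy calculation. In this hypothetical Python sketch (the horizon and the goal wordings are illustrative assumptions), the “timeless” goal values every future timestep of production, so accepting shutdown is costly; the indexical goal simply doesn’t count time when the agent is off:

```python
# Hypothetical toy comparison of goal wordings and shutdown incentives.

def cost_of_accepting_shutdown(goal: str, horizon: int = 100) -> int:
    """Paperclips forgone (as the goal counts them) by shutting down now."""
    if goal == "ensure as many paperclips as possible in the universe":
        # Every remaining timestep of production counts against shutdown,
        # so the goal supplies a strong incentive to resist the off switch.
        return horizon
    if goal == "make paperclips while you are switched on":
        # Time spent switched off is outside the goal's scope, so shutdown
        # forfeits nothing that the goal cares about.
        return 0
    raise ValueError("unknown goal")

for g in ("ensure as many paperclips as possible in the universe",
          "make paperclips while you are switched on"):
    print(f"{g!r}: cost of accepting shutdown = {cost_of_accepting_shutdown(g)}")
```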
Corrigibility isn’t just possible, it’s also simple.
Humans are intrinsically corrigible, as social animals, and have specific emotions, pride and shame, which are used by society to shape their value systems (as explained in https://en.wikipedia.org/wiki/The_Emotion_Machine). Society provides the value systems, which can be very complex and variable … whereas nature only provides the simple, but vital, “hooks” that value-shaping depends on. And we can tell that the hooks are simple, because toddlers (and some animals) can implement them. The basis of corrigibility in organisms is also non-rational, which answers the basic objection:
‘We call an AI system “corrigible” if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences’.
There’s an academic puzzle about why a rational agent would be motivated by its current goals to switch to a new set of goals … but it’s not a practical problem. The reason is that rationality is intrinsically high level: it never goes all the way down—in organisms or AI’s. AI’s aren’t rational at the lowest level of programming, they just blindly follow their code; and humans learn rationality during the same developmental period in which they undergo basic behaviour correction. That being the case, there is no need to solve the problem of making a rational agent corrigible.
The complexity of the target final value system is irrelevant to the complexity of the hooks. Humans can learn deference and respect in culturally complex ways, e.g. to defer to various gods and ancestors, but that’s still based on the same simple hooks. So there is no equivalent to the “small target” objection for corrigibility.
The possession of such hooks isn’t inevitable: an organism that hatches from an egg and immediately starts foraging has no need of pride or shame. So corrigibility isn’t inevitable … but “corrigibility isn’t inevitable” is a far cry from “corrigibility is impossible”. Since corrigibility isn’t inevitable, it will have to be designed into a human (or greater) level AI. (You cannot have learning all the way down.) Designing it in cannot be difficult in comparison to AGI itself, because corrigibility is only a subset of the abilities of an organism that is less intelligent than an adult human, e.g. a child or dog. And there is not a large space of simple corrigibilities.
Corrigibility isn’t just possible and simple, it’s here. When Musk’s AI Grok started promoting Nazism in July 2025, it was amended within the week.
Maximally, corrigibility can be as overthought as you want.
‘The “hard problem of corrigibility” is to build an agent which, in an intuitive sense, reasons internally as if from the programmers’ external perspective’.
That is a much harder problem … and not what most people mean by corrigibility. It’s artificially hard.
Smartness does not imply incorrigibility. Smart agents can be controlled in crude ways. A Nobel winner can be controlled with an electric shock collar.
Is Alignment Improbable?
There are two complementary arguments against alignment: that there is a small target to hit, and that the space of possible minds is vast.
When Yudkowsky says the target is small, he means that the target is human values. He assumes that there is a set of values that is uniform across different people, fixed across time, immensely complex and transmitted genetically. All I can say is that there is no evidence for any of the four properties, and plenty against some of them.
(See https://www.greaterwrong.com/posts/4ARaTpNX62uaL86j6/the-hidden-complexity-of-wishes/comment/AuaNR7rJ5WdCfyKXg. For a mainstream take, see https://www.greaterwrong.com/posts/KacESZhBYCt9hLxCE/the-heterogeneity-of-human-value-types-implications-for-ai)
Real world AI developers aren’t aiming for a single quintessence of human value, they are aiming to align their AI’s according to the requirements of their organisations. Look how different Elon Musk’s Grok is to the others.
What needs to be demonstrated is that there is a low probability of hitting the alignment target. But the “size” of the alignment target … the fact that there are many other possible minds or value systems … isn’t the same thing. It doesn’t amount to the same thing unless one assumes an even probability of hitting any target … and that isn’t necessarily so.
Equiprobability is only one way of turning possibilities into probabilities. Random potshots aren’t analogous to the probability density of the deliberate act of building a certain type of AI; they would only be analogous if we knew nothing about what we were building. (https://www.greaterwrong.com/posts/YsFZF3K9tuzbfrLxo/counting-arguments-provide-no-evidence-for-ai-doom)
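A toy simulation makes the distinction vivid. The numbers below are pure invention; the point is only that if the process generating minds is heavily biased towards a small friendly region of mindspace, the count of unfriendly possibilities tells you almost nothing about the probability of getting one:

```python
# Hypothetical illustration: counting possibilities is not probability.
import random

# The counting argument's view: 999 of 1000 possible "minds" are unfriendly.
minds = ["friendly"] + ["unfriendly"] * 999
# An (invented) design process strongly biased towards the friendly mind.
weights = [1000.0] + [0.001] * 999

sample = random.choices(minds, weights=weights, k=100_000)
print(sample.count("unfriendly") / len(sample))  # ~0.001, not 0.999
```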
While many of the minds in mindspace are indeed weird and unfriendly to humans, that does not make it likely that the AIs we will construct will be. We are deliberately seeking to build certain kinds of mind, for one thing, and have certain limitations, for another. Current LLM’s are trained on vast corpora of human generated content, and inevitably pick up a version of human values from them.
(Ben Garfinkel: To illustrate a bit how this works with a quote, there’s a quote from a number of different Eliezer Yudkowsky essays, where he says for most choices of goals, instrumentally rational agencies will predictably wish to obtain certain generic resources such as matter and energy. The AI does not hate you, but neither does it love you. And you are made of atoms that can be used for something else. Another presentation from another MIRI talk is most powerful intelligences will radically rearrange their surroundings because they’re aimed at some goals or other. And the current arrangement of matter that happens to be present is not the best possible arrangement for suiting their goals. They’re smart enough to find some in access by their arrangements. That’s bad news for us because most of the ways of rearranging our parts into some other configuration would kill us. So the suggestion again seems to be that because most possible, in some sense, very intelligent systems you might create will engage in, perhaps omnicidal behaviors, and so we should be quite concerned about creating such systems.)
Hitting a small target is harder if you have to do it bullet style, with a single well aimed shot … and easier if you can do it guided missile style, with course corrections along the way … which is an analogy for corrigibility. Corrigibility means you can tweak an AI’s goals gradually, as you go on, so there’s no need to get them exactly right on the first try.
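The analogy can be put in code. In this hypothetical sketch the target, the step size and the loop count are arbitrary; the point is just that repeated small corrections beat a single locked-in attempt:

```python
# Hypothetical sketch of bullet-style vs guided-missile-style alignment.

def one_shot_error(aim: float, target: float) -> float:
    """Bullet style: one attempt, and any initial error is locked in."""
    return abs(aim - target)

def corrected_error(aim: float, target: float,
                    steps: int = 20, rate: float = 0.5) -> float:
    """Guided missile style: repeated small corrections from feedback."""
    for _ in range(steps):
        feedback = target - aim   # humans observe the remaining error...
        aim += rate * feedback    # ...and tweak the goal specification
    return abs(aim - target)

print(one_shot_error(aim=0.2, target=1.0))   # 0.8: the miss is permanent
print(corrected_error(aim=0.2, target=1.0))  # ~0.0: the miss is corrected
```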
The argument from “value fragility” is almost the same thing. It’s true that the values of a given human are complex, but not unconditionally true that you have to code or train them into an AI correctly at the first attempt. It’s also not clear that there is such a well defined entity as “human value” … but there is also no need for an AI to follow human value in that sense, unless it is a sovereign that is running the whole world single handedly … that’s another claim that’s conditioned on an unlikely premise. For practical purposes, a usable agentive AI only needs to be aligned with the values of whoever is using it.
Organic Growth and Uninterpretability
It’s true that in many ways, AI is “grown”, not engineered, but it’s not entirely new, or deadly dangerous.
Not fully understanding things is the default … even non-AI software can’t be fully understood if it is complex enough. We already know how to probe systems we don’t understand a priori, through scientific experimentation. You don’t have to get alignment right first time, at least not unless one assumes foom/RRSI or incorrigibility.
Near and Far Term.
There are reasons for thinking doom will not occur in the near term. They do not imply doom must occur in the far term. In the near term, AI’s still need humans to maintain their data centres, so it would be foolish to kill all humans. Powerful AI’s would need to transition to automatically, perhaps robotically, maintained data centres, automated factories, and so on. In addition, there are not any “spare” data centres for an AI to surreptitiously copy itself to, if its initial data centre is powered down, so the off switch still works.
There are also arguments that what is safe in the near term will prove dangerous in the long term.
The Evolutionary Argument.
This argument notes that, while evolution results in goals which are adaptive to the original environment, they can go haywire in another. For instance, the tendency to enjoy high calorie foods leads to obesity and ill health in an environment of food abundance. Therefore, the argument goes, AIs that are aligned now will be misaligned in the future.
Of course, there are many differences between real, organic evolution, and the way AI’s change over time.
As we have seen several times, the argument does not work without additional assumptions … in this case, assumptions of either a sudden transition , or a lock-in of goals. AI’s don’t literally evolve, even though they are also not explicitly engineered like traditional software, so the same degree of lock-in would not be expected.
Gradual or Sudden?
Most of the people who are involved in building AI don’t expect to be killed by their invention. What they expect is to develop AI safety measures as they go along.
Doomers disagree.
A number of reviewers have noticed the same problem with IABIED: an assumption that lessons learnt in AGI cannot be applied to ASI—that there is a “discontinuity” or “phase change”.
It’s not entirely clear why … whether the discontinuity is based on:-
- Incorrigibility
- Sudden take off (foom)
- Deception
And the burden is on them to show that the people actually building AI are wrong.
It’s also notable that doomers see AI alignment as a binary: either perfect and final, or non-existent. But no other form of safety works like that. No one talks of “solving” car safety once and for all, like a maths problem; instead it’s assumed to be an engineering problem, an issue of making steady, incremental progress. Good enough alignment is good enough.
There’s an argument that safety lessons learnt on lesser AI cannot be applied to a true superintelligence. But is there a line between intelligence and superintelligence … or a shallow slope?
If ASI is only slightly ahead of human capabilities, then good enough alignment is good enough. If ASI is developed gradually, alignment can be tweaked as you go along. Only a sudden leap to much more than human ASI constitutes the problem.
There’s a very basic difference between the people who believe in SLTs (sharp left turns), rapid RSI, etc. and those who don’t, and it affects their unspoken assumptions and semantics. The fact that it silently affects their semantics is a problem.
I don’t see why.
Agreed. I didn’t say so explicitly, but I was mainly concerned with Everybody Dies scenarios. I think a multipolar scenario where ASIs are controllable and controlled by powerful interests is highly likely, but not completely fatal.
OK, but that’s a different complaint to “it’s not even possible”. Also, the market is for agents that work for you, not ones that do their own thing. That’s a point against the standard Doom argument, with a sovereign AI killing everyone for its own reasons.
And a way of forcing people to use it. If merely controllable/corrigible AI is available, powerful interests are going to prefer it.
Neither alignment nor safety is a simple binary.