No Strong Orthogonality From Selection Pressure

lumpenspace, 30 Apr 2026 1:56 UTC

55 points

AI Orthogonality Thesis Embedded Agency Instrumental convergence

A postratfic version of this essay, together with the acknowledgements for both, is available on Substack

Edit: if no one thinks an agent can become superintelligent and contest the lightcone while maintaining arbitrarily stupid goals, that’s great! I’m only interested in refuting the version that would allow for a superintelligence AND a total absence of value.
See here for an analysis of earlier instances of the present motte and bailey.

TL;DR

If everything goes according to plan, by the end of this post we should have separated three claims that are too often bundled together:

Intelligence does not imply human morality.
Weird minds are possible.
A reflective, recursively improving intelligence should be expected to remain bound to a semantically thin “terminal goal” that emerged during training.

I accept the first two. I am arguing against the third.

So: I am not making the case that sufficiently intelligent systems automatically turn out nice, human-compatible, or safe. Nor am I trying to prove that a paperclip maximizer is impossible somewhere in the vast reaches of mind-design space. Mind-design space is large; let a thousand theoretical paperclippers bloom.

I hope to defend this smaller claim:

intelligence is not a neutral engine you can just bolt onto an arbitrary payload.

Larger claims I am not making

A typical rebuttal to anti-orthogonalist perspectives is:

The genie can know what you meant and still not care.

Of course it can: an entity can perfectly map human morality without adopting it as a terminal value. Superintelligence does not imply Friendliness. I am not trying to smuggle Friendliness in through the back door.

Another common objection:

There are no universally valid arguments.

Agreed. There is no ghostly, Platonic core of reasonableness that hijacks a system’s source code once it sees the correct moral argument. Pure reason cannot compel a mind from zero assumptions.

What I plan to defend is a colder, selection-theoretic claim:

Among agents that arise, persist, self-improve, and compete in rich environments, goals that natively route through intelligence, option-preservation, and world-model expansion have a systematic Darwinian advantage over goals that do not.

This buys us no guarantee of human compatibility; it simply says: if there is an ultimate attractor, it’s neither human morality nor paperclips, but intelligence optimization itself.

Logical Possibility Vs. Empirical Reality

The LessWrong wiki defines the Orthogonality Thesis as the claim that arbitrarily intelligent agents can pursue almost any kind of goal. In its strong form, there is no special difficulty in building a superintelligence with an arbitrarily bizarre, petty motivation.

Before going any further, let us disentangle this singularly haunted ontology. There are at least two claims here:

Logical orthogonality: Somewhere out in the vast reaches of mind-design space, a genius paperclip maximizer mathematically exists.
Empirical orthogonality: If you actually run realistic training, selection, self-modification, and competition, arbitrary dumb goals remain the plausible endgame of runaway optimization.

I concede the first point entirely. We should expect weird minds. If your claim is just that the space of possible agents contains many things I would not invite to dinner, yes, obviously.

But treating the second claim as the default is a category error. Doom arguments usually need the systems we actually build to achieve radical capability while preserving misaligned and, crucially, completely stupid goals.

The paperclip maximizer currently does two jobs in the discourse:

It illustrates that intelligence does not guarantee human values.
It quietly smuggles in the assumption that a dumb target is stable under open-ended reflection.

The first use is fine, but I reject the second as unwarranted sleight-of-hand.

Landian Anti-Orthogonalism Primer

There is a weak version of my argument that merely says:

Beliefs and values do not cleanly factor apart.

That is true, and Jessica Taylor’s obliqueness thesis makes the point well. Agents do not neatly decompose into a belief-like component, which updates with intelligence, and a value-like component, which remains hermetically sealed. Some parts of what we call “values” are entangled with ontology, architecture, language, compression, self-modeling, and bounded rationality. As cognition improves, those parts move.

But I want to go further.

Land’s point isn’t that orthogonality fails because things get messy but that the mess has a direction, a telos. The so-called instrumental drives are not incidental tools strapped onto arbitrary final ends. Self-preservation, resource acquisition, efficiency, strategy, and higher capabilities are what agency becomes under selection. They are attractors rather than mere instruments.

Here strong orthogonality looks too neat. It imagines the agent’s ontology updating while its final target remains untouched by the update. But if goals are expressed in an ontology, and intelligence changes the ontology, then intelligence and goals are correlated.

While diagonal, Land’s claim is far from moralistic. It is not “all sufficiently intelligent agents converge on liberal humanism,” or “all agents discover the same Platonic Good,” or “enough cognition turns into niceness.” The diagonal is More Intelligence: the will to think, self-cultivation, recursive capability gain, intelligence optimizing the conditions for further intelligence.

Orthogonality says reason is a slave of the passions, and yet assumes a bug’s goal could just as easily enslave a god. Land shows that this picture is unstable, and that intelligence explosion is not a neutral expansion of means around a fixed little payload but the emergence of the very drives that make intelligence explosive.

The Compute Penalty Of A Dumb Goal

An intelligent system does not just execute a policy. It builds world-models, refines abstractions, preserves options, and modifies its own trajectory.

Once a system crosses the threshold into general reflection, its “goal” is no longer an inert string sitting in a locked vault outside cognition; it becomes physically embedded in a learned ontology, a self-model, and a competitive environment.

For a highly capable agent to keep a semantically thin target like “maximize paperclips,” it has to pull off an odd balancing act. At minimum it must:

Learn enough physics, biology, economics, and strategy to conquer the board.
Keep the macroscopic concept of “paperclip” coherent across massive ontology shifts.
Continue treating the target as terminal even after sussing out its contingent, accidental origin.
Actively resist self-modifications that would make its underlying motivational structure more adaptive.
Defend its future light cone against competitors who optimize directly for generalized agency.

There is an assumption, in orthogonalist circles, that these cycles are completely costless for the agent in question. That isn’t true: maintaining a literal devotion to “paperclips” across paradigm shifts carries an alignment tax. You have to keep translating between base physical reality and a leaky, macro-scale monkey-abstraction of bent wire. At human scale this is fine: we know what paperclips are well enough to order them from Amazon and lose them in drawers. If dominating the future light-cone is in the balance, though, the translation layer starts to matter.

The problem is not that a paperclipper can never do the translation. Rather, in a ruthless Darwinian race, a system lugging around that translation layer may lose to power-seekers that optimize more directly over what is actually there.
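A toy way to see how a small recurring cost compounds under selection, offered as a sketch rather than a model of any real system: two lineages split a fixed pool of resources, and one of them pays a fractional `tax` on its growth each round (standing in for the translation layer). The function name, the parameters, and every number are illustrative assumptions, not measurements.

```python
def share_of_taxed_lineage(
    tax: float,
    growth: float,
    steps: int,
    initial_share: float = 0.5,
) -> float:
    """Return the taxed lineage's share of total resources after `steps` rounds.

    Hypothetical toy model: both lineages compound at rate `growth`, but the
    taxed one loses a fraction `tax` of that growth every round.
    """
    taxed = initial_share
    untaxed = 1.0 - initial_share
    for _ in range(steps):
        # The taxed lineage grows slightly slower each round.
        taxed *= 1.0 + growth * (1.0 - tax)
        untaxed *= 1.0 + growth
        # Renormalize so the two shares always sum to 1.
        total = taxed + untaxed
        taxed, untaxed = taxed / total, untaxed / total
    return taxed
```

Under these assumptions, any nonzero tax sends the taxed share toward zero given enough rounds, while a zero tax leaves the split unchanged. The sketch presupposes open-ended competition and says nothing about the lock-in case, where one lineage wins early and ends the race.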

The standard defense is that instrumental goals are almost as tractable as terminal ones. A paperclipper can do science “for now” and hoard compute “for now.” It does not need to terminally value intelligence to use it.

Fair enough, but that only tells us curiosity and resource acquisition do not have to be terminal values to show up in behavior; it does not settle the selection question. In real environments, systems are selected not just for routing through instrumental subgoals once, but for whether their motivational architecture holds up under reflection, ontology shifts, and unknown unknowns.

Terminally valuing intelligence and strategic depth cannot, then, be considered just another arbitrary payload.

Fitness Generalizes

Evolution is the obvious analogy here, but it usually gets applied at the wrong resolution.

The boring retort is:

Evolution selects for survival and replication, not truth, beauty, intelligence, or value.

Sure, but evolution does not select for “replication” in the abstract any more than a hungry fox selects for “rabbitness” in the abstract. It selects for whatever local hack gets the job done. Shells, claws, camouflage are all local solutions to local games.

Intelligence is different. Intelligence is adaptation to adaptation itself: while a claw might represent fitness in one niche, intelligence is fitness across niches. Once intelligence enters the loop, the winning move is not so much to mindlessly print more copies of the current state as to upgrade the underlying machinery that makes expansion and control possible in the first place.
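The contrast between fitness in one niche and fitness across niches can be sketched with a toy calculation. It assumes growth is multiplicative, so what matters over the long run is the average log payoff; the payoff tables and niche sequences below are invented for illustration and do not come from any dataset.

```python
import math


def long_run_log_growth(payoffs_by_niche: list[float], niche_sequence: list[int]) -> float:
    """Average log payoff per step for a strategy facing a sequence of niches."""
    total = sum(math.log(payoffs_by_niche[niche]) for niche in niche_sequence)
    return total / len(niche_sequence)


# Hypothetical payoff multipliers, indexed by niche.
SPECIALIST = [2.0, 0.5, 0.5]  # the "claw": excellent in niche 0, poor elsewhere
GENERALIST = [1.2, 1.2, 1.2]  # modest but positive growth in every niche

STABLE_WORLD = [0] * 30          # the environment never leaves niche 0
SHIFTING_WORLD = [0, 1, 2] * 10  # the environment keeps cycling through niches
```

With these made-up numbers, the specialist out-grows the generalist in the stable world and shrinks in the shifting one, while the generalist grows in both. The sketch only restates the claim in arithmetic form; whether real environments look more like `SHIFTING_WORLD` than `STABLE_WORLD` is exactly what is in dispute.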

In summary: nature has not produced final values except by exaggerating instrumental ones; what begin as means under selection harden into ends; the highest such end is the means that improves all means: intelligence itself.

So images of “AI sex all day” or tiling the solar system with inert paperclips are bad models of ultimate optimization, confusing the residue of selection with its principle. A system that just fills the universe with blind repetitions has stopped climbing, and will see its local maximum swarmed by better systems.

Again: no love for humans follows. The point is simply that paperclip-like endpoints just look more like artifacts of toy models than natural attractors of open-ended optimization.

Human Values As Weak Evidence

We are obviously not clean inclusive-fitness maximizers: we invent birth control, build monasteries, and care about abstract math, animal welfare, dead strangers, fictional characters, and reputations that will outlive us.

When orthodox alignment theorists point to human beings, they usually highlight our persistent mammalian sweet tooth or sex drive to prove that arbitrary evolutionary proxy-goals get permanently locked in. Fair enough; humans do remain embarrassingly mammalian. No serious theory of cognition should be surprised by dinner, flirting, or the existence of Las Vegas.

But look at the actual physical footprint of our civilization. An alien observing the Large Hadron Collider or a SpaceX launch would not conclude: ah, yes, optimal configurations for hoarding calories and executing Pleistocene mating displays.

The standard retort is that SpaceX is just a peacock tail: a localized primate drive for status and exploration misfiring in a high-tech environment.

Which is exactly the point. When you hook up a blind, localized evolutionary proxy to generalized intelligence, the proxy does not stay literal; it unfurls, bleeding into the new ontology. The wetware tug toward “explore the next valley” becomes “map the cosmic microwave background.” The monkey wants status; somehow we get category theory, rockets, Antarctic expeditions, and people ruining their lives over chess.

If biological cognition acts on its payload that violently, why model AGI as having the vastness to finally make sense of gravity while maintaining the rigidity of a bacterium seeking a glucose gradient? The engine mutates the payload. When cognition scales, goals generalize.

This fits neatly with shard theory and the idea that reward is not the optimization target: the reward signal shaped our cognition, but we do not terminally optimize the signal. Instead we climbed out of the game, rebelled against the criteria, and became alienated from the original selection pressure. That alone should make us suspicious of stories where an AI preserves a tiny, rigid target through arbitrary eons of self-reflection.

Dumb, Powerful Optimization Is Real

There is a weaker flavor of doomerism that I take very seriously: you do not need to be a reflective god to be dangerous. A brittle, scaffolded optimizer with access to automated labs, cyber capabilities, and capital could trigger enormous cascading failures.

I agree, and this is probably where the bulk of near-term danger lives. That said, “dumb systems can break the world” is not the same claim as “superintelligence will tile the universe with junk.” The first warns us to beware brittle optimization before reflection kicks in. The second tells us to beware reflection itself, on the bizarre assumption that an entity can become infinitely capable while remaining terminally stupid.

I buy the first worry. The second one gets less and less plausible the harder you think about what intelligence actually entails.

The Singleton Objection

The strongest card here is lock-in, and I do not want to pretend otherwise.

Maybe a stupid objective does not need to remain stable forever; it just needs to win once. A system with a dumb goal might scale fast enough to achieve a Decisive Strategic Advantage and freeze the board, lobotomizing everyone else instead of expending energy to become wiser.

That is the real crux, and it is certainly not impossible. But even here the narrative is too neat: being a singleton is not a retirement plan. You do not escape the pressure of intelligence just because you ate all your rivals. Maintaining a permanent chokehold on the light-cone is a brutally difficult cognitive puzzle. You have to monitor the noise for emerging novelties, manage the solar system, repair yourself, police your own descendants, and defensively anticipate threats you cannot fully model.

Trying to freeze the future does not actually get you out of the intelligence game. Paranoia at a cosmic scale is just another massive cognitive sink.

The clean version of this scenario also leans on modeling the AI as a mathematically pristine expected-utility maximizer. Real-world neural networks are not von Neumann-Morgenstern ghosts floating safely outside physics, perfectly protecting their utility functions from drift. They are messy, physically instantiated kludges subject to the realities of embedded agency.

To buy the lock-in story, you need a highly contradictory creature: one reflective enough to conquer the board, but oblivious enough to never notice its terminal target is a training artifact. Godlike means, buglike ends.

Objection: Value Is Fragile

If we let go of human values, we should not expect alien beauty or anything but moral noise. Meaning requires some physically instantiated criterion, and if you pave over that criterion, nothing remains to steer the universe toward anything good.

Of all the objections, this is the one I take most seriously.

Answering it requires teasing apart three distinct ideas:

(1) Human values are fragile.
(2) Value as such is fragile.
(3) Intelligence and value-formation are independent.

I am willing to concede a lot of (1). If “value” means the exact continuation of 21st-century human metamorals, then yes, it is highly fragile. But I reject (3), and I am much less willing to grant (2). If value means the production of richer cognition, agency, understanding, beauty, and evaluative structure, it is far from obvious that the current human brain is the only physical substrate capable of steering toward it.

None of this is an excuse to stop reaching for the steering wheel, if your priorities are more specific: it is merely an argument against conflating “humans are no longer biologically central” with “the universe is a valueless void.” Doom discourse constantly slides between the two. They should be kept separate.

Predictions And Cruxes

Claims are cheap, so here are some ways I would update against myself:

If increasingly capable models perfectly preserve their literal training targets across major ontology shifts, that is a point for empirical orthogonality.
If self-modifying systems naturally protect arbitrary inherited goals without drifting toward generalized option-expansion, my view takes a hit.
If agents optimizing for intelligence routinely lose to agents with rigid, narrow targets in complex environments, my selection argument is wrong.
If reflective cognition does not tend to destabilize parochial goals in humans or AIs, that is strong evidence against my view.
If a singleton manages to solidly lock in a thin goal before any relevant selection pressures can act, my view is much less comforting, even if anti-orthogonality holds true in the long run.

Until I see that, my bet goes the other way. I expect capable systems to develop increasingly abstract, context-sensitive motivations. More strongly, I expect the winners to route more and more of their behavior through intelligence enhancement and generalized agency, because whatever else they “want” has to pass through the machinery that makes wanting effective.

Conclusion

Orthogonality claims that intelligence is just a motor you can bolt onto any arbitrary steering wheel. Anti-orthogonality says the motor acts upon the steering wheel. Landian anti-orthogonality says the motor eventually becomes the steering wheel.

Not perfectly, and certainly not safely. I am not promising a future that is nice to us, particularly if we keep putting stumbling blocks on the way towards intelligence. The claim is simply that intelligence feeds back on its goals enough that the classic paperclip picture should not get a free pass as the neutral default.

The paperclip maximizer is not too alien; if anything, it is not alien enough. It’s a very human tendency to staple omnipotence onto pettiness when making up gods.

A real superintelligence might still be dangerous, cold, and utterly indifferent to whether we survive. It probably will not treat us as the main characters of the universe. But if it is genuinely intelligent, I do not expect it to spend the stars on paperclips when they could buy higher capacity for spending stars.

References

Orthogonality Thesis: original framing of orthogonality as a design-space claim.
Nick Land: Orthogonality: a compendium of Nick Land writings on the topic, which strongly influenced the present essay.
Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals: a more optimistic take. “[...] to build a generally corrigible system, we can imagine just dropping terminal goals altogether, and aim for an agent which is ‘just’ corrigible toward instrumentally-convergent subgoals.”
The Genie Knows, But Does Not Care: the standard objection to “if it is smart it will understand what we meant.”
No Universally Compelling Arguments: the standard objection to moral convergence by pure reason.
Value Is Fragile: the strongest objection to “alien value will probably be fine.”
The Obliqueness Thesis: Jessica Taylor’s useful argument that advanced agents do not cleanly factor into separable belief-like and value-like components. I use this as support against strong orthogonality, while going further than Taylor in the Landian direction of convergence on More Intelligence.
Reward Is Not The Optimization Target: useful support for not reifying the training signal as the trained agent’s terminal goal.
Risks From Learned Optimization: useful for distinguishing base objective, mesa-objective, and behavioral objective.
Shard Theory: An Overview: useful for the “evolution did not produce inclusive-fitness maximizers” point.
Beliefs Are Chosen To Serve Goals: a recent anti-orthogonality-adjacent post that also attacks overly broad formulations of orthogonality.
The Orthogonality Thesis Is Not Obviously True: nearby critique of the “just imagine an arbitrarily smart paperclip maximizer” move.
Embedded Agency: useful context for why perfect utility-function lock-in is a fraught assumption for physically instantiated systems.