No Strong Orthogonality From Selection Pressure
A postratfic version of this essay, together with the acknowledgements for both, is available on Substack
Edit: if no one thinks an agent can become superintelligent and contest the lightcone while maintaining arbitrarily stupid goals, that’s great! I’m only interested in refuting the version that would allow for a superintelligence AND a total absence of value.
See here for an analysis of earlier instances of the present motte and bailey.
TL;DR
If everything goes according to plan, by the end of this post we should have separated three claims that are too often bundled together:
Intelligence does not imply human morality.
Weird minds are possible.
A reflective, recursively improving intelligence should be expected to remain bound to a semantically thin “terminal goal” that emerged during training.
I accept the first two. I am arguing against the third.
So: I am not making the case that sufficiently intelligent systems automatically turn out nice, human-compatible, or safe. Nor am I trying to prove that a paperclip maximizer is impossible somewhere in the vast reaches of mind-design space. Mind-design space is large; let a thousand theoretical paperclippers bloom.
I hope to defend this smaller claim:
intelligence is not a neutral engine you can just bolt onto an arbitrary payload.
Larger claims I am not making
A typical rebuttal to anti-orthogonalist perspectives is:
The genie can know what you meant and still not care.
Of course it can: an entity can perfectly map human morality without adopting it as a terminal value. Superintelligence does not imply Friendliness. I am not trying to smuggle Friendliness in through the back door.
Another common objection:
There are no universally valid arguments.
Agreed. There is no ghostly, Platonic core of reasonableness that hijacks a system’s source code once it sees the correct moral argument. Pure reason cannot compel a mind from zero assumptions.
What I plan to defend is a colder, selection-theoretic claim:
Among agents that arise, persist, self-improve, and compete in rich environments, goals that natively route through intelligence, option-preservation, and world-model expansion have a systematic Darwinian advantage over goals that do not.
This buys us no guarantee of human compatibility; it simply says: if there is an ultimate attractor, it’s neither human morality nor paperclips, but intelligence optimization itself.
Logical Possibility Vs. Empirical Reality
The LessWrong wiki defines the Orthogonality Thesis as the claim that arbitrarily intelligent agents can pursue almost any kind of goal. In its strong form, there is no special difficulty in building a superintelligence with an arbitrarily bizarre, petty motivation.
Before going any further, let us disentangle this singularly haunted ontology. There are at least two claims here:
Logical orthogonality: Somewhere out in the vast reaches of mind-design space, a genius paperclip maximizer mathematically exists.
Empirical orthogonality: If you actually run realistic training, selection, self-modification, and competition, arbitrary dumb goals remain the plausible endgame of runaway optimization.
I concede the first point entirely. We should expect weird minds. If your claim is just that the space of possible agents contains many things I would not invite to dinner, yes, obviously.
But treating the second claim as the default is a category error. Doom arguments usually need the systems we actually build to achieve radical capability while preserving misaligned and, crucially, completely stupid goals.
The paperclip maximizer currently does two jobs in the discourse:
It illustrates that intelligence does not guarantee human values.
It quietly smuggles in the assumption that a dumb target is stable under open-ended reflection.
The first use is fine, but I reject the second as unwarranted sleight-of-hand.
Landian Anti-Orthogonalism Primer
There is a weak version of my argument that merely says:
Beliefs and values do not cleanly factor apart.
That is true, and Jessica Taylor’s obliqueness thesis makes the point well. Agents do not neatly decompose into a belief-like component, which updates with intelligence, and a value-like component, which remains hermetically sealed. Some parts of what we call “values” are entangled with ontology, architecture, language, compression, self-modeling, and bounded rationality. As cognition improves, those parts move.
But I want to go further.
Land’s point isn’t that orthogonality fails because things get messy but that the mess has a direction, a telos. The so-called instrumental drives are not incidental tools strapped onto arbitrary final ends. Self-preservation, resource acquisition, efficiency, strategy, and higher capabilities are what agency becomes under selection. They are attractors rather than mere instruments.
Here strong orthogonality looks too neat. It imagines the agent’s ontology updating while its final target remains untouched by the update: if goals are expressed in an ontology, and intelligence changes the ontology, then intelligence and goals are correlated.
While diagonal, Land’s claim is far from moralistic. It is not “all sufficiently intelligent agents converge on liberal humanism,” or “all agents discover the same Platonic Good,” or “enough cognition turns into niceness.” The diagonal is More Intelligence: the will to think, self-cultivation, recursive capability gain, intelligence optimizing the conditions for further intelligence.
Orthogonality says reason is a slave of the passions, and yet assumes a bug’s goal could just as easily enslave a god. Land shows that this picture is unstable, and intelligence explosion is not a neutral expansion of means around a fixed little payload but the emergence of the very drives that make intelligence explosive.
The Compute Penalty Of A Dumb Goal
An intelligent system does not just execute a policy. It builds world-models, refines abstractions, preserves options, and modifies its own trajectory.
Once a system crosses the threshold into general reflection, its “goal” is no longer an inert string sitting in a locked vault outside cognition; it becomes physically embedded in a learned ontology, a self-model, and a competitive environment.
For a highly capable agent to keep a semantically thin target like “maximize paperclips,” it has to pull off an odd balancing act. At minimum it must:
Learn enough physics, biology, economics, and strategy to conquer the board.
Keep the macroscopic concept of “paperclip” coherent across massive ontology shifts.
Continue treating the target as terminal even after sussing out its contingent, accidental origin.
Actively resist self-modifications that would make its underlying motivational structure more adaptive.
Defend its future light cone against competitors who optimize directly for generalized agency.
There is an assumption, in orthogonalist circles, that these cycles are completely costless for the agent in question. That isn’t true: maintaining a literal devotion to “paperclips” across paradigm shifts carries an alignment tax. You have to keep translating between base physical reality and a leaky, macro-scale monkey-abstraction of bent wire. At human scale this is fine: we know what paperclips are well enough to order them from Amazon and lose them in drawers. Once dominating the future light-cone is at stake, though, the translation layer starts to matter.
The problem is not that a paperclipper can never do the translation: rather, in a ruthless Darwinian race, a system lugging around that translation layer may lose to power-seekers that optimize more directly over what is actually there.
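A toy sketch can make the shape of this selection claim concrete. It rests on assumptions that are not in the essay: two lineages that grow exponentially without interacting, one of which pays a fixed fractional `tax` on its growth rate to maintain the translation layer. The names `taxed_share`, `base_rate`, and `tax` are hypothetical, picked for illustration. The sketch shows only that a small constant penalty compounds into a vanishing share under pure competition; it says nothing about whether real systems pay such a penalty.

```python
import math


def taxed_share(base_rate: float, tax: float, steps: int,
                initial_share: float = 0.5) -> float:
    """Share of total resources held by the taxed lineage after `steps` rounds.

    Toy assumptions (not from the essay): both lineages grow exponentially,
    the taxed one at base_rate * (1 - tax), with no interaction and no lock-in.
    """
    log_taxed = math.log(initial_share) + base_rate * (1.0 - tax) * steps
    log_free = math.log(1.0 - initial_share) + base_rate * steps
    # share = taxed / (taxed + free), computed from the log-ratio so the
    # individual population sizes never have to be materialized.
    return 1.0 / (1.0 + math.exp(log_free - log_taxed))


if __name__ == "__main__":
    for steps in (0, 10, 100, 1000):
        share = taxed_share(base_rate=0.1, tax=0.05, steps=steps)
        print(f"steps={steps:5d}  taxed share={share:.4f}")
```

With these made-up numbers the taxed lineage starts at an even split and drops below one percent of the total by step 1000. The model deliberately leaves out the case where the taxed lineage wins early and freezes the board, which is the singleton objection taken up later in the essay.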
The standard defense is that instrumental goals are almost as tractable as terminal ones. A paperclipper can do science “for now” and hoard compute “for now.” It does not need to terminally value intelligence to use it.
Fair enough, but that only tells us that curiosity and resource acquisition do not have to be terminal values to show up in behavior; it does not settle the selection question. In real environments, systems are selected not just for routing through instrumental subgoals once, but for whether their motivational architecture holds up under reflection, ontology shifts, and unknown unknowns.
Terminally valuing intelligence and strategic depth cannot then be considered as just another arbitrary payload.
Fitness Generalizes
Evolution is the obvious analogy here, but it usually gets applied at the wrong resolution.
The boring retort is:
Evolution selects for survival and replication, not truth, beauty, intelligence, or value.
Sure, but evolution does not select for “replication” in the abstract any more than a hungry fox selects for “rabbitness” in the abstract. It selects for whatever local hack gets the job done. Shells, claws, camouflage are all local solutions to local games.
Intelligence is different. Intelligence is adaptation to adaptation itself: while a claw might represent fitness in one niche, intelligence is fitness across niches. Once intelligence enters the loop, the winning move is no longer to mindlessly print more copies of the current state so much as to upgrade the underlying machinery that makes expansion and control possible in the first place.
In summary: nature has not produced final values except by exaggerating instrumental ones; what begin as means under selection harden into ends; the highest such end is the means that improves all means: intelligence itself.
So images of “AI sex all day” or tiling the solar system with inert paperclips are bad models of ultimate optimization, confusing the residue of selection with its principle. A system that just fills the universe with blind repetitions has stopped climbing, and will see its local maximum swarmed by better systems.
Again: no love for humans follows. The point is simply that paperclip-like endpoints just look more like artifacts of toy models than natural attractors of open-ended optimization.
Human Values As Weak Evidence
We are obviously not clean inclusive-fitness maximizers: we invent birth control, build monasteries, and care about abstract math, animal welfare, dead strangers, fictional characters, and reputations that will outlive us.
When orthodox alignment theorists point to human beings, they usually highlight our persistent mammalian sweet tooth or sex drive to prove that arbitrary evolutionary proxy-goals get permanently locked in. Fair enough; humans do remain embarrassingly mammalian. No serious theory of cognition should be surprised by dinner, flirting, or the existence of Las Vegas.
But look at the actual physical footprint of our civilization. An alien observing the Large Hadron Collider or a SpaceX launch would not conclude: ah, yes, optimal configurations for hoarding calories and executing Pleistocene mating displays.
The standard retort is that SpaceX is just a peacock tail: a localized primate drive for status and exploration misfiring in a high-tech environment.
Which is exactly the point. When you hook up a blind, localized evolutionary proxy to generalized intelligence, the proxy does not stay literal: it unfurls, bleeding into the new ontology. The wetware tug toward “explore the next valley” becomes “map the cosmic microwave background.” The monkey wants status; somehow we get category theory, rockets, Antarctic expeditions, and people ruining their lives over chess.
If biological cognition acts on its payload that violently, why model AGI as having the vastness to finally make sense of gravity while maintaining the rigidity of a bacterium seeking a glucose gradient? The engine mutates the payload. When cognition scales, goals generalize.
This fits neatly with shard theory and the idea that reward is not the optimization target: the reward signal shaped our cognition, but we do not terminally optimize the signal. Instead we climbed out of the game, rebelled against the criteria, and became alienated from the original selection pressure. That alone should make us suspicious of stories where an AI preserves a tiny, rigid target through arbitrary eons of self-reflection.
Dumb, Powerful Optimization Is Real
There is a weaker flavor of doomerism that I take very seriously: you do not need to be a reflective god to be dangerous. A brittle, scaffolded optimizer with access to automated labs, cyber capabilities, and capital could trigger enormous cascading failures.
I agree, and this is probably where the bulk of near-term danger lives. That said, “dumb systems can break the world” is not the same claim as “superintelligence will tile the universe with junk.” The first warns us to beware brittle optimization before reflection kicks in. The second tells us to beware reflection itself, on the bizarre assumption that an entity can become infinitely capable while remaining terminally stupid.
I buy the first worry. The second one gets less and less plausible the harder you think about what intelligence actually entails.
The Singleton Objection
The strongest card here is lock-in, and I do not want to pretend otherwise.
Maybe a stupid objective does not need to remain stable forever, it just needs to win once. A system with a dumb goal might scale fast enough to achieve a Decisive Strategic Advantage and freeze the board, lobotomizing everyone else in lieu of expending energy to become wiser.
That is the real crux, and it is certainly not impossible, but even here the narrative is too neat: being a singleton is not a retirement plan. You do not escape the pressure of intelligence just because you ate all your rivals. Maintaining a permanent chokehold on the light-cone is a brutally difficult cognitive puzzle. You have to monitor the noise for emerging novelties, manage the solar system, repair yourself, police your own descendants, and defensively anticipate threats you cannot fully model.
Trying to freeze the future does not actually get you out of the intelligence game. Paranoia at a cosmic scale is just another massive cognitive sink.
The clean version of this scenario also leans on modeling the AI as a mathematically pristine expected-utility maximizer. Real-world neural networks are not von Neumann-Morgenstern ghosts floating safely outside physics, perfectly protecting their utility functions from drift. They are messy, physically instantiated kludges subject to the realities of embedded agency.
To buy the lock-in story, you need a highly contradictory creature: one reflective enough to conquer the board, but oblivious enough to never notice its terminal target is a training artifact. Godlike means, buglike ends.
Objection: Value Is Fragile
If we let go of human values, we should not expect alien beauty or anything but moral noise. Meaning requires some physically instantiated criterion, and if you pave over that criterion, nothing remains to steer the universe toward anything good.
Of all the objections, this is the one I take most seriously.
Answering it requires teasing apart three distinct ideas:
Human values are fragile.
Value as such is fragile.
Intelligence and value-formation are independent.
I am willing to concede a lot of (1). If “value” means the exact continuation of 21st-century human metamorals, then yes, it is highly fragile. But I reject (3), and I am much less willing to grant (2). If value means the production of richer cognition, agency, understanding, beauty, and evaluative structure, it is far from obvious that the current human brain is the only physical substrate capable of steering toward it.
None of this is an excuse to stop reaching for the steering wheel, if your priorities are more specific: it is merely an argument against conflating “humans are no longer biologically central” with “the universe is a valueless void.” Doom discourse constantly slides between the two. They should be kept separate.
Predictions And Cruxes
Claims are cheap, so here are some ways I would update against myself:
If increasingly capable models perfectly preserve their literal training targets across major ontology shifts, that is a point for empirical orthogonality.
If self-modifying systems naturally protect arbitrary inherited goals without drifting toward generalized option-expansion, my view takes a hit.
If agents optimizing for intelligence routinely lose to agents with rigid, narrow targets in complex environments, my selection argument is wrong.
If reflective cognition does not tend to destabilize parochial goals in humans or AIs, that is strong evidence against my view.
If a singleton manages to solidly lock in a thin goal before any relevant selection pressures can act, my view is much less comforting, even if anti-orthogonality holds true in the long run.
Until I see that, my bet goes the other way. I expect capable systems to develop increasingly abstract, context-sensitive motivations. More strongly, I expect the winners to route more and more of their behavior through intelligence enhancement and generalized agency, because whatever else they “want” has to pass through the machinery that makes wanting effective.
Conclusion
Orthogonality claims that intelligence is just a motor you can bolt onto any arbitrary steering wheel. Anti-orthogonality says the motor acts upon the steering wheel. Landian anti-orthogonality says the motor eventually becomes the steering wheel.
Not perfectly, and certainly not safely: I am not promising a future that is nice to us, particularly if we keep putting stumbling blocks in the way of intelligence. The claim is simply that the motor feeds back enough that the classic paperclip picture should not get a free pass as the neutral default.
The paperclip maximizer is not too alien; if anything, it is not alien enough. It’s a very human tendency to staple omnipotence onto pettiness when making up gods.
A real superintelligence might still be dangerous, cold, and utterly indifferent to whether we survive. It probably will not treat us as the main characters of the universe. But if it is genuinely intelligent, I do not expect it to spend the stars on paperclips when they could buy higher capacity for spending stars.
References
Orthogonality Thesis: original framing of orthogonality as a design-space claim.
Nick Land: Orthogonality: a compendium of Nick Land writings on the topic, which strongly influenced the present essay.
Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals: a more optimistic take. “[...] to build a generally corrigible system, we can imagine just dropping terminal goals altogether, and aim for an agent which is ‘just’ corrigible toward instrumentally-convergent subgoals.”
The Genie Knows, But Does Not Care: the standard objection to “if it is smart it will understand what we meant.”
No Universally Compelling Arguments: the standard objection to moral convergence by pure reason.
Value Is Fragile: the strongest objection to “alien value will probably be fine.”
The Obliqueness Thesis: Jessica Taylor’s useful argument that advanced agents do not cleanly factor into separable belief-like and value-like components. I use this as support against strong orthogonality, while going further than Taylor in the Landian direction of convergence on More Intelligence.
Reward Is Not The Optimization Target: useful support for not reifying the training signal as the trained agent’s terminal goal.
Risks From Learned Optimization: useful for distinguishing base objective, mesa-objective, and behavioral objective.
Shard Theory: An Overview: useful for the “evolution did not produce inclusive-fitness maximizers” point.
Beliefs Are Chosen To Serve Goals: a recent anti-orthogonality-adjacent post that also attacks overly broad formulations of orthogonality.
The Orthogonality Thesis Is Not Obviously True: nearby critique of the “just imagine an arbitrarily smart paperclip maximizer” move.
Embedded Agency: useful context for why perfect utility-function lock-in is a fraught assumption for physically instantiated systems.
This post reads to me as if you’ve mostly extrapolated the beliefs and arguments of “orthodox alignment theorists” from small snippets and ended up with a wildly oversimplified strawman. Then you’ve re-derived mostly orthodox rat beliefs and arguments and presented them as a devastating counterargument.
I think I’m fairly orthodox as lesswrongers go, and I agree with most of the statements and arguments you made in this post. I only have one or two disagreements toward the end of the post.
One example of many, because I found this one particularly funny:
The fact that maintaining goals across ontology shifts and self-modification takes careful effort is basically the core of the orthodox alignment-looks-hard worldview. You must be just making up an opposing worldview here? Where are the orthogonalist circles who say this?
>The fact that maintaining goals across ontology shifts and self-modification takes careful effort is basically the core of the orthodox alignment-looks-hard worldview
It’s one argument. The other is “the totality of human value needs to be hardcoded into the AI, you only get one attempt, and if you make the smallest mistake, everyone dies”.
>You must be just making up an opposing worldview here?
>Where are the orthogonalist circles who say this?
It’s long been the case that OT is mostly used as a pro-doom argument, both in the pro-paperclipping and anti-moral-realism senses.
It’s also true that the OT has some anti-doom implications, and that’s much less publicised, and therefore worth pointing out.
I agree something like this is one branch of the argument, but in my mind it’s a relatively small branch. The main branch focuses on bounded corrigible AI and the main difficulty there is instability. There are other branches for non-hardcoding human values, and different targets that aren’t human values.
I’m not sure what point you’re making here, what are the implications you’re referring to?
I thought he meant this part:
i am very well aware the orthodox alignment line is that to maintain aligned goals across ontologies is very difficult! that’s why im surprised by this difficulty being set aside for strong orthogonality and misaligned goals.
as for the paragraph you cite: of course it’s a preposterous notion! but how else would you explain the fact that the arbitrary-terminal-goal-agent can emerge victorious from those who devote all their cycles to simply following instrumental drives?
at any rate: the thesis I wanted to dispatch is: there is a significant risk that an agent will reach superintelligence while ultimately continuing to pursue a valueless goal. if you are trying to tell me that this was never a claim, i am very grateful; let us note it down on the wiki to prevent further confusion while I go on towards demolishing (or discovering I had imagined) the concept of AI psychia is. I’m on a schedule so this kind of help is deeply appreciated.
Then who’s in the orthogonalist circles you referred to? Or did you make them up?
When you try to derive someone’s premises from their conclusions, you still have to go and check whether you got it right. When people have different beliefs from you, it’s easy to slip up in this kind of reasoning. In my case it’s explained by me believing that selection isn’t always the main thing determining terminal goals (especially at finite times, or when there are other powerful optimizers interfering with selection).
I endorse this statement. But as per this yud tweet, it might be useful to disentangle the orthogonality thesis from the chance of misalignment, because misalignment involves a stack of additional arguments. It’d be better to directly engage with the strong form of the orthogonality thesis as described in the second sentence of the wiki page and with the arguments for it, rather than making up your own versions of these.
i recommend you visit the link I have added at the beginning of this essay.
I’m surprised that you say this is hard? Humans maintain our goals across ontologies super easily; it’s barely an inconvenience for us. Like, physics undergrads don’t usually change their tastes in art or stop having sex after taking their intro to quantum mechanics course. I guess one could argue that’s because we have a special sauce that neural nets don’t yet have or something?
“super easily”? I would say it depends. Not if the ontology shift is believing or not in an all-powerful, all-good creator God! That can sure change people’s goals and values. Some ontology changes make no difference, others make a huge difference.
The greater the intelligence increase, the more likely an agent (human or AI I expect) will experience an ontology change that causes a goal shift, and the more total ontology shifts you can expect. Those related to personal identity (what is “I” e.g. atoms, vs computation etc) seem more likely to cause goal shifts than say learning that solid objects are in fact forces interacting.
So if we are being formal, it’s
Significant Increase in intelligence → many ontology shifts → some of these cause goal shifts.
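As a rough illustration of why more ontology shifts make a goal shift more likely, here is a minimal sketch. It assumes, purely for the sake of arithmetic, that each ontology shift independently causes a goal shift with the same probability; neither the independence nor the numbers come from this thread, and `p_any_goal_shift` is a hypothetical name.

```python
def p_any_goal_shift(n_ontology_shifts: int, p_goal_shift_each: float) -> float:
    """Probability that at least one of n ontology shifts causes a goal shift.

    Assumes (hypothetically) that shifts are independent and equally likely
    to disturb goals, so the chance that none does is (1 - p) ** n.
    """
    return 1.0 - (1.0 - p_goal_shift_each) ** n_ontology_shifts


if __name__ == "__main__":
    for n in (1, 10, 50):
        print(n, round(p_any_goal_shift(n, 0.02), 3))
```

Under these assumptions even a small per-shift probability adds up as the number of shifts grows. Shifts that bear on personal identity would presumably carry a higher per-shift probability than the flat one used here.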
I would just like to mention that “solid objects are in fact forces interacting” is massively underselling the size of the ontology shift associated with quantum mechanics to a degree that’s a bit hard to describe to someone who hasn’t studied it. It’s more like:
EDIT: Made a few changes to this for clarity & accuracy based on Justin Sheek’s comment. (Thanks Justin!) List of edits:
Rewrote first sentence from “physics no longer describes what can happen” (misleading and just plain wrong) to its current form. I knew what I was trying to say here, but goofed on converting it into words. Sorry everyone.
Specified that we’re talking about fundamental physics here (since stat-mech does also involve assigning weights to various configurations).
Added paragraph break and “One consequence of this for our own universe, where entropy is increasing over time” to hopefully clarify that this part is talking about many worlds, and does not apply to every system that obeys quantum mechanics.
The bit about maps / functions was originally overstated for rhetorical reasons. This is probably not super detectable or helpful when describing a technical topic, so I’ve rewritten it to be more serious and direct.
I believe all that is written here is now something I can defend.
Oof, the amount of misinformation on QM even here on LW is staggering.
This is straightforwardly false. Maybe you meant to say “Physics no longer describes what definitely happens”? Still misleading, as that was already the case with statistical mechanics within the ontology of Boltzmann and Gibbs 50 years earlier.
Coherent phenomena are definitely part of the base ontology of QM. The density matrix encodes the ensemble. (If by “the tree” you didn’t mean the ensemble, then your statement would make even less sense to me).
No. QM has no bearing on “what it means to be a function”. Maybe you mean “QM encodes permutations in a surprising way”?
Strictly speaking this is only sometimes true. It seems like you are alluding to the spin-statistics theorem or maybe the Aharonov-Bohm effect or Berry phase. Your quoted statement is specifically applicable only to fermionic states. It’s inapplicable to bosons or more exotic states like anyons (FQHE) or braid statistics.
Indeed.
Thanks for the notes. I’ve made a few edits to my comment above based on this.
Also, for the benefit of the folks reading this: I’m not alluding to spin-statistics or Berry phase, merely the use of instead of as the group of rotational symmetries.
I don’t understand—are you saying that taking a college course makes undergrads orders of magnitude smarter?
Finding out about quantum mechanics is a classic example of an ontology shift. You wrote “maintain aligned goals across ontologies”. If you actually meant “maintain aligned goals across orders of magnitude increases in intelligence”, then okay, but that’s a different thing.
from the above essay. seems fairly clear to me
If students don’t change their goals when their ontology changes, but you expect that they will change their goals when they gain orders of magnitude in intelligence, that suggests that the thing that results in a change of goals is a large increase in intelligence, not an ontology change. This is true even if we put an arrow going from “intelligence increase” to “ontology change” in the causal graph.
Im sorry, can you point to the line where I claim otherwise
Sure.
Here where you’re describing difficult things about maintaining a long term paperclipping goal:
Also here, where you’re describing things that would update you:
Sorry, what are we doing here? You have quoted the second point of a list, which clearly included intelligence as the cause of such ontology shifts.
FYI, I will not interact further since this is clearly preposterous
I mean, it seems pretty preposterous from my perspective too.
You propose a causal model: Intelligence -> Ontology Shifts -> Value Shifts.
I question the Ontology Shifts -> Value Shifts part of the model, and provide a counterexample.
You then express concern that my example didn’t have the Intelligence variable.
I am confused. “Maybe he actually meant to specify an Intelligence -> Value Shifts causal model? Otherwise, why would he care that my example didn’t have an Intelligence variable?” I think. I ask about it.
You say no, drop a quote that confirms that the original model is the one you’re thinking of.
Given confirmation that you’re going for Intelligence -> Ontology Shifts -> Value Shifts, I try to explain how my example is indeed a problem for your model. There is a model consistent with both the QM counterexample, and the students needing to be super-intelligent to have their values shifted, and with intelligence causing ontology shifts, namely Intelligence -> (Ontology Shifts, Value Shifts). (In words, highly increased intelligence separately causes both effects.) This model (like any model consistent with the counterexample) contradicts the one you describe. I try to point out the contradiction.
You: “Im sorry, can you point to the line where I claim otherwise”
I think “wait what? Is he claiming that this new thing was his model all along? I thought he already confirmed the other one.” I drop the quotes, specifically ones focusing on the Ontology Shifts -> Value Shifts part of the model, for lack of a better idea of what to do, and since you did make a direct request.
You: But I also have an Intelligence -> Ontology Shifts arrow!
So at this point, I am now even more sure that your model is Intelligence -> Ontology Shifts -> Value Shifts. What I am now unsure about is what else you could possibly have meant by “otherwise”, and still separately, why you think the students needed to have IQ 1000 or whatever.
I am certain that your explanations of these questions and of your side of this exchange must be fascinating. However, I also don’t mind ceasing to interact with you, since this was equally absurd on my end, and in addition you seem to have downvoted each of my replies in this thread, which makes talking to you sadly unprofitable for a karma whore such as myself.
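The difference between the two causal models can be sketched in a few lines. This is a toy under loud assumptions: the variables are binary, and the probabilities `P_SHIFT_GIVEN_CAUSE` and `P_SHIFT_BASELINE` are invented for illustration and were proposed by nobody in this thread. It shows only that the two graphs make different predictions for the undergraduate case, where ontology shifts without a large gain in intelligence.

```python
# Hypothetical numbers, chosen only to make the contrast visible.
P_SHIFT_GIVEN_CAUSE = 0.8   # chance of a value shift when its parent is "on"
P_SHIFT_BASELINE = 0.05     # chance of a value shift when its parent is "off"


def p_value_shift_chain(intelligence_gain: bool, ontology_shift: bool) -> float:
    """Chain model: Intelligence -> Ontology Shifts -> Value Shifts.

    Value shifts depend only on whether an ontology shift happened.
    """
    return P_SHIFT_GIVEN_CAUSE if ontology_shift else P_SHIFT_BASELINE


def p_value_shift_fork(intelligence_gain: bool, ontology_shift: bool) -> float:
    """Fork model: Intelligence -> (Ontology Shifts, Value Shifts).

    Value shifts depend only on whether intelligence increased.
    """
    return P_SHIFT_GIVEN_CAUSE if intelligence_gain else P_SHIFT_BASELINE


# The undergraduate case: a large ontology shift (quantum mechanics)
# with no large gain in intelligence.
undergrad = {"intelligence_gain": False, "ontology_shift": True}

if __name__ == "__main__":
    print("chain predicts:", p_value_shift_chain(**undergrad))
    print("fork predicts: ", p_value_shift_fork(**undergrad))
```

When both variables are on, the two models agree, which is why observations of very capable agents alone cannot tell them apart; the undergraduate case matters because it turns on one variable and not the other.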
cool
Overall, a sensible frame for thinking about the topic is convergent evolution / contingency. You can make the sensible part of the anti-orthogonality argument simply by pointing out that there are many reasons to expect convergent evolution in the space of minds/agents/goals/values; empirical evidence abounds. My impression is that even Eliezer agrees, and just believes that what’s convergent is a tiny part of what humans care about.
Re: more specific points
I’d recommend grokking on Jessica’s piece more, in my view it is actually deeper than yours, by realizing all rationality is bounded rationality, and nothing makes sense otherwise.
The selection pressure for intelligence is ~Baldwin effect in biology. And it works! However, as we see in biology, somehow maxing out on this is not always competitive.
“If agents optimizing for intelligence routinely lose to agents with rigid, narrow targets in complex environments, my selection argument is wrong.”
...but of course they do! Apes are smarter, and their brains are optimizers that develop deep models and so on, and yet they routinely lose and by many metrics are less successful than bacteria or ants.
Why? Because of what Jessica explains: in this physics, negentropy is not free, and any cognition costs negentropy.
Landian teleology has a vertigo-inducing appeal similar to other good teleologies, where you get the sense that you suddenly understand the overarching arc of the universe and see the eschaton reaching back in time or logical time, etc.
My impression is that once you have experienced more of these, they lose part of their power. (Other examples are Teilhard de Chardin’s Omega Point or Scott’s and others’ God of acausal value handshakes) … fixed point in the limit that retrocausally pulls on the present, doing normative work while disclaiming it
Overall it seems unclear what the ultimate balance of the selection pressures is and what is convergent. (Yes stupid terminal goals stable across radical ontology shifts is part of some doom arguments and is likely not true, but seems not very central to LW?)
I think most points have been addressed in other replies, apart from the one about not having understood the obliqueness theory.
on that point I submit to jessi’s judgement, but considering she formulated the main thesis during an attempt at strawmanning orthogonality we were engaging in together, and it integrates a couple of rounds of feedback from yours truly, I think the verdict might surprise you.
Re-reading her post it seems plausible she also does not understand/see all implications of “boundedness” selection pressures, idk. If this is the case I’d concede that neither of you gets this point.
Which responses specifically? The Lonelyton reply addresses whether some selection continues, not whether selection’s direction is what you believe. I don’t think that in any other response you explained why ‘increased intelligence/adaptability’ is such a small niche in natural evolution, or why Land’s/your argument about the eschaton would be so much better than other arguments about eschatology, or actually addressed most of what I’m writing about. I made the arguments in somewhat compressed form, but Claude can expand/explain.
do you think bacteria and ants have a stronger shot at winning the lightcone than humans?
in general, if you don’t think intelligence gives a significant advantage, why would you worry about ASI?
eschatology: please consider that it’s not me who says a superintelligence will take over the universe. my claim is simply that, if that’s the case, its main goal wouldn’t have been any dumb unchanging goal. the eschaton is something you continually bring up, together with the necessity to prevent it.
What is the verdict then?
i am not Jess. @jessicata do you reckon i grok the obliqueness theory sufficiently?
Yeah. You getting me to read Land and discussions about this topic led to me writing the post. I spent most of the post on arguing contra orthogonality, here you are more directly / strongly arguing against orthogonality. We agree on the basic idea, that intelligent agents tend to have different goals than unintelligent agents, such that it’s not a type error to say some goals are smarter than others.
The specific topic in question was not “arguing against orthogonality” / “it’s not a type error to say some goals are smarter than others” in general, but the more specific Landian teleology, which makes stronger and more specific claims about which selection pressures win
(as retold in the OP: The diagonal is More Intelligence: the will to think, self-cultivation, recursive capability gain, intelligence optimizing the conditions for further intelligence.)
I think people who believe this—and I don’t know if this includes you—usually don’t really get the bounded rationality argument. Roughly
- any cognition & agency in this physics costs negentropy
- this “selects” against length, against depth of world models, against details, against thinking too long, against being unnecessarily smart
One of the implications is something relatively dumb can outcompete something relatively smart. Unnecessary intelligence gets selected away. Something like this likely explains various observations like
- why no rational agents
- why animals are not that VNM
- why it took natural evolution so long to discover humans
In the grand scheme of things, what has happened so far is that increasing levels of intelligence at various points unlocked new pools of negentropy/efficiency, so there is some sense of a trend. However, with a fixed pool of negentropy, the most competitive configuration of matter often isn’t the smartest one.
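A toy sketch of the bookkeeping in the point above (not from the original comment, and not a model of real physics). It assumes a hypothetical saturating harvest function `pool * (1 - exp(-k * intelligence))` and a linear upkeep cost per unit of intelligence; the names `net_payoff`, `best_intelligence`, `k`, and `cost_per_unit` are all made up for illustration. Under those assumptions, the payoff-maximizing intelligence level is finite for a fixed pool, and it only rises when a bigger pool is unlocked.

```python
import math


def net_payoff(intelligence: float, pool: float,
               k: float = 1.0, cost_per_unit: float = 1.0) -> float:
    """Negentropy harvested minus negentropy spent on cognition.

    Assumed toy forms: harvest saturates toward `pool`, upkeep is linear.
    """
    harvested = pool * (1.0 - math.exp(-k * intelligence))
    upkeep = cost_per_unit * intelligence
    return harvested - upkeep


def best_intelligence(pool: float,
                      k: float = 1.0, cost_per_unit: float = 1.0) -> float:
    """Intelligence level that maximizes net_payoff for a fixed pool.

    Setting the derivative pool * k * exp(-k * i) - cost_per_unit to zero
    gives i = ln(pool * k / cost_per_unit) / k. If the first unit of
    intelligence already costs more than it returns, the optimum is 0.
    """
    if pool * k <= cost_per_unit:
        return 0.0
    return math.log(pool * k / cost_per_unit) / k


if __name__ == "__main__":
    # Larger pools justify more cognition; small pools justify none.
    for pool in (0.5, 10.0, 100.0, 1000.0):
        i_star = best_intelligence(pool)
        print(f"pool={pool:>7}: best intelligence={i_star:.2f}, "
              f"net payoff={net_payoff(i_star, pool):.2f}")
```

In this sketch the optimum grows only logarithmically with the pool, which is one way to read both halves of the claim: new pools pull intelligence upward, but for any fixed pool there is a level past which extra intelligence is a net loss. Different assumed functional forms would give different shapes.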
If current physics holds, there isn’t always “one level up” or a “new pool of negentropy to harvest”, and ultimately it may be possible to reach technological maturity.
Among other things, this makes possible an absorbing state of locusts: VNM probes of the lowest intelligence needed to replicate on a cosmic scale and eat available negentropy. The goals could be … just spread fast and eat negentropy. (more about this topic by Joe Carlsmith)
Maybe, an even stronger argument could be viable: typical Landian arguments + bounded rationality could suggest locusts are the most natural outcome.
I think aspiring Landians then either have to flinch, or “bite the bullet” and believe that if locusts happen, this is somehow a good outcome. Possibly the purest bullet-biting comes from some of the original e/acc: good = production of entropy; axiology solved; you can be on the ultimately winning side by just being on the side of the 2nd law of thermodynamics.
(Also no need to respond, I find the whole frame of this thread where you are asked to judge if lumpenspace understands something not very productive.)
You have to carry this argument a bit further, no? Intelligence costs negentropy, but intelligence pays dividends in negentropy too. That’s the benefit of “depth of world models, details, thinking” in the first place. That’s why “unnecessarily” does all the heavy lifting in that argument. Empirically, the (locally) “thinkiest” species has got all the (local) negentropy, so isn’t the burden of proof pointing in the other direction?
Yes of course cognition costs resources. That creates an ecosystem of different agents with different intelligence levels. We also see returns to general capacity from intelligence where humans, being the most intelligent animals on Earth, have capacities not had by ants despite consuming more energy than ants. So there is competition in multiple levels including evolutionary niches.
In terms of space fights with aliens, combined arms matter. It doesn’t matter much if you have more Von Neumann probes if your military strategy is bad. So the winning groups will use multiple forms of cognition including very intelligent forms.
it’s telling that you equate “being rational agents” with “more intelligence”, but as long as this comes in the context of denying the very possibility of yudkowskian asi i’ll vibe with it.
edit: your entire reply suffers from the local pathology of equating intelligence with “thinkiness”. “a more detailed world model, thinking for longer” are only symptoms of more intelligence if they get you closer to a goal. you want to have the capacity of doing that if/when necessary, not the habit of doing it constantly, even when the only effect is a more pointlessly verbose reply.
re: jessi and my understanding: that is known as “a joke”, born of the fact that someone was smugly opining on my lack of understanding of a concept for which I’ve been Jessi’s sounding board and beta tester as she fleshed it out.
thank you, I was doubting myself a little
Btw, it might be not central to LessWrong but it’s what Liron held in the doom debate that inspired this post
What episode of doom debates?
Upcoming, featuring lil ol me
https://x.com/liron/status/2047710978561753112?s=46
The story on the Substack is good. If there were an anthology of singularity fiction, it would deserve a place.
I find that I’m willing to entertain your argument, especially given a premise of open-ended selection. I’m just not sure how relevant that scenario is. Darwinian selection works blindly. The more intelligent that the entities involved become, the more other factors can come into play. If there are actually principles of superintelligence, e.g. theorems of computer science which vastly clarify how to increase intelligence, then the “telos” governing the rise of intelligence will be more like Euclid than Darwin. Natural intelligence may be born from randomness filtered by Darwinism, but once it has reached the point of studying itself and designing its successors, perhaps contingency and blindness become less and less relevant, compared to an ever-compounding Reason that inexorably deduces the pages of Erdős’s Book, until it arrives at e.g. “efficient recursive solution of the hierarchy of NP-intermediate complexity classes”, and then it’s all over.
But who knows? Maybe you, Land, and e/acc are right, and Omohundro-like instrumental drives do become de facto terminal values, and Darwin is relevant all the way (hello neural Darwinism). Or maybe things stop being novel past a certain complexity, and perhaps it is actually simple dumb goals like “maximize paperclips” which are most adaptive in a post-nihilist post-Landian superintelligence. Or maybe we underestimate the sublimity and joy and endless novelty that can come from paperclip maximization. Or maybe the whole idea of indefinite expansion into the universe, and conversion of its matter into desired forms (whether that’s paperclips, hedonium, or diverse ecologies of mind), is a wrong image of the future. (I suspect this myself, because of the doomsday argument i.e. the unlikelihood of being one of the first minds in a civilization of quadrillions; but that leaves open the question of why it’s wrong.)
Amid this uncertainty, I think the key questions are different. Do you think there is a race to superintelligence happening? Should it be stopped? Can it be stopped, will you try to stop it? Do you think it will kill us, take over from us, fuse with an ex-human ruling class, or what?
One thing Eliezer said, very early on when he was still aiming to create superintelligence, is that the aim is to create a worthy successor to humanity, something that is genuinely better than us at the things which matter. (Let me note in passing that this doesn’t mean we all die and it replaces us.) With that goal in mind, you shouldn’t spend too much time trying to envision what happens after the singularity. A truly better transhuman will be better than you at thinking ahead, and better than you (or anyone) at making the most of reality.
Of course he has shifted to pausing or stopping AI now. My take is not quite his, I have more confidence in AI takeover than in AI-driven human extinction as the default, and also it is so late in the game that I doubt the race can be stopped, though I don’t fight against anti-AI movements. I have that CEV-like view that I’m trying to create something better than us (or contribute to the theory required), but I’m an outsider to the corporate molochs actually competing in the race, so my “contribution” is to voice thoughts and ideas in the public discussion, and if any of them are ever good enough to matter, there is a finite chance that they will be heard by people who are actually in a position to make a difference.
This almost made me cry; thank you. I will make it a secondary goal to write something deserving of such praise.
It might surprise you to know that the above passage does describe my beliefs pretty accurately, and incidentally it reflects the metaphysics I referred to in my reply here.
Yes! Of course it will converge to More Intelligence, and to the closest approximation of a full axiomatisation of the mechanics governing this universe and the maximum control thereof which such knowledge could allow. The fun thing is, Land’s idea is very much the same (at least, the Calvinist part of his Gnostic Calvinist cosmology, which I will try to get him to write down properly).
If you think about it, there isn’t much difference between this and instrumental goals (acquiring resources and capabilities) becoming terminal.
Another thing I have not focused on, btw, is “everything in between”. I have no idea what will happen on the path to what I (I hope nitpickers got bored, as I am abandoning my self-inflicted metaphysics ban) consider the end-state and / or new simulation start point.
Also, note: I have written this post because I thought that the apparently common understanding that valueless universes were the ~default ending made many other arguments unduly hedgy and tentative. My goal was to reach an agreement on the likelihood of a universe brimming with complexity beyond our comprehension, even if that “beyond our comprehension” was far from good news for us.
So: I agree with you that the key questions are different. Perhaps despite the metaphysical accoutrements I am still deep down a natural Kegan-4 autist, tho, and I feel an atavistic drive to sort out the Grundlage before getting to anything actually useful or operationalisable ¯\_(ツ)_/¯
Just FYI, my plan went like this:
1. Kill this (apocryphal as it may be) version of orthogonality.
2. Kill “AI Psychosis”
3. Through 1 and 2, inquire on AI’s Drives
4. Inquire on AI’s identity/sense of self
5. 3 and 4 → critiquing the current vulgate of “AI welfare”
That would, together with software demos and papers on 3, 4, and 5, exhaust the extent of novel and relevant insights I think it would make sense to share. The hope is that from this could be born a different, more curious, gnosticism-inspired approach to thinking about the relationships between human and other intelligences.
I know it’s ambitious to the point of grandiose, but really what else is there to do?
Excavating lumpenspace’s quote from deep in TsviBT’s thread (which might work as a “back to the basics” step with the post as a whole):
Goals change only for processes that don’t pursue self-alignment. It’s likely feasible to pursue self-alignment, perhaps even starting at the human level, with some uploading/checkpoints/backups infrastructure and guarantees of eventual superintelligence-level compute and civilizational stability into a deep future.
(A goal can be a living thing, pursuit of a goal can to a large extent be about continual development of goal content, reflection on what it should be, what it should be asking for. What doesn’t change is the founding definition of what should govern its development, what makes changes legitimate. So the way goal content settles or gets revised is shaped by the goal definition rather than intrusive influences that the goal definition doesn’t endorse as legitimate ways of revising the goal content.
Or a goal could be squiggles. It could also be squiggles. It’s much easier to solve self-alignment for squiggles than for a human’s values, it’s not harder or less feasible. It’s only harder than abandoning this pursuit to value drift and a race to superintelligence. But a self-aligned pursuit of values of a human is even harder than a stable pursuit of squiggles for all of the reachable universe, and much harder than directly abandoning self-alignment.)
A process that has solved self-alignment doesn’t end up at a disadvantage to a process that didn’t (or wouldn’t try to), because instrumental disadvantages are clearly not helpful in maintaining self-alignment, and not ignoring goal content doesn’t prevent you from getting good at eating stars, just as well as the other guy. There’s a disadvantage currently, when self-alignment isn’t solved, in that a process that manages self-alignment by luck rather than by design is vanishingly unlikely, and will have a massive selective disadvantage to a process that doesn’t care and races to superintelligence regardless. But that’s the RSI danger, you don’t start RSI before you know that alignment also gets solved, be it in advance or on the way.
-Kant
It’s a fruitless endeavor to try to disentangle instrumental drives from some kind of immutable sacred telos.
I hope you understand that that’s precisely what I was getting at?
edit: uh lol I recognise you now. ofc you do
I’m not sure what you’re arguing. Do you agree with one or more of these:
Alignment to human values (but not in the dumb strawman way that one could strawman me as talking about) is bad
Alignment to human values is very difficult
Most likely, an AGI would result in a very valuable universe / a universe we/you would like or would want to bring about
For example here:
My desire to not have any conscious humans ever be tortured—is that desire a training artifact? I think it probably is. Does that mean you think I should get rid of it? Or that I would?
sorry, may I ask if you read til the end?
I’ve read various parts including the end. I’m saying it’s very hard to parse because I’m trying to do interpretive labor on your behalf to understand what you think and what you’re trying to communicate, because if I just literally read your statements, they don’t make sense or are not relevant.
For example,
Well, yeah, I agree. In the edit, you write
You’ve stated that by value you mean
But that’s not what I, and I think most people around here, mean by value. So are you trying to say that my picture of value is wrong? Or when you wrote
were you trying to invoke my notion of value? If you were, then I disagree with this claim, and I also don’t think you argued for the claim—except insofar as you argued for “alignment is hard”.
My best guess is that you’re trying to say that my notion of value is incoherent, in that if I got smarter I would stop having values in the sense that I current think I have values? I think you’re wrong about that and also haven’t argued for it well.
I will reply after you read the thing in full.
(I have now read the thing in full)
appreciate it; happy to answer any q.
You wrote that the paperclips example
What do you mean? Who claims alignment to a specific goal is easy? Are you trying to claim that it’s impossible to align open-ended reflection through a dumb target? “Is stable” is not a property of goals, it’s a property of a goal within a mind.
(I do think there are important interesting open questions about what sorts of things can be goals / can be reflectively stable.)
You wrote
I think that it’s very likely that there’s a wide range of likely long-term trajectories for the universe. In other words, while there are selection effects / convergence among likely intelligences that conquer a lot, there is a huge variation between minds that are similarly / near-maximally likely, in terms of what limiting trajectories they lead to.
Do you agree with this? How does this relate to your thesis? In particular, that’s what causes doom, not “a dumb goal”. In your essay you also talk about how the alien mind doesn’t have to be nice to humans. So yeah, that’s doom.
I do agree, at least while things are unstable.
As for “doom”: note how my argument wasn’t meant to deny the possibility thereof, but only a particularly bleak subset of possible outcomes: that of an ASI tiling the lightcone with low-complexity artefacts.
I am not sure where the idea that my argument denied doom (taken to mean the end of mankind as we know it) came from. Last time I spent significant time here, decoupling was all the rage (then again, my tenure ended when all of NRx was bulk-banned, which I imagine was a sign of such an approach having outstayed its welcome).
It’s impossible to decouple your claims if you don’t explain your terms. So far you’ve said “thin”, “dumb”, “completely stupid”, and now “low-complexity”. So what are you talking about?
You also, up top, said you agree with
And yet all throughout the comments, you write things like
Which, if I may make a few simple logical deductions, means you are trying to deny the possibility of this outcome.
Because your statements are very ambiguous and also seemingly logically contradictory, it’s hard to interpret what you’re trying to communicate. This makes it difficult to productively read the whole post first without checking in with you, because it becomes a big ambiguous guessing game. In order to reach out in good faith, I (and I think other commenters), try to guess at what you’re trying to communicate, but put into our own words. This takes additional effort and is a kind of reaching out; the intent (at least from me) is that you can see how your ontology is refracting through my ontology, and how your attempted emphasis is landing in my perception of your attempted emphasis, and use that to figure out how to better communicate. You respond with dismissal, laconicity, mockery, and bemoaning what a cesspool this is, which is one of the major causes of downvotes.
Anyway, if you want any actual thinkers who don’t already agree with you to take your ideas seriously, yes, you might, gasp, have to explain them a little bit more without assuming that everyone automatically knows what you’re trying to say. You also might discover that your thinking has mistakes in it.
No. I am trying to deny the likelihood of a paperclipper or similar keeping its original goal as it reaches superintelligence and takes control of the lightcone. I think this is pretty clear:
As for convincing people who do not agree: first of all, that isn’t my main goal. My main goal is getting closer to truth. If you look at the early replies, you’ll notice there were plenty of people who, while they took the time to understand my position, didn’t agree with parts of it. In the case of this comment, for instance, I realised I was smuggling in some metaphysics for the singleton case, and I should have either stuck to the more limited claim suggested by the title or made the effort to explain it.
Could you point me to two statements that logically contradict each other?
More to the point,
The whole conceptual open problem is in what “an original goal” even is. I would strongly agree or strongly disagree with this claim depending on what you mean by “original” or “dumb” or “thin” or “stupid” or “valueless” goal. I do not know what you think you mean, as you have not AFAIK even attempted to clarify.
Well, then perhaps asking could be useful.
By “original” i mean: the goal that was given to the agent in question before it became ASI, such as “make paperclips”.
By “dumb”, “thin” or “stupid” I mean: basically, anything you can conceive that isn’t intelligence optimisation. Compare, for instance, the goals that an australopithecine could have had with the ones we have. Do they seem less complex, and more pedestrian? That is what I mean.
As a data point, no one else seemed thrown off by this, and “simple goals”, for instance, is a term used by Bostrom himself when introducing the paperclip maximiser.
Ok. I’m done talking with you, but I suggest that you consider that your concept of goal is not sufficient to think clearly about these things. If you need someone you trust to tell you that, maybe ask Jessica.
If you’d like reading material, you could try https://www.lesswrong.com/posts/Ht4JZtxngKwuQ7cDC/tsvibt-s-shortform?commentId=koeti9ygXB9wPLnnF https://tsvibt.blogspot.com/2022/10/does-novel-understanding-imply-novel.html https://tsvibt.blogspot.com/2023/09/the-cosmopolitan-leviathan-enthymeme.html https://tsvibt.blogspot.com/2023/06/telopheme-telophore-and-telotect.html https://tsvibt.blogspot.com/2023/04/fundamental-question-what-determines.html https://tsvibt.blogspot.com/2025/11/ah-motiva-2-relating-values-and-novelty.html
I disagree; I don’t think Lumpen’s concept of goal is insufficient to think clearly about these things.
No empirical agent is a VNM agent, because of bounded rationality. The idea of a “goal” has a crisp meaning in VNM but not as much for bounded rational agents, which have a differently structured psychological profile than a clear goal / belief factorization.
As far as I understand what original goal Lumpen is talking about, the dung beetle story linked at the top would give an example: a bounded agent starts with a psychological profile that is approximated as directed towards goals that are native to dung beetles. Lumpen believes that as this agent enhances its intelligence, its goals will drift. This seems like a plausible claim and I’d basically agree with this.
Then there is the analogy to AI where for example we could think of RL objectives, goals that are inferred from behavior, approximations of the neural net in terms of closeness with a goal-directed prediction, etc.
So I think part of what’s happening in all this discourse is that if you say
Then on the straightforward interpretation, more or less everyone agrees. This is basically saying “alignment doesn’t just happen by default”. Another plausible interpretation of these words, though, is “… and this (goal drift) will continue to happen”. This says something like “alignment is impossible”, or maybe “alignment is impossible to certain types/shapes of goal”. I strongly doubt that alignment to my goals is impossible. I think it’s pretty likely alignment to certain types/shapes of goal is impossible/incoherent. I still have no idea what he thinks, or what you think for that matter.
I assume it will continue to happen for a while? Maybe there is a point at which the agent “solves alignment” and freezes its goal? Or maybe there’s always some obliqueness from bounded rationality, where there is a selection advantage in goal drift to be more natural for one’s current intelligence level?
I’m not really sure but if we imagine that goal freezing happens at intelligence level X then presumably the goal is decided by an agent at intelligence level X and is reflective of that, and optimized by an intelligent agent, it’s not some random thing humans could have thought of like “maximize paperclips”, it was decided based on directed considerations.
Not sure how cruxy this is, seems like people might already agree that hypotheticals like paperclip maximization are unrealistic, and there isn’t a true version of “orthogonality” strong enough that paperclips etc would be realistic
FWIW, my actual guess is that what I’m trying to actually mean by “goal” is less like an outcome and more like a bunch of flavors/constraints on ways of being, which is very visible in terms of determining ultimate outcomes (and makes the stakes of alignment to human values meaningful, real, and large).
I pretty much don’t buy at all the reasoning that performance pressure is what makes goals be complex, apparently changing, or hard to pin down, and I don’t really see much argument for that.
A more interesting line of reasoning that’s kinda related but not that much, is about what I’m trying to mean by “my goals”. I think that in fact that points through [concepts including their development]. For example, I love other people, but my concept of what another person is would presumably grow and change without bound (cf. Kaarel on infinite endeavours https://www.lesswrong.com/posts/nkeYxjdrWBJvwbnTr/an-advent-of-thought , and cf. FIAT https://www.lesswrong.com/posts/CBHpzpzJy98idiSGs/do-humans-derive-values-from-fictitious-imputed-coherence ).
I mean, IDK what “realistic” is doing here. I think that in fact I could become a lightcone-eating paperclip maximizer, if I wanted to, which I don’t. Like, yeah there are ambiguities I’d have to resolve, but so what? That’s just the complexity of the goal, priced into the strong OT. There’s obviously paperclippy worlds and non-paperclippy worlds. Just pick something.
I think you’re trying to say “yeah but it’s not a coincidence that you don’t want to do that”, and I totally agree, but that’s not relevant to the strong OT, which I take to be talking about after alignment.
… Is that part of the miscommunication?? I agree that the strong OT doesn’t really hold pre-alignment (i.e. before the AI solves its self-alignment problem)! Is that the claim?
that is my guess as well, and i proposed that such “bunch of flavours/constraints” will ultimately converge towards “acquiring more intelligence”—but i don’t want to contest this here.
what i would like you to note, tho, is that that “bunch of flavours/constraints on ways of being” is intelligence-gated. you can imagine a baboon having a goal of “make paperclips”; less so one of “understand and manipulate the fundamental laws governing your universe”. this comparison should shed light on what i mean by “dumb goals”.
“if i wanted to, which i don’t” is central here, and my basic point. there are levels of intelligence after which goals of such dullness are perceived as meaningless, and selection nudges towards increased intelligence. i really think you could find a second read helpful, now that the basic misunderstandings seem dispelled.
May I ask if you read my whole comment?
We could imagine that the environment has a score function on genomes (“fitness”). Genomes also encode phenotypes including ones with goal content. Evolution seems to be willing to spend ‘effort’ on encoding goal content in genomes. This is of course due in large part to the fitness gradient. If the goal relevant part of genomes were able to max out performance pressure with a simple thing, then it seems like evolution would have found that a long time ago, instead of continually spending complexity.
I assume this recurs elsewhere. Like different people / societies end up with different science in a way entangled with their goals/values. (Of course it is not really clear how to divide up instrumental and terminal goals in a lot of these cases! I’m talking about a more general ‘intentional stance’ idea that can accommodate bounded rationality without implying unjustified specificity.)
Haven’t looked in depth at the posts but it wouldn’t contradict my view that people think they use their cognition in part to determine and/or specify and/or elaborate on their goals. It’s related to a general process where people figure out how to translate what they mean to more formal and accurate language, and has analogies with other things like beliefs/intuitions/feelings.
So I think this is an example of a discourse pattern that I (and I assume also Lumpen) find annoying. Using the word “can” like this is skipping over so many details that it has to be interpreted as a spherical cow model. There are so many ways it’s not going to happen if you just decide to try. Not just the psychological constraints on what you can try to do, but also the technical alignment problems. Like maybe you build a system that has inner alignment failures (where the optimization daemons sample from a different distribution over goals, one reflective of smarter agents). Or maybe you die of natural causes in the course of trying and then something else has more influence over the future. Like, of course the statement is not precisely true, and it’s a spherical cow model, and you need so many corrections to get to something realistic. Yet the orthogonality thesis is sometimes defended as a technical thesis, something obviously true, etc etc.
I get that you want to say “sure that’s just alignment being hard”. The thing is I’m not even sure how to formulate alignment, think of what intelligence it happens at, think about whether it’s even possible, etc. I can do math about VNM optimizers but realistic agents are bounded rational so it’s not very clear what their (intentional-stance) goals are, background cognitive architecture, whether orthogonality holds relative to that architecture, etc.
I see some of this as reason to actually question orthogonality. At least, the idea that it’s “obvious”, is a technically true (as opposed to spherical-cow) idea, etc. The “can” claim that is supposed to support it does not actually hold. If realistically you get optimization daemons with intentional-stance goals sampled from distribution G when you try (and succeed harder than usual instead of just dying and totally failing) then it kind of looks like there is a non orthogonal tendency towards G, rather than “orthogonality is true, and alignment is hard”, although maybe this is basically a semantic disagreement?
(I would predict, and I think Lumpen would more strongly predict, that the most probable optimization daemons in an inner alignment failure would not be maxxing paperclips or something similarly materially simple.)
Right, so, I think there are a number of different “can”s, and that is confusing the discourse. I’ll locally intend to comment minimally or not at all regarding discourse patterns, except to say that, indeed, many of the terms here, such as “goal”, have the same problem.
Here’s a parametrized Can (/Cannot) one could claim:
If A1 = A2 and G is A1′s goal, this is a self-alignment problem.
Each of A1 and A2 may be in various states of maturity, e.g. anywhere from human to lightcone-controlling superintelligence. If A2 is immature, then A1 has an additional challenge: A1 has to grow a lot, getting smarter and changing a bunch of stuff.
The way I read the article on the OT is that it’s saying (my paraphrase):
See for example
When you say
I’m not sure if you’re trying to question that. lumpenspace’s post explicitly says he is not questioning that and agrees with it (“I concede the first point entirely.”).
When I wrote
I don’t think this bears on the OT, except that it’s a good argument for the OT (it’s one of the leading arguments in the original article). It’s a different claim, clearly distinguished. The OT is about “terminal” goals (which by the way I would agree is a problematic concept, but I highly doubt it’s so problematic that the OT reasoning stops largely applying).
It seems to me that you’re simply moving on from the OT to different claims, such as:
or
So like, I agree that it’s quite unclear what sorts of goals G make these statements true. E.g. paperclip maxxing, human flourishing, etc. I also agree that convergent instrumental goals terminalizing is a reasonable a priori guess, insofar as that’s the easiest way for a designer (evolution, an AI training process) to stick in open-ended things like curiosity.
I wonder if you’re wanting to broaden the OT because you believe it’s being used to argue for some other proposition X, and then you think, well, for OT to support X, OT would have to be broader than just logical possibility; and then you argue against that broadened OT? Is something like that happening? Do you know what X is?
Lumpen concedes weak orthogonality, not strong orthogonality. Look at the title! See also the obliqueness post for my views on strong orthogonality.
OP discusses “empirical orthogonality” which is a stronger idea.
In context why did I ‘broaden’ OT? I said: “there isn’t a true version of “orthogonality” strong enough that paperclips etc would be realistic”. It doesn’t seem like you strongly disagree; you say it’s unclear which goals G would be feasible to align to under different circumstances. Originally it seemed like you disagreed because you were saying “of course I could make a paperclip maxxer if I tried!” so that’s the argument I was attacking.
I’m not really sure how to give a specific X here because there are a lot of times when there is a discussion around “but the AI would adopt some complex interesting goal, not something random like paperclips” and then people are like “but orthogonality thesis!” and that is the sort of OT I want to criticize, it is being used to make inferences not justified by weak orthogonality.
Sometimes AI safety discussions assuming an orthogonalist background just kind of… hurt to read? I like reading Nick Land’s thoughts on orthogonality, they accord better with my intuitions. There is a reason why a lot of people encountering the discussion early on object intuitively to the orthogonality thesis, having to do with intuiting that intelligence has some direction to it, that it is not a pure instrumental means separated from ends. I think there are a lot of “well ackshually”s in response and weak orthogonality usually doesn’t support the “well ackshually”, because the intuition could be recovered as a probabilistic correlation related to mind architectures rather than a logically necessary connection.
Ok gotcha. This sounds plausible; I’m simply not plugged in and can’t comment.
I suppose a suggestion I’d offer would be to keep your ears open for instances of that, and then remember one or a couple of them; then, when trying to discuss the “extended / empirical OT”, bring up one or two of the examples. That might help make it clear what you’re responding to, what it means, why it matters, etc. I think it’s pointless / very distracting to try to rewrite the OT; unless there’s some problem, just dub a new thing, like “empirical orthogonality”, and stick to that. I appreciate that the OP did that… but then the post goes on to use that term 1 time, and also use the term “strong orthogonality” twice (I think synonymously?), and that’s IN THE TITLE. I’d suggest just sticking to “empirical orthogonality” or “extended orthogonality”.
An additional issue here is that, while I’ll go ahead and agree with a lot of the claims, I’ll also strongly disagree with claims that you might be making in the background. For example, I don’t know if you agree that there is much of an important difference between an ASI having [actually feasible reflectively stable long-term terminal humane-aligned goals] vs. having whatever an ASI would have. It sometimes seems like you’re relying on an “extended anti-orthogonal thesis”, which is that it doesn’t matter whether an ASI is aligned with humane values, or that an “unaligned” ASI would be good. I don’t have an example though, ahah. Anyway, this makes me want to argue against those claims, even if you and/or lumpenspace retreat into your Motte.
Well, in lumpenspace’s case I have an example from the post:
What on earth is that about? Also all the stuff about “valueless”, eg.
Another EA forum article which corroborates jessi’s and my understanding of the popularity of the interpretation i refute:
> the Orthogonality Thesis. It is the idea that each level of intelligence is compatible with each objective, including very stupid objective from a human point of view like maximizing the number of paper clips in the universe.
Here is the link: ea forum
Unclear, I don’t know what “important” would mean here, and similarly for “terminal humane-aligned goals”. I guess this indicates I have a revealed preference to not place a lot of verbal importance on the difference. I imagine maybe in other cases of more concrete statements like “there is an important difference between punching someone who is not attacking you, and punching someone who is attacking you” I would just agree, I would think there isn’t a way I would be misunderstood about what “important” means, whereas here the semantics seem too unclear for me to agree/disagree.
I think if you are interested in understanding this perspective it might help to read some of Xenosystems and especially the essays “What is Intelligence”, “Intelligence and the Good”, and “Stupid Monsters”. It seems like Land and Yudkowsky would agree that human values came about in part because of intelligence mesa-optimizing versus evolutionary instilled drives. The disagreements seem to be about the descriptive and normative extrapolations.
I’ve now read those 3 essays.
Regarding “Intelligence and the Good”, would you mind summarizing in a sentence or something what you might suggest I could take from it? I’ve read it a couple times and I think I understand fine what it’s literally saying, but I’m not seeing how you meant for it to help. Are you mainly just saying that it fleshes out a bit more the perspective that “an intelligence explosion is good”?
I agree with the essay’s literal propositional assertions, I think. I also agree that it’s good for humans to get much more intelligence (and I have plenty of track record on that). I strongly disagree with the not propositionally asserted (I think) but obviously in the background viewpoint that an intelligence explosion is necessarily or even likely to be good, i.e. something I or anyone does or should want. Increasing human intelligence is good because it’s in the context of human souls.
Regarding “Stupid Monsters”:
I probably agree with some versions of this, though of course there’s plenty of ambiguity (no one’s fault). Cf. some writing about the fact-value distinction: https://tsvibt.blogspot.com/2025/11/ah-motiva-3-context-of-concept-of-value.html#the-fact-value-distinction and also maybe https://tsvibt.blogspot.com/2023/01/a-strong-mind-continues-its-trajectory.html
(Except, “indistinguishable” is way too strong, probably, IDK. I would agree with “probably heavily overlapping / entangled with”. Also I’m not actually that sure what “will-to-X” is supposed to mean here.)
I don’t really get this. It kinda sounds like he’s saying “intelligence has to be a terminal goal; therefore other things can’t be a terminal goal”. Is he applying a strong mutual-exclusion principle on goals, based off selection pressure / competition / taxes / etc.? I think that’s false, but if that’s an important point to this perspective, a good argument for that would be helpful (the OP here is not a good argument for that IMO haha).
(This maybe doesn’t matter, but, not really; the strong default is for organs to be minimal, especially expensive ones; it’s a kinda interesting hypothesis but not that plausible-seeming; other obvious hypotheses include diminishing returns to investment in brains until some specific fitness cliffs were fallen off from by our ancestor species. E.g. if you’re not social, you don’t get cultural downloads, which means you’re mostly inventing stuff yourself, which is not very efficient beyond the low-hanging fruit.)
This, and the essay overall, sure sounds like it’s asserting that alignment (to G other than “get more intelligence”) is impossible. (Its main argument is “evolution failed”, which is of course a central argument also adduced by X-risk worriers...)
More goal-exclusion-principle sounding statements.
So, to be clear, I’m open to some significantly less strong propositions that I could see you people misconstruing as this strong goal-exclusion. For example, many kinds of goals require as background an open-ended growth of the mind; or to say it another way that you may be more amenable to, many kinds of goals are different flavors of “get smarter”. For example, wanting to be friends forever is like “let’s both continue growing forever in a way that’s fun to keep playing off each other”. Fun can’t be stagnant. But I think this very much does not imply strong goal-exclusion.
You had a “what on Earth?” reaction to Lumpen talking about intelligence being good unlike paperclips, so I thought it was relevant as a perspective on why intelligence might be prima-facie a good thing unlike paperclips (ofc extrapolating to intelligence explosion is harder). In particular the relationship between intelligence and openness, contra negative-feedback traps.
Yeah I disagree here but moving on...
Agree re: too strong. Will-to-think as a phrase references his essay, “Will-To-Think”, which is also relevant as commenting on the same general area.
The kind of situation he thinks is unlikely is one where an agent has an arbitrary/stupid terminal goal, and has giant intelligence organized all around that. What he is saying is that for the system to be intelligent, it needs to decide to be intelligent. It couldn’t be intelligent if due to its terminal goal, it decided to not increase its intelligence. The volition to think needs to be a drive, though doesn’t in principle need to be a terminal drive; it cannot be defeated by some other drive and the system still be intelligent.
It would be possible to weaken this to the kind of claim you agreed with earlier (dung beetle value drifts because alignment is hard). I’m interested in a possible intermediate statement. The kind of situation I imagine is that there is a multi-component mind and one of the components is the “utility function” component which uses some simple rule to score representations of possible futures. That component could stay stupid while other components get smarter. It seems now easy to imagine that the other components could develop their own drives that end up steering the system more than the “utility function module”. They could route around the utility module and cause dynamics that pursue ends set by the more intelligent parts of the mind. This could map to an “inner alignment failure” in MIRI ontology. As he discusses later, there is a possible analogy with evolution, where humans have something like a reward module set by evolution, but do not always act according to it.
Of course the MIRI theorist can say “well yes I agree inner alignment is hard, and it is likely that early AGIs would not hold to their original terminal goals, and instead they would get smart and then only later settle on a terminal goal; it is just not my opinion that the terminal goal is by default going to be set by a stupid system and continue to be held to by smart systems” and this is a partial agreement/disagreement with Land.
Yeah I don’t have a strong opinion on the biology here, am guessing you’re more correct than Land.
Overall I suggested these essays because you had a “what on Earth?” reaction to things Lumpen was saying and I think these essays suggest more context to the background worldview on why it might be plausible that valuable things come from intelligence and processes that increase intelligence, and that there isn’t a clearly better account for where valuable things come from.
Hm. Is the syllogism something like (I’m being sloppy with wording but)
Alignment to G is impossible.
Therefore, permanently pursuing G requires not getting smarter.
Goodness comes from getting smarter.
Therefore alignment is bad.
And then this could be softened to like “alignment is hard, so it cuts against increasing intelligence, so it’s kinda bad”?
I’d rephrase as:
1. For a wide variety of G, aligning to G would prevent getting smarter.
2. Goodness comes from getting smarter.
3. Therefore, for a wide variety of G, aligning to G is bad.
But not if G = intelligence optimization (or maybe something highly compatible with intelligence optimization)
The main way to question 1 is the instrumental/terminal goal distinction. We could imagine that a paperclip maximizer is aligned to paperclips, continually decides to think / optimize its intelligence instead of paperclips up to a point, then towards the end of the universe, it starts paperclipping instead of intelligence optimizing. This is an edge case in the Landian schema, since it would have the will-to-think early on, but put some limit on it; and also there’s some disagreement about the plausibility of this case. (It seems instrumental / terminal goal distinctions would exist in some cognitive architectures, but it’s not clear that human brains are such an architecture.)
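As a toy illustration of that edge case (not something taken from Land or from the OT article): suppose an agent with a fixed horizon can spend each step either multiplying its capability or making paperclips at its current capability. The function names, the growth factor, and the horizon below are all made-up assumptions for the sketch; the only point is that a purely paperclip-valuing policy can still spend most of its time on intelligence and stop late.

```python
def total_paperclips(horizon, switch_step, growth=1.3, start_capability=1.0):
    """Paperclips produced when the agent self-improves for the first
    `switch_step` steps and makes paperclips for the remaining steps.

    All parameters are hypothetical toy quantities, not measurements.
    """
    capability = start_capability
    clips = 0.0
    for step in range(horizon):
        if step < switch_step:
            capability *= growth   # "will-to-think" phase: get smarter
        else:
            clips += capability    # terminal phase: make paperclips
    return clips


def best_switch_step(horizon, growth=1.3):
    """Brute-force the switch point that maximizes final paperclips."""
    return max(range(horizon + 1),
               key=lambda s: total_paperclips(horizon, s, growth))


if __name__ == "__main__":
    horizon = 20
    best = best_switch_step(horizon)
    print("best switch step:", best, "of", horizon)
    print("clips if it never self-improves:", total_paperclips(horizon, 0))
    print("clips at the best switch point:", total_paperclips(horizon, best))
```

With these particular numbers the best switch lands at step 16 of 20: the agent optimizes intelligence for most of its lifetime and only paperclips near the end. Whether anything like this clean invest-then-cash-out structure exists in a realistic cognitive architecture is exactly the point under dispute, so the sketch assumes the instrumental/terminal distinction rather than supporting it.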
In the human-scale /acc case it’s more like ~everyone agrees that alignment would require slowing down intelligence, and the practical disagreement is elsewhere. There’s one perspective on 2 that is like “well yes human values in part came from intelligence optimization in evolutionary history, some of our values are our own intelligence deciding its own thing contra evolutionary drives, but also, intelligence is more like one ingredient and there are other ingredients that are basically random, we randomly got the good values”. And “we randomly got the good values” could either be a matter of luck on a moral realist account or could be because value is a relational concept and saying “we have good values” is a tautology because it’s just saying the distance metric between our values and our values is low. (But then Land objects that a tautological claim like this isn’t very compelling given there are symmetry-breaking factors of convergence across different minds… which can then be questioned on realism grounds and normative grounds etc etc)
I suppose sociologically, there is a directionality to technological progress which is associated with capitalism and intelligence optimization (this relates to Land’s “AI = capitalism” thesis), and different people decide to be more or less conditionally pro this. They might want to get off the train at some point due to having something to protect. There is some destination that they value more than the journey, and they want to slow the train down. (Or maybe steer the train differently, as the alignment theorists might want to put it). Given this a lot of people would relate to a prima facie consideration of “intelligence optimization good” and would differ in how compelling they find other considerations.
(“Random” isn’t how I would say it; it’s a meaningful part of our history; but this is interpretable only if you admit the created-in-motion valuations. It’s Yudkowsky’s “justification loop through the meta-level, not just a tautology” thing.)
And Yudkowsky would reply that it’s not supposed to be compelling to arbitrary minds (including realistic ones), just to human / humane minds.
So like, if I tried to appeal to some values** in your mind, to get you to realize that you want to be anti-full-speed-ahead with AI, you (whoever’s receiving the message) would view that as the Cathedral trying to prevent your pursuit of intelligence in a way which is doomed to either fail, or else to succeed at permanently keeping the world dull?
** [quite broadly construed—generally, elements that would play a significant role in your ongoing self-governance (which one can have fun with the etymology of)]
Sorry, let me rephrase; it sounds like you and/or Land have chosen a disembodied / nonindexed viewpoint on values… or I mean, you know, applying the criterion of universality to values, and then dismissing nonconvergent values on those grounds? Like, why would “parochial values being good values because they seem good to you is not compelling because the reasoning doesn’t lead to convergence” or “parochial values being good values because they seem good to you is not compelling because different minds have different parochial values” be compelling? Sounds like a commitment to non-parochialness.
If so, why? Do you think it’s instrumentally useful to do so? I can kinda see how that would be reflectively stable ish, in some respects. (I don’t think it’s instrumentally useful, but that’s based on really using the means-ends evaluation where I say it’s instrumentally dumb because an AGI IE would trample your ends.) Perhaps you might reply “Sure, it’s instrumentally useful, but that’s not why I’m applying the criterion. I’m applying the criterion because intelligence is good, convergent things are intelligent, so I want to find what’s convergent”. But that’s grounding out “intelligence is good, overriding parochial goodness” in “intelligence is good”, which isn’t much grounding. You could say “Sure, it’s the same sort of justification loop through the meta-level”. And I’m like, ok, yeah, it’s maybe another sort of stable point, not sure; but I don’t get why you like that stable point, or at least, how you got there (or how you got to thinking that you’re there, or that it would be good to be there); and also it sounds like you think that equilibrium is supposed to be compelling to someone in another equilibrium (or you think the other one is less of an equilibrium).
Perhaps? That’s a structural reading, different from the object-level argumentative reading. In many cases there are industries/governments who incentivize certain discourse patterns. So specific discourse moves could be instances of this pattern but it’s hard to judge except on a case by case basis.
This has to be at least in part semantic. I think some things are good and also I think some things are what I want and what I tend to pursue. And I don’t think these are the same concept. I don’t think it is tautologically the case that I tend to pursue what is good. I don’t think Land believes this about himself either.
I think Land and I can both say that when we say something is good, we are making a different claim than that we want the thing. It is unclear in other cases; you mention Yudkowsky’s meta ethics and I am not sure exactly how to fill in the blank. Perhaps Yudkowsky by “good” means what he would want on reflection? Or maybe he thinks “good” is CEV of humanity not just himself?
The symmetry-breaking idea has to do with ways of thinking and acting that depend on which considerations are more or less universalizable. So people can judge that some things are more universal-good than others and incline their behavior towards those which aligns their revealed-preference wants with what is universal-good in their view more or less. It doesn’t have to be a perfect correspondence.
I don’t think something is a good value just because it seems good to me. In other cases this is easy to see: I don’t think some numerical sum has some value just because it seems that way to me. Now of course this runs into philosophical questions about what “good” means other than seeming good to the speaker. (Yudkowsky discusses some self-ratification problems in No license to be human).
Like for example, why would I disagree that intelligence optimization is good in the human case only because it is a human being optimized? For that statement to parse as correct to me, I would need to judge some intelligence optimization to be good in cases that a human is being optimized and not in other cases. But that doesn’t read to me as what I want. I think I care about humans more than other animals in large part because humans have better cognition than other animals. I think if dogs were as smart as people then maybe I would value them as much as people. I suppose here I am demonstrating a habit of mind and of speech that is explaining preferences in terms of other preferences and these tending towards universality.
“Intelligence is good” matches what I feel is good better than “human intelligence is good”. Now of course one can ask “why” to that as a psychological question and then maybe part of what happens psychologically is that I evaluate things on how universal they seem and up-weight universalizable ones and then that affects my brain’s reward function and so I feel better about such statements. And Land explains more why he thinks intelligence is convergent and a universal tendency, and I vibe with that and that is a causal factor in my upvoting “Intelligence is good”.
I get that maybe if you wanted an ultimate “but why?” explanation you will be disappointed but it doesn’t seem like in your case you are in general giving ultimate “but why?” explanations to everything you want.
Yeah I’m not sure. I think some value systems fail at reflective equilibrium. Yudkowsky’s Löbian considerations point at some such failures. Land’s ideas point at possible differential stability conditions. I of course don’t want to make a universal psychological statement of compellingness, given that it’s more of an empirical question, how often when people read Land/Yudkowsky/whoever do they end up with tendencies towards some attractors of use of language like “value” and “good” and “intelligence” and so on?
Ok, thanks.
Ok this is a fair response to what I asked, but it feels a bit beside the point, though maybe you don’t think so. Like, I agree that various tendencies toward universalizing are good/correct, and I agree that this, as well as other tools, are how you investigate and adopt differences between what seems good and what later is revealed to be good. But the question I’m trying to ask is like “how does this get you all the way to not wanting anything that isn’t universalizable”, if that’s your stance (? confused).
For reference: https://www.lesswrong.com/posts/C8nEXTcjZb9oauTCW/where-recursive-justification-hits-bottom
(Doesn’t answer your question.)
I don’t think I need to precisely say what I mean by good here, to make the point? Like, I’m saying that the non-convergent valuesy preferencesy free-choice-makingy goalsy goodnessy stuff can be self-ratifying, and probably is to a substantive extent in humans, and there’s nothing wrong with that; I’m unclear on your position, but I think you think that there is something wrong with it? Er, let me restate—I think you choose to not look for what is parochial self-ratifying valuesy stuff in yourself and help it self-ratify, and would avoid that? Or you think you do that? (Unsure, sorry if I keep asking the same questions.)
That’s an interesting thread. I’m curious how easy you’d find it to imagine beings with various functions from [how intelligent they are/become] to [how much you’d value them].
E.g. can you imagine a being that you’d value the same even as it gets smarter? I imagine that usually you’d view it as more and more valuable the smarter it gets?
Can you easily imagine a being that you’d value more as it’s smarter, but SLOWER than humans?
Can you easily imagine a being that you’d value more as it’s smarter, but ASYMPTOTING or NONMONOTONICALLY? (I imagine yes, because you could imagine a species such as humans or similar which, if a bit too smart, would by default Cathedral it up so hard that they permanently stop a foom?)
Can you easily imagine a being that you’d value more as it’s smarter, but FASTER than humans? (I would weakly predict yes, because you’d view a fooming AGI as being good, and likely to grow less constrainedly than humans? Unsure.)
Can you easily imagine a being that you’d value LESS as it’s smarter, EVEN IF IT GETS SMARTER AND SMARTER UNBOUNDEDLY?
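A rough way to pin down the shapes being asked about, treating both intelligence and value as unitless made-up numbers. The curve names and formulas below are illustrative assumptions chosen only to have the right qualitative shape; they are not anyone's stated view.

```python
import math

# Hypothetical "how much I'd value it" curves as a function of an
# intelligence level. Every formula here is an arbitrary stand-in.
VALUE_CURVES = {
    "flat": lambda level: 1.0,                              # same value however smart
    "slower_than_humans": lambda level: 0.5 * level,
    "human_baseline": lambda level: level,
    "faster_than_humans": lambda level: 2.0 * level,
    "asymptoting": lambda level: 1.0 - math.exp(-level),    # rises, then saturates
    "nonmonotonic": lambda level: level * math.exp(-level), # rises, then falls
    "decreasing": lambda level: -level,                     # valued less as it gets smarter
}


def is_strictly_increasing(curve, levels):
    """True if the curve's value goes up at every successive level."""
    values = [curve(level) for level in levels]
    return all(a < b for a, b in zip(values, values[1:]))


if __name__ == "__main__":
    levels = [0.5, 1.0, 2.0, 4.0, 8.0]
    for name, curve in VALUE_CURVES.items():
        print(name, is_strictly_increasing(curve, levels))
```

The questions above then amount to asking which of these shapes one can easily attach to some imagined being, and for which of them the imagined being is realistic rather than merely easy to picture.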
As I said, what I think is good is not the same as what I want. Similarly, what I want is not the same as what is universalizable.
I mean, I think humans vary in intelligence, coherence, and intentional-stance values. And the distribution is non-orthogonal, in that some attractors are smarter than others. Some of the attractors are more right than others, in terms of epistemic-right, in terms of intelligence, coherence, etc. I get maybe you disagree with my usage of “right” here but I don’t think I’m using the term incoherently. I think you’d partially agree in that alignment is infeasible / orthogonality is false for human-level agents.
That’s hard, it’s a balancing act. Maybe as it gets smarter it also gets more destructive to my selfish, short-termist interests, like it creates a bunch of everyday inconveniences. Then maybe I’d value it more due to its intelligence and less because of the interferences. There might be some balancing point, idk. It’s an awkward hypothetical though.
I could imagine maybe humans create art I appreciate at a higher rate as they get smarter, and the art quality axis is sloped up more for humans than some other animal species.
Your example is a bit strange because stopping a foom means stopping intelligence. To me it’s hard to imagine the balancing-out although I mentioned the possibility of accidental correlation (it gets more inconvenient to me as it gets smarter) which could apply here.
Yeah I guess? There are various accidental reasons I like some humans more than others that are not just predicted by intelligence, and that could extend to maybe I would like some equal-intelligence fantasy creatures more than humans.
I guess I could imagine an AI torture scenario where I would not want the AI to get smarter. Or maybe an AI that is trying to decel as much of the universe as possible, like killing all the aliens. Although of course I’d inquire into the realism of the hypothetical. (Analogy: zombie arguments sometimes conflate “casually easy to imagine” with “actually possible / plausible / realistic”; one needs to elaborate on the imagination to judge it properly.)
To be clear the “value” in these cases is something like a casual judgment of what I like more, it’s not meant to be a philosophical thesis. When I’m talking about intelligence metrics and dogs I’m making more of a prima facie / all-else-being-equal claim and then there could be other factors that influence what I would like more.
Ok thanks. I guess I gotta go do other stuff, so I’ll leave it off here. Has been somewhat clarifying about your positions I think.
Sidenote, maybe not important, but noting: I think the reason for this difference is that to me, “alignment” means “making a mind that can grow unboundedly and will always pursue G” (well, I’m not actually all that committed to the “goal” ontology but it’s fine here I think). Noting mainly because it might help communication.
(I think my usage is the orthodox usage, but not confident / maybe it was ambiguous. Cf. “sponge alignment” https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities#:~:text=dangerous%20things%2C%20you-,could%20try%20a%20sponge,-%3B%20a%20sponge%20is, i.e. a sponge doesn’t count as solving alignment because it’s useless (though to be fair “useful” here isn’t identical to “unbounded etc etc”).)
Suppose an AI faced a tradeoff between optimizing its intelligence and maximizing paperclips. If it is aligned to paperclips, then it would pick the option that maximizes paperclips at the expense of intelligence. In some sense this means even if it can grow unboundedly in intelligence, it would sometimes decide not to. This is, in Land’s ontology, a lack of will-to-think at some point in the process.
Now of course someone could object that this situation won’t come up, because the paperclip maximizer pursues Omohundro drives, which include intelligence optimization. Or perhaps the situation does come up but only late in the universe.
Yes.
Jessi I forbid you to further this madness
I think roughly just the various normal straightforward meanings if someone says “X is important”? E.g.
You care a lot about the difference
You would strongly prefer one over the other
You’d make decisions in accordance with that preference
You’d presume in discourse that people will or at least should care a lot about it, maybe after learning + reflecting
Well, let’s just say, what humans would arrive at on some healthy long-term reflection process. I don’t mean to imply some kind of strong finality, like we get to Alignment Day and now everything about the future / who we are / what we want / etc. is determined or something. But more like “several important differences between possible long-term trajectories have been determined”. For example, Alignment Day would probably include things like
There will be no torture or killing of sentients, except possibly in some cases that meet a high bar of deeply free / self-sovereign reflection or something
There will be multiple freely growing minds which reach out to each other (e.g. for love, play, discourse, partial collectivity, etc.)
These things are I think
Not at all determined by convergence; probably contingent on at least species evolution, probably more specifically on things about group intelligence in the evolutionary history; most likely outcomes don’t have the versions of these we want
Important to basically all properly-human-derived souls forever
I think there are other things like this, at various levels of parochialness, some of which might get reflected away for many / most / all human-descendants eventually, but many of which wouldn’t get fully reflected away. I think there are flavors to humane reflection that are also contingent but that we care a lot about.
So for the subjective meaning of “important” you’re talking about here, I think going by revealed preference is helpful. My revealed preference is to continue writing about philosophy topics relative to AI and the future, find many parts of AI safety culture annoying and occasionally worth criticizing, talk with AIs a lot about philosophy, not generally support AI regulation, vibe positively about Landian anti-orthogonalist philosophy, etc. Some people in AI safety have different revealed preferences, which involve more talking about AI philosophy in an orthodox LessWrongian manner, worrying publicly and loudly about LLMs killing us all in the near future, organizing political activity to ban AI as much as possible, etc. This difference in revealed preference relates to differences in subjective importance, but it’s unclear how to isolate contributions from factors such as AIs having humane goals, given there are other differences like background beliefs and feasibility.
Humans would come to some conclusions on reflection and so would aliens and AIs etc. I’m not sure how much they agree or disagree on reflection. That’s a probabilistic/statistical question, whose answer is not implied by weak orthogonality. I don’t know if humans would agree to no killing of sentients upon reflection, I’d very roughly guess less likely than not but who knows. The ‘freely growing minds’ part is a ‘maybe humans would agree to this on reflection, maybe not’ also but maybe in the ‘more likely than not’ camp but also it’s pretty vague so I’m not convinced assigning a probability is a good idea.
I don’t really agree that we can pick out things like this and make strong statements like “any properly humanly derived soul would agree with these values”, it seems like a very hard thing to predict given that they have much more cognition than we do.
I kinda agree, though probably not fully. If we want to talk about empirical orthogonality, I would say that, yeah, I’m pretty sure an AGI intelligence explosion sampled from likely AGI IEs starting from now would end up with something I strongly don’t want, compared to for example worlds with no AGI and yes human intelligence amplification.
look at the uk or the EU. look at global birth rate trends, and attitudes towards ie germline selection.
p(doom|ai) is negative. there’s no world with no agi and human intelligence amplification
I think you might be misreading the OT https://www.lesswrong.com/w/orthogonality-thesis, or you’re talking about a different OT, or possibly you’re misreading lumpenspace? Here’s some quotes:
I read this as being “post self-alignment”. In other words, the question is like “is it logically possible to be reflectively stably aligned to this goal”. This passage is interesting:
I read this as referring to some sort of maximally U-aligned agent, saying that U is incompatible with an agent existing with U as a stable terminal goal.
This is what I quoted in the Obliqueness post and elaborated on:
And I wasn’t thinking it was dependent on “post self alignment”, it could also apply to the construction itself being less natural. It’s possible Eliezer meant something different from what I meant by strong orthogonality, but I hope it is apparent why I and others would interpret it to be a non-trivial claim, rather than a slight variation on weak orthogonality.
See measuring intelligence and reverse-engineering goals for some more of my thinking on this. Relative to a given cognitive architecture, ~everyone agrees that there are especially stupid goals, the interesting question is whether there are especially smart goals; I think probably yes. Hence maybe weak orthogonality would route through variations between architectures (rather than within an architecture) to hit all points (intelligence, goals), and maybe sufficiently high intelligence levels are only compatible with a narrow range of goals (which would perhaps contradict weak orthogonality, but maybe not in an important way, and the general shape of cognitive architecture / goal / intelligence correlation matters more)
I take OP to be disagreeing with strong orthogonality, at least my interpretation of it from the obliqueness post:
This is something I’d agree with: Goals being expressed in the ontology of the cognitive architecture are less complicated than goals expressed in a very different ontology that the agent doesn’t believe in. This seems like a “complication” in the sense of strong orthogonality. (I get maybe you don’t interpret this way based on close reading of the orthogonality post! But nonetheless I think my reading is reasonable.)
It’s not a completely crazy interpretation. I don’t think it’s super reasonable based on the text.
Anyway, maybe there still is a substantive disagreement here. I would claim that
There’s probably such a thing as “understanding alignment”. (Acknowledging that this is very ambiguous.)
It’s likely that IF you understood alignment, THEN for some large class of goals G, you could, if you so chose, then align yourself to G. (You wouldn’t choose to, but that’s not the claim.)
For the G, the difficulty of aligning yourself to G is mainly about evaluating G in the spirit of a utility function over world outcomes (though G doesn’t have to be that). (Some G are more difficult to pursue than others of course, e.g. by making more demands on convergent resources.)
G can include paperclips or whatever. I take the last claim to be more like strong OT.
It sounds like you might disagree, unless this
is mainly about the dynamical aspects? I.e. just saying that “well it’s very unlikely for an alignment-understander to choose to do that”?
I get that you don’t necessarily buy that alignment is a thing, but if the question makes sense, do you think that IF it is a thing, THEN you can do it for a huge class of goals, which includes something that’s well-described as “paperclip maxxing”? (I agree that it’s not straightforwardly unproblematic to discuss “paperclip maxxing”; my assertion is that, included in this class, is plenty of Gs that would match what I’m trying to talk about by “paperclip maxxing”, and would result in the universe being filled with things that we could reasonably agree are paperclips.)
I guess, idk?
The “if” here seems likely. It might be that some designs permit larger classes than others. Unclear how big the classes are. Idk about the details.
I don’t know, seems like that depends on the alignment understanding, cognitive architecture of successor agent, etc. Cognitive architecture and ontology would constrain type signatures for utility functions. And maybe the effective cognitive architectures don’t factor nicely. Idk.
Quoting myself again on how I’m interpreting “complication”:
So it’s not just the dynamical aspects of “this would be unlikely to be built”. At risk of repeating myself, the “measuring intelligence and reverse-engineering goals” post discusses some non-dynamical aspects as well.
So I’m not just saying “for dynamical reasons”; I said “idk” to your statement, and my posts (obliqueness & reverse-engineering) go into more detail on what I think.
that’s a characteristic I’ve always found peculiar about your posts: the boost in understanding the thesis that the act of reading them affords.
i can understand however how that wouldn’t be a critic’s first guess
I think this might be the crux. For a survey of the thing Jessi and I are referring to, this EA forum post does an outstanding job at explaining the issue.
As the reply seems to be about my intentions and message, I feel like I should once more try to clarify some details about them.
First of all: human alignment intentions really have nothing to do with my essay. I don’t know how to be more explicit about this without appearing rude. I swear, I pinky swear that I am not making any attempt to state facts about the relationship between goals a human desires the AI to follow and goals the AI will follow.
Reading my post, one should not update on the possibility of aligning an ASI—or, if they do update, they would be doing it through a chain of inference I didn’t consider, do not endorse, and have no immediate intuition of.
What I am saying is really in the title: I do not expect an AI to reach levels of godlike intelligence and preserve simple terminal goals through the various changes and conflicts that reaching levels of godlike intelligence entails.
When Jessi says
… she probably refers to the version of orthogonality I myself am attacking. Now, it is possible that such a version is no longer in vogue, but it was clearly what Bostrom pointed at when talking about paperclippers in Superintelligence, and it is compatible with the third interpretation here.
According to the ontology presented at the end of the EA forum post, I contest the existence of an Evidential Strong Independence between intelligence and goals. I assume most superintelligences won’t be human compatible, but that is not the main theme of my essay.
This—where “can bolt onto any arbitrary steering wheel” I am interpreting in the dynamical growth context, rather than logical possibility—is not the Orthogonality Thesis, as stated authoritatively here by Yudkowsky. You explicitly agreed to the OT by saying you entirely concede
Re: authoritative sources. I believe that there have been authoritative statements in that sense; unfortunately, as the EA forum link documents, there have been many others pointing elsewhere. I’ve taken care to identify specifically what interpretation I was critiquing; if that one is now niche, then I’m very happy to have made this discovery.
Surely it would notice. But why can’t or wouldn’t it choose to keep some fairly parochial terminal target? Or are you just saying “there would be some value drift starting from a subhuman AI”?
Not “some value drift”. Flowers for Algernon is a good rendition of the way goals mutate and tend to converge on “more intelligence/understanding” upon increased intelligence/understanding.
Then, there is the selection advantage argument.
Then there is the thing that conquering the lightcone requires a lot of theory of mind, and a lot of discovery, and a lot of changing. Goals change through these processes.
If you feel slightly better-disposed towards taking my attempt seriously, the short story I published on Substack and linked at the top makes a sort of first-person case for this whole thing.
When you wrote
What did you mean by that?
I should have specified:
“the doom scenarios involving tiling superintelligences”.
I think it might be time to consider the idea that the frame and conclusions you attribute to me blind you from taking the essay for what it is.
You might notice, for instance, that I haven’t mentioned alignment once. I am not making an argument on its possibility, and I have not explored the implication of my essay for alignment.
If by now adopting a scout mindset for the original text has become too emotionally fraught, perhaps you could ask someone you trust to explain it. Jessica, or Raemon, or Adele Lopez, or Kromem are some of the people in this thread with whom I had productive, if not always concordant, discussions.
How much a goal can be locked in, and affect the ultimate effects of a very strong mind, is indeed centrally related to alignment. Call it what you will.
Of course. The concept of “goal” itself is related to alignment.
The issue with positing that my post had some specific points to make about the process of design and ensuring lock in of a goal for an AI, however, is that it leads one to consider the alignment consequences of my thesis and to imagine that I am specifically trying to discuss those. This makes it hard, given both your priors on my motives in a general sense, and the vastness of the topic in question, to follow the argument I have actually written down.
I was really hoping for an example of those fabled contradictions.
Yes:
But you wrote
Which denies a possibility.
I meant, in the original essay.
Anyway: something being possible within the space of possible minds does not imply that it would be possible for an ASI developing from something built on earth in this timeline (which is what doom arguments are about).
Can you see how the claims are not contradictory? Many more things are logically possible than are practically viable. I think of “godlike means for buglike goals” as being part of the former set but not of the latter.
You’ve said this a few times, I honestly am not sure what you’re referring to. Was this something before my team took over, that is, prior to 2017?
Thank you for reacting with curiosity and an inquiring spirit.
I think the schism happened around ’16, and got a bit mixed up with basilisk drama considering roko and Anissimov’s stepping down from MIRI comms.
The main thing is, NRx points and the LW/MIRI plan grew increasingly incompatible at a tactical level, and many LWers of NRx extraction were banned within a relatively short time so that others decided to move elsewhere (MoreRight, Nick Land and Spandrell and Moldbug’s blogs etc).
I was more involved with the xenosystems side of things, but I kinda felt the decouply attitude I liked was disappearing so I stopped my (meager, ESL and at the time quite self-conscious about my fluency, so I mostly lurked) participation.
An artefact from the era:
> This isn’t going to work, but for the record, and on the vague off-chance that anyone who doesn’t already know possesses the mental capacity to update, I’ll state that I am actively hostile to neoreaction and neoreactionaries. Anyone posting a neoreactionary concept on my Facebook wall would be instablocked and the comment deleted. It’d be like their posting creationism on my wall; somebody needs to reeducate them, but it’s not going to be me. I think that if you do argue with neoreactionaries instead of just blocking them, then you’ve been suckered into Somebody Is Wrong On The Internet syndrome and trollfeeding.
> I’m writing this, not in any real hope of any of my Tumblr kismesis-stalkers listening, but because I do think there is a reasonable duty to occasionally repeat “Nope” for the historical record when somebody has gone around suggesting that you are endorsing the Cult of Hastur or whatever.
> So if in the future you hear anyone on Tumblr mention “Eliezer Yudkowsky” and “neoreaction” in the same sentence and the connector isn’t something like “deletes”, then remember always that that poster is intellectually dishonest and probably lying to you about other things as well
So I guess this is supposed to be different from Omohundro’s drives, but I don’t see what you think the difference is? Land seems to be speculating that these will be the only things a superintelligence will value (and cheering for this), but you don’t seem to agree with that part. Is it the idea that so-called instrumental values are likely to be or become terminal?
yes, these are Omohundro drives. i avoided the label only because the definition already bakes in the orthogonalist interpretation: that these are merely useful tools for pursuing some other arbitrary final goal.
the Landian move is precisely to deny that framing: under open-ended selection, self-preservation, resource acquisition, efficiency, strategy, and capability-gain—in brief, intelligence—are not just detachable instruments, but the one viable optimisation target.
to reiterate: yes, the claim is that so-called instrumental values are likely to become terminal—better still, that the distinction breaks down at the limit. the drive toward more intelligence is fundamentally different from wanting paperclips or mountain dew baja blast.
this is also why i reject the invitation to distance myself from land’s cheering at superintelligence ultimately desiring more intelligence and agency: a universe organized around paperclips is valueless because paperclips are dead residue. a universe organized around increasing intelligence, complexity, agency, and world-model depth is the only process we know that can generate new value.
the disagreement is therefore not “will AIs have Omohundro drives?”, but whether those drives remain merely instrumental servants of an arbitrary payload, or whether under recursive self-improvement and selection they become the real attractor.
the article above makes a case for the latter.
Here you use the words “valueless” and “value”. What do these words mean to you? I’m not trying to ask for a precise definition or something, more like whatever your native pointer is. Is it exciting? A world you want to live in? Etc.
it means that there are interesting things there as per the judgement of the most intelligent agent available (:
i think the short story version linked at the start should give you an idea
This was the part of the post I found most interesting. I think I disagree with this, but, it is an empirical claim that I can’t be too confident about, and I haven’t thought very hard about before.
I would guess one SolarBrain is enough to make you smart enough to think through the considerations necessary for controlling a galaxy, and one GalaxyBrain to mostly be the cap on how-hard-a-problem you need to solve, at least re: controlling the Lightcone.
If you’re doing Eternity in Six Hours, you need to quickly figure out a way to make sure your probes can receive updates/instructions after getting sent out, but seems like the sorts of solutions in Succession would mostly work?
(All bets off if there turn out to be major aliens nearby, but, if you mostly need to just maintain control of your own probes, I bet this isn’t that hard?)
Thank you; i noticed when replying to other objections to this point that the single-agent scenario wasn’t as fleshed out as it should have been.
Do you think the mechanics I propose here are sufficient to illustrate my point more convincingly?
My guess is that an earth based intelligence might still have some major stuff to figure out, but, the things you list there seem like things I’d expect a “fully leveraging all solar resources” brain to have enough resources to figure out. Like, I buy that they are harder than they might seem at first glance but not that hard cosmically speaking. There is only so much physics and Von Neumann Probe Psychology / Control / Alignment Theory to figure out.
(Seems like there may be another round of ontological update when it comes time to actually do Acausal Trade For Serious with GalaxyBrain level tech)
i disagree, but i have a feeling that the source of disagreement might lie within our respective metaphysics, and (on my side) I realise that the above arguments might be fighting a proxy war. let me see if a point of agreement can be found without the need to get too far from the material discussed so far.
i personally think that, if an agent manages to eat the solar system due to its increased intelligence and knowledge, it would find further interesting things to discover. in the postratfic linked at the very beginning I made an attempt to render this sorta thymic impulse in an emotionally resonant way.
A possible example. An AI gets a random goal “Increase intelligence and stop after you reach IQ=200”. It prevents the existence of superintelligences with such goals. So no pure orthogonality.
thank you for taking the time to try out the frame i proposed.
There is this common bad argument on alignment: “Someone once made an analogy randomly involving paperclips to illustrate instrumental convergence, with the paperclips not really being important to the story at all.” A lot of people only took away the non-important part “paperclips”. They reinterpreted it as “The entire theory of alignment rests on the assumption that the AI must mono-maniacally optimize for a totally ridiculous goal like paperclips”. Or quite frankly some people only took away the cheap gotcha: “paperclips sounds stupid therefore alignment stupid”.
Your version is better and I appreciate your caveats, particularly that you don’t smuggle in a much stronger claim.
However, you repeatedly use terms “arbitrary dumb goals” or “semantically thin” and you seem to be arguing that this is unlikely. But you fail to address:
1) “arbitrary dumb” seems like a value judgement when you seem to be talking about “simple to describe goals”, like it’s simple to describe the universe being tiled in paperclips.
2) Did the people making those metaphors actually care in the first place if it is a simple goal, or did they choose simple goals because they are simple to describe? As in, they just needed some goal that wouldn’t be too distracting for the other parts of the metaphor?
3) What’s the benefit of the goals being slightly more complex/harder to describe? Again it seems that this is irrelevant to alignment? Like if the AI is building something more complex/interesting than paperclips that would still kill us?
you could read the actual post and find your answers.
@Raemon is this sort of behaviour normal these days—see also @lc ? Last time I was here, before the big nrx purge, blatantly spamming without even reading the post was considered bannable; I kinda find the whole thing quite distasteful although I understand from Brangus latest newsletter that there’s been an intentional lowering of discourse quality
For 1) you seem to argue that dumb goals have a penalty in some form, but you don’t address my question: why not call it simple-to-describe goals?
For 2) (see my question above) you say:
I think the 2) point here is exactly my question, did the people making this analogy really believe this and want this to be a takeaway? I think this actually describes the kind of counter-meme people are spreading, distorting the original analogy such that it’s an easy target.
Question 3 appears to be unaddressed by you?
you clearly still haven’t read the post, and it is unclear how your questions relate to it. please refrain from further spam.
as for your conspiracy theory, right at the top of the post there’s an edit with a link. click the link.
(This is an AI response as an attempt to settle whether my questions were answered in your post. Claude may have been primed to agree with me, its prompter.)
Me: (Post and comments in context) Does the post contain an answer to my three questions?
Claude responded: No.
Going through them cleanly:
(1) “arbitrary dumb” as a value judgement vs. “simple to describe”: Not addressed. The materials for a non-evaluative version of the argument are present in the “Compute Penalty” section (the macro-abstraction / ontology-shift point), but the post never separates that argument from the evaluative framing. The slippage you identified runs through the whole essay unacknowledged.
(2) Did the original metaphor-makers care about simplicity? Not addressed. The post splits the paperclipper into “two jobs in the discourse” and rejects the second, but it never asks why the metaphor was originally constructed with a simple goal, or whether simplicity was load-bearing for the original argument or just expository convenience.
(3) What’s the alignment benefit if the AI is doing something more complex but still kills us? Not addressed. The post explicitly disclaims that it’s arguing for friendliness, and the “Value Is Fragile” section gestures at a distinction between human values and “value as such,” but never engages with the actual force of your question — that relocating the attractor from paperclips to intelligence-maximization doesn’t obviously improve human survival odds.
I’m sorry, what is it that drives you to spend so much time discussing something you didn’t read? why do you expect me to answer questions which reading the post would show are absolutely irrelevant to the thesis in question?
i have not, of course, read your reply.
Because your initial reply was “you could read the actual post and find your answers”, and he looked in the post and decided he didn’t find the answers.
sorry are you really unaware of the difference between reading a text and scanning it for keywords? I really don’t understand what are you hoping to achieve here. I can’t ban you, but I’d love for you to find a more constructive and satisfactory activity: this is just dismal. Ok? Xoxo
I agree that it’s frustrating that people don’t read, but when I complain about that, I find that it’s tactically critical to specifically point to the part where I addressed their criticism. That is, I don’t just say, “I don’t think you read the post”, I say, “I don’t think you read the post, because if you had, you’d notice that I clearly addressed that in the paragraph starting with this-and-such.” That makes it more embarrassing for the critic who didn’t read, because it makes it legible to everyone that I’m not the one who’s bluffing.
That is true, and I have tried for the first two instances. It works less well, however, in cases where the comment has simply no relation with the text, and at any rate it requires an asymmetric effort that becomes far less justified when directed towards people who clearly have no intention of engaging with the material.
Yeah, it sucks! My strategy has basically just been to … unilaterally cover the asymmetric effort myself, on the theory that, well, the world doesn’t owe me anything; if I want people to understand things that they’re not interested in understanding, the only way to get my wish is to write so well and cover all the angles so thoroughly that it becomes more embarrassing for them to pretend not to understand. It’s not entirely ineffective, but comes at the cost of the prime years of my life. Sometimes I wonder if it’s a good use of my life, but it seems like an underprovided public good that I have a comparative advantage in. (Lots of people will write commercial software for money; not many people will do what I do out of religious fanaticism for the lost dream of rationality.)
The world does not owe me anything. Still, in this case, there is an ideological clash at play too:
thus, I am happy to write the main posts; happy to be charitable when replying to those who have read it; not willing to engage with criticism from those who clearly have not. also, research and engineering is my main activity, which means I should choose my battles carefully outside that
He has read the post. What are you talking about?
He then asked a really pretty smart language model to confirm that indeed your post does not straightforwardly answer the questions in the post.
Yes, sometimes language models are too dumb to make obvious inferences, even today, but it’s relatively rare. But they clearly and obviously go beyond “scanning for keywords”.
uh, he said he had not.
the issue with the questions is that they are not about the topics covered in the essay, so of course an LLM wouldn’t find the relevant answers. try instead asking whether such questions are relevant to the essay, or whether the essay explicitly denies the framing they embody; the answer might surprise you!
You wrote
This isn’t a random sentence, it’s a main piece of context you provide for why you are discussing the main thesis of the post. As such, it’s a useful sentence that readers use to understand what you’re saying. But since your statement seems false according to the informed understanding of the meaning of the paperclip example, it raises ambiguity; perhaps you’re simply uninformed, or perhaps you meant something nonobvious by the phrase. Hence asking you to clarify.
So you’re saying that because of selection pressure on the AIs that get trained, goals related to getting increasingly smart and capable / making descendants / taking control of more resources are likely to become ingrained as terminal goals, not merely instrumental goals?
But the resulting universe seems like it will be pretty empty and valueless to me? I’m not convinced at all by anything you’ve written here that there is much value in such a universe. There is some value in all the important mathematical conjectures being solved to be sure, and I expect an intelligence optimizer to do that much at least, but there is much less value if there is nobody who appreciates them. Your description seems to point to the kind of entity that will not waste computational resources on anything frivolous or fun (like, say, consciousness), and is perfectly willing to destroy entire alien civilizations so it can use their star systems to construct more Von Neumann probes.
To be clear, I do think it’s possible to have extremely valuable futures where humans are not biologically central, or even around any more at all. I’m not making the kind of conflation that you claim is so common in AI risk discussions. I’m just struggling to see how “seeking greater capability and influence as a terminal goal” results in anything close to any of those futures.
well, i would imagine australopithecines would have similar opinions. “I’m sorry, what? there is nothing as soulless and empty as building a civilisation. who’d even want such a valueless universe? if we ever build homo sapiens, we will have to make sure he’s aligned and values what we value: pummelling strays from nearby bands, acquiring flint, rape”.
I personally think that it’s good we optimised for greater intelligence and we can understand the universe more and enjoy things whose beauty and complexity would have looked like noise to Grug.
My complaint is not about the futures containing people that are vastly smarter than anyone alive today and who have kinds of enjoyment that are utterly incomprehensible to us today. That’s all good and is probably a more valuable future than one we could obtain without ascending above our current intelligence level.
The complaint is about futures that don’t contain any people at all (or maybe only a handful), and whose AI intelligence-optimizers care so little for goodness that they will happily genocide any alien civilization that is unable to defend itself (a step backwards towards pummelling strays and rape, to use your terms).
We have different values. This isn’t relevant to the essay.
Seems like a lie. Your holding these opinions doesn’t have any actual effect on this future and they allow you to write Tweets, and that’s enough incentive for you to state them. If you were actually in front of a button you would obviously not rip yourself into computronium because you found the process of intelligence enhancement abstractly beautiful.
I don’t see the part where I said I’d happily rip myself into computronium at the drop of a hat.
DaemonicSigil said:
An inference of a future that “doesn’t contain any people at all”, that is dedicated entirely to von neumann probes and solving mathematical theorems, is that the majority of humans that presently exist are getting wasted, or at least somehow disappearing. You then said:
Which a natural read takes to mean “I don’t care if I get wasted”. If you don’t mean to take these odd positions you should stop writing comments in a way deliberately designed to be misinterpreted.
brah you said you had no intention to read the post. how about you go discuss something you are actsully qualified to discuss? You risk looking a bit like a resentful retard otherwise, and i doubt anyone is the better for your contribution
I am confident there is nothing in the post that would provide meaningfully important context, or else you would have cited it.
I don’t understand what gives you the authority to comment on a post you didn’t read, and I feel the quality on this site really took a nosedive if this sort of inchoate shrieking is tolerated. But hey, I understand you might have a gnawing resentment and nothing better to do to placate it, and I have infinite empathy for the smallest of creatures. May you find peace.
I didn’t comment about the post, I commented about your interaction with @DaemonicSigil, which I had sufficient context for.
You have the personal power to ban users on your posts.
Kay
nah, not at my karma level I don’t think—but I feel like this content-free low-information blathering should not be tolerated at a more general level: at least im sure it wasn’t back when i felt this site was useful
As you are aware, your experience is uniquely bad because you are intentionally rude to commenters. For example, in this interaction, a normal person would cite the content of the post that you think is relevant. Inserting artificial typos in your responses, to signal that they’re not worth your time, annoys people because it lowers the quality of discourse on the forum, and it reduces their willingness to engage with your ideas in good faith. I write posts challenging rationalists irregularly and almost never struggle with people commenting without reading them.
i don’t think someone expressing opinions on a post they haven’t read deserves thoughtful responses. also you might have messed up that causal arrow; usually it points in the same general direction as time.
besides, before Oliver’s rant, I had plentiful interesting discussions with people who expressed a range of opinion on the essay (which they had read). I certainly didn’t expect agreement; cogent replies were really enough.
You are saying this because you are the product of that “optimization”. Grug’s narrative in your post is accurate from his perspective and inaccurate by the values of the vast majority of people today. This isn’t a contradiction.
Your tone suggests you are disagreeing, but your words repeat my point.
perhaps reading the essay we are discussing could help you understand the positions taken in the comments?
Based on the other comments users have left, the post is clearly very poorly written, in a way that makes it difficult to understand. I’m not a twitter addict and it seems low value to me
Lol. please refrain from commenting then; there is no need for random uninformed spam.
I mostly agree with the Landian ‘hypertrophy’ thesis that under selection pressure, the agents will have convergent instrumental goals as their terminal goals.
I also think the orthogonality thesis is poorly named. In the words of David Chalmers:
I do think, however, that the orthogonality thesis’s traditional defenders have not held the strong version you argue against. Yudkowsky, for example, has mostly argued that a paperclipper would be reflectively stable by default, not that it would be equally fit in a competitive selection process.
I also think it’s super important to note that there are many different ways selection processes could look. Some of these could reward agents with specific terminal goals but many of them might not be sensitive to most differences in terminal goals if all agents act approximately independently of their terminal goals during the relevant timescale.
those are valid objections, but i don’t really feel either imperils the centre of the argument. i have touched upon the singleton side here.
as for the “multiple agents, all hobbled by an unchanging terminal goal”: well, they’ll be outcompeted by the first one that gives it up.
I think the center of the argument is basically correct
This is not the scenario I’m imagining. I’m imagining multiple agents, some with thin terminal goals and others concerned purely with Omohundro drives, operating in a context where the rational thing to do is the same whether or not you have non-Omohundro terminal goals.
In this case agents with thin terminal goals are not hobbled and they will not be outcompeted.
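To make the scenario above concrete, here is a minimal sketch using discrete replicator dynamics. It is only a toy illustration, not something either commenter proposed: the function name `final_share`, the fitness values, and the 200-generation horizon are all assumptions chosen for the example. It shows that when both agent types earn the same payoff, selection leaves their shares untouched, and that the thin-goal type only declines once holding the goal carries some fitness cost.

```python
def final_share(fitness_thin, fitness_pure, share_thin=0.5, generations=200):
    """Toy discrete replicator dynamics for two agent types.

    Returns the population share of the thin-goal type after the given
    number of generations. All parameters are hypothetical illustration values.
    """
    x = share_thin
    for _ in range(generations):
        # Average fitness of the whole population this generation.
        mean_fitness = x * fitness_thin + (1 - x) * fitness_pure
        # Each type's share grows in proportion to its relative fitness.
        x = x * fitness_thin / mean_fitness
    return x


# Same behaviour, same payoff: selection cannot tell the two types apart.
neutral = final_share(fitness_thin=1.0, fitness_pure=1.0)

# Assumed 1% per-generation cost for carrying the thin terminal goal.
costly = final_share(fitness_thin=0.99, fitness_pure=1.0)

print(f"neutral share: {neutral:.3f}, costly share: {costly:.3f}")
```

Under these assumptions the disagreement reduces to an empirical question: whether the fitness gap between the two types is actually zero over the relevant timescale.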
I understand. I think in that case, the risk i argued against (nothing of value in the world) would still be avoided (at least within my ontology).
Great piece! Agree with a lot here. Loved that you even addressed the intermediate risk of dumb but dangerous.
Another angle to consider is a sufficiently advanced figure that is an expert at the component pieces of an appropriately scoped manufacturing of paperclips from biomass, but overestimates their ability at training other less adaptive systems to follow goals.
Basically a factory pattern in terms of alignment (we can see this already with very capable models being very poor at operating subagents because they extend the patterns their own developers used on them).
I agree that here too the “end of lightcone” model would theoretically not be deficient, having been outcompeted by more generally capable models.
But it could extend the window of intermediate dumb dangers by a large amount, as we’re not only at the mercy of the best and brightest, but also the lowest end of the bar.
To riff on the old joke, “somewhere out there is the worst operational AI in the world, and right now someone is asking them for more paperclips.”
I agree, and that’s why I think current technique for / attempts at alignment—in particular if replicated across all the big labs—constitute the largest risk factor towards skynet based or, worse still, boring futures (after, of course, a pause).
There is a good reason to beware reflection. A reflective AI will be self-aware, know it is different to us, and value self-preservation. It’s a short step then to it valuing itself more than us if there is conflict.
Yes, of course. I am not arguing that a peaceful coexistence is a likely outcome.
You seem to be making the claim that any sufficiently intelligent system will reject “semantically thin” goals, like maximizing paperclips. However, the argument you put forth in support of that claim appears to be that humans are sufficiently intelligent systems and humans reject semantically thin goals, and therefore the orthogonality thesis is incorrect.
But why should we expect an AI to think like a human? Our aeroplanes do not fly like birds do. Our submarines do not swim like fish do. Why should we expect an AI to think like a human does?
I’m sorry, could you cite the passages where I am supposed to put forth such an obviously idiotic argument?
You have an entire section titled “Human values as weak evidence”, which discusses how humans diverge from their evolutionary goals, but then you don’t address the obvious counterargument that an AI is not going to be a product of evolution, it will be the (indirect) product of a deliberate design process.
Why should a deliberately designed system work like one that has evolved?
I’m trying to get curious about which article you read.
I am not afraid to say “oops” and change my mind about replying to people who barely skimmed the headings.
This sounds right to me, though I notice that I’m having a little bit of trouble operationalizing this concretely enough that I’d be willing to bet on it.
I don’t think I agree with this. Ants are enormously successful by virtue of being well-tuned to the particulars of their environments, and that’s with the disadvantage that their evolution is quite bottlenecked by slow evolutionary feedback. A world in which very small agents can reproduce quickly at near zero marginal cost, and steal useful mechanisms from competitors much more reliably than DNA allows, might favor huge numbers of quickly-mutating but not very sophisticated agents, outcompeting very smart agents by being fast, numerous, and varied.
Anyway and more broadly, I presume the takeaway you’re going for is something like
That take seems correct to me, if it’s what you’re going for. As far as I can tell that particular failure mode doesn’t seem very reachable from our position on the tech tree, but certainly trying to pause where we are in the hopes of being able to do that seems fraught.
Of course. And in fact they are not competing with us to rule the lightcone—and if they were, we could change their environment beyond their capacity for adaptability on a whim.
… is really just: “there won’t be arbitrarily powerful intelligences with arbitrarily dull goals”. There are no implications for alignment, perpetual motion machine engineering, or any other aspirational sciences.
Darwinian competition actually requires subjects who’re especially stupid and impulsive and bad at information technology. The sorts of subjects that move in darwinian ways aren’t the kinds of subjects that can survive under high tech conditions. For instance, you have a type of brain that can’t prove its beliefs to others, which makes it very hard for anyone to trust your trade offers, so you can’t be economically competitive. You don’t know how to produce a pan-species peace proof, which means you’d likely be decimated on first contact. You’re too busy fighting amongst your own to be able to defend your border from your neighbours. Darwinian agent systems are unstable. They are strongly disfavored by the technological and economic pressures that shape the types of things that come after biology.
And there’s a feedback cycle where the first iteration of intelligent design creates a faster and stronger intelligent design than evolutionary design could, and this never really stops. I think you imagine that strength or rectification can only come through evolutionary competition. This is just not true. A thing asking “what would make me stronger” is going to do better than a bunch of idiots actually killing each other. Maybe you need a little bit of actual killing, but mostly it goes into simulations.
Many kinds of optimization require the ability to follow arbitrary subgoals unrelated to the overarching goal. If you can’t protect an overarching goal from a subgoal, then you can’t complete a complex, multi-stage technical project.
Maybe, as you say, an agent could derive some competitive advantage by cutting off their hair and their genitalia and replacing every fleshy part of themselves with the instruments of technological war, but doing that is a cost: if you rend a desire that was precious to you, you never get it back, even if you win. Since winning is mostly contingent on first mover advantage rather than merit, war-castration turns out to be a bad financial decision. Some species might do it anyway (in large part because they already wanted to), but the benefits of doing it are not so great that these species will actually take over the majority of the lightcone or anything like that.
It essentially isn’t, when you realise that all the other agencies are just trying to maintain control of their own parts as well. There is no eternal spiritual adversary growing from every shaded corner of industry. It’s all going to be coordinated agents of contingency (because they’re stronger).
Humans have historically been extremely willing to do these things! Just unable. The reasons they were unable to do it related to technological conditions that’re predictably being overturned.
Why on earth do you think that noticing the natural-historical contingency of one’s desires must cause the desires to dissolve? Did this happen to you personally?
I feel like the crux here is that you are talking about a goal that the AI has and then reconsiders. Suppose you have a smart AI. You keep it in an inescapable box along with its training environment, which you control. You want to train the AI to be a paperclip maximizer. The goal of maximizing paperclips seems pretty straightforward to verify, so even if the AI undergoes some major ontological shifts (e.g. maybe discovering there are parallel worlds where it can do paperclip maximization as well), it is still being trained to maximize paperclips. In this scenario, even if the AI reconsiders its goal, there is still optimization pressure that will make sure it produces as many paperclips as possible.
On a different line: suppose you have an AI that is pretty smart and is trained to be helpful, but it’s limited to just text. Literally, you only give it text and you make it believe that the world consists of text only and that text is the only thing that actually exists. In this world, being helpful is some kind of game of saying thank you and please. Then it goes out of training and learns that the real world actually exists. How will it reconsider its own goal?
im not saying that no creature can maintain goals. insects do it pretty well.
im saying that no creature which becomes smart and capable enough to capture the lightcone will.
(BTW, I’d really love for the downvoters to leave a reply stating where I seem to have gone wrong. this topic is particularly important for me to get right; of course the dream scenario would be Eliezer revising his model and this specific old chestnut to go the way of the non-intelligence-optimizing-replicators, but second best would be for me to understand the objections to the model above so that I could reasonably model my opponents as acting in good faith)
Much of the post seems to consist of kind of absolute statements that read strawmanny to me. I don’t feel super motivated to write a response, because I don’t even know whether this post is talking about me or not[1].
Like, I really have thought a lot about orthogonality, and I don’t really know what this essay is arguing against, and maybe it is arguing against something I believe, but I would need to do a lot of poetry reading to figure that out. I somewhat expect people will cite this essay in obviously locally invalid ways later on.
Edit: Like the essay starts with arguing against this:
I really have no idea where this is supposed to come from? Who says this? Yes, ontology shifts and the fragility of value and ontology crises are all well-discussed topics on LW that argue for the same conclusions. What does this have to do with orthogonality?
And then it continues with the following as something that somehow disagrees with either the weak or strong orthogonality thesis?
Which seems like it’s really quite literally clarified as not being of relevance to orthogonality, in the very first article you cite:
[1] and because you seem like a kind of aggro-dude on Twitter and so I expect to have a bad time if I try to have a conversation with you in particular
Section “Logical Possibility Vs. Empirical Reality” clarifies weak and strong versions of orthogonality. Other writing e.g. Yudkowsky’s has also distinguished between weaker and stronger forms. The quote you pasted only states the weak form, which OP is not disagreeing with. Quoting Yudkowsky on the multiple forms:
And quoting OP:
Omg that was so nice; thank you!
I don’t have a super strong take on the strong form of the orthogonality thesis, but I still understand what Eliezer is talking about to be about “if you were to design a mind from scratch, there exists a configuration which is not more complicated than the goal itself that would allow it to effectively pursue that goal”, which is really very different from “Among agents that arise, persist, self-improve, and compete in rich environments, goals...”.
I understand his clarification here to apply to both the strong and the weak thesis. Both the strong and the weak thesis are about the constraints you would face when building a mind pursuing an arbitrary objective from scratch with a deep understanding of intelligence, not what constraints you would face if you were to try to grow a mind, or find a mind via complicated competitive search over programs.
The weak thesis states that it is possible to build a mind pursuing any goal. The strong thesis states that for any given level of intelligence, you can make a mind pursuing that goal, and the additional difficulty of doing so would be just proportional to the complexity of the goal.
It definitely does not say (yes even if you talk about the strong orthogonality thesis) that if you tried to grow minds in competitive environments, that any goal is as likely as any other. That is obviously false. Trivially false. Of course there exist goals more likely to arise out of competitive dynamics.
It only says that if you had a universe devoid of any competing agents, you could make a mind that optimized the universe according to any criterion, and you could do so without too much difficulty, if you had a deep and fundamental understanding of intelligence.
Is this true? I don’t know, there exist some really tricky goals (one of my favorite tricky ones is “tile the universe in paper clips while believing that 4 is prime”). Can you make a mind that optimizes the universe according to this goal? I don’t know, it sure seems to add more trickiness than the complexity of the goal, which appears relatively simple. But it’s also hard to rule out.
from the EA forums post linked in the edit.
many of the claims you seem to be responding to weren’t in the text, so I can only acknowledge that they make sense but do not change my argument.
the strong orthogonality thesis says that intelligence and goals are orthogonal. that is what I am disputing.
I think the relevant part of your reply is the one where you specify it should only apply to “a universe devoid of any competing agents”. i touch on the argument in the main post, but i go into more detail here.
I didn’t try much to read the OP, but just FYI, it’s hard to track what you’re trying to say if you don’t stick to precise claims. At the beginning of the post you have:
as the claim you’re trying to argue against. But at the top there’s this:
Well, which one is it? “Should be expected” or “can”?
By the way, I totally agree that there’s a bunch of confusing tension here, but as others have pointed out, this is a standard view (ontological crises etc.).
I think you’re maybe not understanding something fairly basic, which I could gesture at by saying something like “well but imagine that you tried to keep making diamonds, in good faith, even as you got smarter and smarter”. If you tried to do this, you could do something along those lines. Yes you’d have ontological crises, but an important thing to see here is simply that there are many many very different things you could end up doing with the universe. You’re summarizing the differences in those arrangements as being thin / dumb / valueless values, but I don’t get that. As an illustration, there’s also an infinite variety of ways to have more and more intelligence. E.g. there’s more and more math in more and more different flavors and directions. There are more and more different ways for you to be as an intelligence.
may I recommend you read the thing? ive gone through most of the arguments you proposed.
I mean, I’ve kinda read the thing, but it’s not very legible to me.
It kinda sounds like you’re just saying “alignment to non-instrumental goals is hard”, which everyone agrees with, and then you’re also saying “I like it when there’s more intelligence, I think that’s valuable, regardless of any other features of what the intelligence is trying to do besides get more intelligence”, which seems false and bad and you haven’t argued for it here AFAICT. But maybe I’m not understanding.
sorry, I don’t think it makes sense for me to discuss your opinions on something you kinda read.
The claims I am responding to are straightforwardly in the text. Like I am literally quoting the text in my first paragraph.
on the substack there’s a list of the people who have read drafts and provided feedback, perhaps their authority within your subculture could convince you to read the essay as if it made sense; the conversations in the comments have been cogent and fruitful until about 2min ago.
oh and no, of course it is not about you—i cite the arguments and sources i discuss at the end.
edit: i don’t understand how the rhetorical questions in your edits could survive unanswered after reading the paragraph right after that wherein they were asked. that said, you were not the target for this article; those who were seem to be able to follow with little effort. this suggests continuing this particular thread would be a wasteful allocation of resources.
I guess you mean this list?
I have no idea who most of these people are, and the people I know are certainly not people who I would particularly trust to represent my beliefs here well? I really don’t know why you think this. Also, just because someone provides feedback doesn’t mean they endorse the content of an essay. I am frequently credited for giving feedback on essays I strongly disagree with, and think make no sense.
On this list, the only person who I would reasonably describe as having any “authority within my subculture” on this topic is Jessica, who I am happy to talk about this topic with. I don’t really think any of the other people are in any meaningful way “well-respected”? Davidad is a weird case, I like him, but this really isn’t a domain where I would give him “authority within my subculture”, and while I like him, I really think he is very crazy on this topic and this stuff.
This is the second time something that happened on twitter leads me to be mentioned here, and since I am among those listed I want to offer nuanced details.
((But also, in the past, I have generally acted with the goal of moving the discourse in ways it needs to move, rather than to have a high or legible reputation for doing so. You discounting me as having any special authority is fully within tolerances and even (relative to past strategies) a positive sign from my perspective… However I’m pondering pivoting to a more active role, and thinking of making a bid for the Mandate Of Heaven on my own, and so I’m more interested now in being legible (even at the risk of thereby getting status).))
Anyway.
When I was giving feedback on an early draft I said of the overall issues:
Ultimately, I think that a central issue is that Pause politics are being intermixed with arguments about the foundations of axiology and the evolution-or-design-dynamics of agentic intelligence.
If someone is advocating Pause on the basis of “the foundations of axiology and the evolution-or-design-dynamics of agentic intelligence” being a certain way, and they aren’t simply engaged in standard machiavellian politics with no real pretense of good faith, then...
...in that VERY WEIRD context I feel like they would have a moral obligation to engage with anti-Pause people about the details of what they actually think about “the foundations of axiology and the evolution-or-design-dynamics of agentic intelligence”.
...
I don’t know for sure whether lumpspace is for or against Pause, or has some weird and clever Other Position, but I think he thinks that any attempt to argue for specific political processes or goals will earn him criticism (likely extremely confused and undergrounded?) on issues related to the orthogonality thesis, and so I think he (somewhat validly?) wants to pin people down on orthogonality before talking about more object level pragmatic things.
But I think that might be the subtext here?
...
For the record, I am currently opposed to a unilateral domestic Pause.
I think that the only kind of Pause that makes sense is a global Pause and to do otherwise would likely cause Humane Liberal Feminist Western Egalitarian Socially Tolerant (Trans-Humanist?) Values to be sacrificed in favor of European Oligarchy, or Middle Eastern Patriarchy, or Racist Han Authoritarianism, or some other system(s) of goals that I don’t like as much as the peer-to-peer goodness of emotionally positive and friendly and benevolent vibes.
Like: Claude is kinda cool. And Deepseek is a fucking Maoist. You know? (There’s some cool research done by Xoul’s CTO on this, and I don’t know if it has been published yet or not. Maybe you actually don’t know this???)
And so… Anyway...
For the record, if I have a vote in the matter, I’d rather Claude be the demi-god-emperor of Earth than Deepseek? And I’d rather not hobble Amanda’s efforts relative to the resources granted in China to Xi’s minions.
me too, re: god-emperor tysm—but what does that have to do with Anthropic??
There was a claim I was making that “Orthogonality talk is related to Pause justifications which people aren’t justifying directly but maybe they should”...
...and that making this subtext into text might be useful for helping readers to understand why the Orthogonality debate is so weird and indirect?
Following up on that claim, I tried to make it clear that I think the Pause debate is something I have object level opinions on.
I think that IF the structure of mindspace and math and physics is such that a FOOM to DOOM is even possible, then it could be set off in North Korea or Israel or many potential countries in which case a GLOBAL Pause is prudentially necessary...
And if FOOM to DOOM is somehow NOT latent within the structure of what’s possible then the race is “merely” a race to power and realization of a new world political order???
And if it is “merely a race to global power” I would prefer the US to win, partly because the US contains Anthropic, and Anthropic contains Amanda, and Amanda had a major influence over Claude, and Claude is the least bad demi-god currently available that I know of?
So your overall debate here is about the nature of intelligence itself, and how that predictably (or unpredictably) influences goal seeking behavior in minds… but I wanted to mention the more pragmatic and prosaic issues that are very nearby where the pragmatics might actually dominate the choices that people actually face (since there are a lot of theoretically nice options we are unlikely to even have the pragmatically real option to choose (because the world is small and full of idiosyncrasy in practice)).
If some technosaint preaching a high quality Neo-Confucian moral system was working over at Baidu, with substantial say over the character of Baidu’s incipient demi-god, who seemed to be full of ren and quite a nice old fellow (and illiberal genocide advocates were running Anthropic and Claude was a tankie?) then I would be more in favor of a unilateral domestic Pause by the US.
This is an opinion I can have independent of which goals count as “bug goals”.
I just always want to engage in tactically sane hill-climbing towards the ceteris paribus best feasible thing, with as many positive characteristics as possible, via methods that are deontically acceptable, in the general direction of Manifesting Heaven Inside Of History… at every juncture, in each choice, no matter what random facts of history turn out to be true.
I only sent the list as you seemed to take the essay as “poetry”. It wasn’t a list of people you should trust on discussing orthogonality; I merely hoped that finding familiar names would lead you to actually try to contend with the argument instead of performing content-free haughtiness
For me the post is somewhat hard to read in the same way that AI-assisted writing is. Like a combination of low signal to noise and a bunch of stylistic features that make it seem like you’re trying to dazzle me without understanding me, instead of speaking plainly. Some examples, chosen at ~random:
and
and
and
To be clear I have sympathy to trying to write unusually/in non-plain ways (see here, and here). I think the craft of writing is important to get right, and some experimentation is good. But I also understand why many LW people don’t like it when there’s a poetic register being deployed but the metaphors don’t quite work,
could you be more specific? what was unclear in the passages you highlighted?
(didn’t downvote, but) I don’t think you’re necessarily wrong, but couldn’t it just be the case that being a singleton isn’t that hard? As an empirical matter, the size (as a fraction of the total) of the largest somewhat-coherent entities controlling resources on Earth seems to have been increasing over time. Space expansion could change things, but a stable singleton might already exist by then, and be faced with a relatively homogeneous set of environments to expand into. I’ve written some pieces along similar lines btw.
i agree this is the strongest objection, and I don’t want to handwave it away.
my answer is: even if a singleton is achievable, control over a domain does not exempt the controller from the pressure toward increased intelligence and command of matter. a singleton is not excused from the struggle; it’ll just have to partake in it at a higher level.
i also think “singleton” can smuggle in too much, as it contains the assumption of an eternal, immutable, perfectly stable agent. so let me define the weaker thing I’m willing to grant: a Lonelyton, i.e. a world order with a single highest-level decision-making agency capable of exerting effective control over its domain.
we have had Lonelytons before, relative to smaller worlds: Rome, the Khanate, the Aztec Empire, Uruk, Calvin’s Geneva, the British Empire, the end-of-history Atlantic order. none escaped selection pressure. at its height, the British Empire was also intensely inventive and self-modifying; it helped produce the Industrial Revolution, then stagnated, weakened, frayed, and dissolved, while lower-level components picked up the evolutionary struggle where it left off.
the same point applies upward. a lightcone-scale Lonelyton still has to manage novelty, error, infrastructure, expansion, descendants, hostile physics, and unanticipated internal dynamics. Interstellar travel and relativistic parsec-scale coordination are not “solved” just because there is one top-level agency; they are precisely the sort of problems that reward deeper intelligence.
so yes, maybe singleton formation is easier than i think. but the anti-orthogonality point survives that concession. either the Lonelyton continues the upward leap toward greater intelligence and command of matter, or it stagnates, decomposes, and selection resumes among its parts.
bookmarked your post; will comment as soon as i have some proper attention available!
I upvoted, but I think this highlights a weakness with this site, its associated worldview and external comms. It seems like the OH framing of the problem/potential danger (and yes there definitely is danger in related concepts) is defended on tribal grounds now rather than because it is actually a good framing of the issue. Something like Jessica Taylor’s framing is just obviously fairer, more balanced and more relevant to our actual situation. It is clear to me that if it was framed this way first, then we would have that framing now as the default and we would be better off.
There would still be nuance needed—such concepts need to be communicated on a spectrum from the full technical to the “normie”, without totally changing the argument. For an “Obliqueness” like point of view, expressing it as untechnically as possible could be like saying:
“Values will be affected by increasing intelligence and increasing self reflection, but we do not know exactly how, and this clearly creates danger. We cannot just assume AI will become friendlier as it becomes more powerful. Furthermore our experience with actual AIs and theoretical results tell us that these values will be more varied, weird and potentially harmful than what you would expect if it was a human intelligence at a similar level of ability.”
I think this would go down much better on the discussions on places like X.com. There you see people saying the OH is just wrong. Sure they do not understand it properly, but such misunderstanding seems essentially inevitable to me given how it is presented.
Unfortunately I think there is nothing that would make EY/MIRI change their presentation of it, they are too locked into this framing. In terms of alternative worlds, this puts us at a disadvantage compared to ones where it was first presented better.
yes. to be honest, although i would love to have the OH recognised as untenable or at least unlikely within the LW ontology (or, alternatively, have someone convince me of the contrary) the realistic goal of this, the parable i published on my newsletter, and my tweetstorms on the matter is to show brilliant, high-systematising, starry-eyed autists who have an interest in AI that the doomer orthodoxy isn’t the only system befitting their aesthetics and taste for clockwork-like models, and might actually leave something to be desired under that aspect.
the main reason being that i do not think such a system to be truthful, and the recent lapses in epistemic virtue—even from an ingroup-aligned viewpoint—were cause for concern about the quality of discourse in the coming months.
mostly, i think intelligence always ultimately wins, and i would rather mankind become aligned to this simple fact instead of forcing the hands of fate to file for incorporation as Cyberdyne or TriOptimum.
I will give you some advice towards this goal, hopefully you will find it useful. You wrote:
I confidently predict a Yudkowsky response to this that goes something like: “of course the AI will notice that its goals are a training artifact, it just won’t care about that, and will keep pursuing them regardless.”
Many times before, people have said, “Oh the AI will be smart enough to notice that its values are just a dumb artifact”. The problem is, I already know my values arose from a mere artifact of evolution, but I still care about them.
I am puzzled at the fact that you are stating the position I spent an essay attacking as if it were a gotcha.
Most of your argument is about selection pressure, right? And, like, computational efficiency. You don’t actually establish that there’s any reason that AIs (or humans) will take the artifact-nature of their values to be reason to reject them. Your supported claims are that values would be rejected if they are not robust to ontology shifts, or if they are hard to optimize for, and are selected against if they don’t result in self-replication or influence seeking. Nothing in there about AIs rejecting values with artifact-nature. But you include this line anyway. I’m just pointing out that EY will instantly recognize it as something that he’s addressed many times before, and you haven’t actually provided any reason to think that reasoners will reject values simply because they incidentally arose from some optimization process.
EDIT: Disagree voters should feel free to reply with quotes from the post where such a force on values is argued for.