By “original” i mean: the goal that was given to the agent in question before it became ASI, such as “make paperclips”.
By “dumb”, “thin” or “stupid” I mean: basically, anything you can conceive that isn’t intelligence optimisation. Compare, for instance, the goals that australopithecines could have had with the ones we have. Do they seem less complex, and more pedestrian? That is what I mean.
As a data point, no one else seemed thrown off by this, and “simple goals”, for instance, is a term used by Bostrom himself when introducing the paperclip maximiser.
Ok. I’m done talking with you, but I suggest that you consider that your concept of goal is not sufficient to think clearly about these things. If you need someone you trust to tell you that, maybe ask Jessica.
I disagree, I don’t think Lumpen’s concept of goal is not sufficient to think clearly about these things.
No empirical agent is a VNM agent, because of bounded rationality. The idea of a “goal” has a crisp meaning in VNM but not as much for bounded rational agents, which have a differently structured psychological profile than a clear goal / belief factorization.
As far as I understand what original goal Lumpen is talking about, the dung beetle story linked at the top would give an example: a bounded agent starts with a psychological profile that is approximated as directed towards goals that are native to dung beetles. Lumpen believes that as this agent enhances its intelligence, its goals will drift. This seems like a plausible claim and I’d basically agree with this.
Then there is the analogy to AI where for example we could think of RL objectives, goals that are inferred from behavior, approximations of the neural net in terms of closeness with a goal-directed prediction, etc.
So I think part of what’s happening in all this discourse is that if you say
a bounded agent starts with a psychological profile that is approximated as directed towards goals that are native to dung beetles. Lumpen believes that as this agent enhances its intelligence, its goals will drift.
Then on the straightforward interpretation, more or less everyone agrees. This is basically saying “alignment doesn’t just happen by default”. Another plausible interpretation of these words, though, is “… and this (goal drift) will continue to happen”. This says something like “alignment is impossible”, or maybe “alignment is impossible to certain types/shapes of goal”. I strongly doubt that alignment to my goals is impossible. I think it’s pretty likely alignment to certain types/shapes of goal is impossible/incoherent. I still have no idea what he thinks, or what you think for that matter.
I assume it will continue to happen for a while? Maybe there is a point at which the agent “solves alignment” and freezes its goal? Or maybe there’s always some obliqueness from bounded rationality, where there is a selection advantage in goal drift to be more natural for one’s current intelligence level?
I’m not really sure but if we imagine that goal freezing happens at intelligence level X then presumably the goal is decided by an agent at intelligence level X and is reflective of that, and optimized by an intelligent agent, it’s not some random thing humans could have thought of like “maximize paperclips”, it was decided based on directed considerations.
Not sure how cruxy this is; it seems like people might already agree that hypotheticals like paperclip maximization are unrealistic, and that there isn’t a true version of “orthogonality” strong enough that paperclips etc would be realistic.
FWIW, my actual guess is that what I’m trying to actually mean by “goal” is less like an outcome and more like a bunch of flavors/constraints on ways of being, which is very visible in terms of determining ultimate outcomes (and makes the stakes of alignment to human values meaningful, real, and large).
I pretty much don’t buy at all the reasoning that performance pressure is what makes goals be complex, apparently changing, or hard to pin down, and I don’t really see much argument for that.
there isn’t a true version of “orthogonality” strong enough that paperclips etc would be realistic
I mean, IDK what “realistic” is doing here. I think that in fact I could become a lightcone-eating paperclip maximizer, if I wanted to, which I don’t. Like, yeah there are ambiguities I’d have to resolve, but so what? That’s just the complexity of the goal, priced into the strong OT. There’s obviously paperclippy worlds and non-paperclippy worlds. Just pick something.
I think you’re trying to say “yeah but it’s not a coincidence that you don’t want to do that”, and I totally agree, but that’s not relevant to the strong OT, which I take to be talking about after alignment.
.… Is that part of the miscommunication?? I agree that the strong OT doesn’t really hold pre-alignment (i.e. before the AI solves its self-alignment problem)! Is that the claim?
my actual guess is that what I’m trying to actually mean by “goal” is less like an outcome and more like a bunch of flavors/constraints on ways of being, which is very visible in terms of determining ultimate outcomes (and makes the stakes of alignment to human values meaningful, real, and large).
that is my guess as well, and i proposed that such “bunch of flavours/constraints” will ultimately converge towards “acquiring more intelligence”—but i don’t want to contest this here.
what i would like you to note, tho, is that that “bunch of flavours/constraints on ways of being” is intelligence-gated. you can imagine a baboon having a goal of “make paperclips”; less so one of “understand and manipulate the fundamental laws governing your universe”. this comparison should shed light on what i mean by “dumb goals”.
I think that in fact I could become a lightcone-eating paperclip maximizer, if I wanted to, which I don’t.
“if i wanted to, which i don’t” is central here, and my basic point. there are levels of intelligence after which goals of such dullness are perceived as meaningless, and selection nudges towards increased intelligence. i really think you could find a second read helpful, now that the basic misunderstandings seem dispelled.
I pretty much don’t buy at all the reasoning that performance pressure is what makes goals be complex, apparently changing, or hard to pin down, and I don’t really see much argument for that.
We could imagine that the environment has a score function on genomes (“fitness”). Genomes also encode phenotypes including ones with goal content. Evolution seems to be willing to spend ‘effort’ on encoding goal content in genomes. This is of course due in large part to the fitness gradient. If the goal relevant part of genomes were able to max out performance pressure with a simple thing, then it seems like evolution would have found that a long time ago, instead of continually spending complexity.
I assume this recurs elsewhere. Like different people / societies end up with different science in a way entangled with their goals/values. (Of course it is not really clear how to divide up instrumental and terminal goals in a lot of these cases! I’m talking about a more general ‘intentional stance’ idea that can accommodate bounded rationality without implying unjustified specificity.)
A more interesting line of reasoning that’s kinda related but not that much, is about what I’m trying to mean by “my goals”.
Haven’t looked in depth at the posts but it wouldn’t contradict my view that people think they use their cognition in part to determine and/or specify and/or elaborate on their goals. It’s related to a general process where people figure out how to translate what they mean to more formal and accurate language, and has analogies with other things like beliefs/intuitions/feelings.
I think that in fact I could become a lightcone-eating paperclip maximizer, if I wanted to, which I don’t.
So I think this is an example of a discourse pattern that I (and I assume also Lumpen) find annoying. Using the word “can” like this is skipping over so many details that it has to be interpreted as a spherical cow model. There are so many ways it’s not going to happen if you just decide to try. Not just the psychological constraints on what you can try to do, but also the technical alignment problems. Like maybe you build a system that has inner alignment failures (where the optimization daemons sample from a different distribution over goals, one reflective of smarter agents). Or maybe you die of natural causes in the course of trying and then something else has more influence over the future. Like, of course the statement is not precisely true, and it’s a spherical cow model, and you need so many corrections to get to something realistic. Yet the orthogonality thesis is sometimes defended as a technical thesis, something obviously true, etc etc.
I get that you want to say “sure that’s just alignment being hard”. The thing is I’m not even sure how to formulate alignment, think of what intelligence it happens at, think about whether it’s even possible, etc. I can do math about VNM optimizers but realistic agents are bounded rational so it’s not very clear what their (intentional-stance) goals are, background cognitive architecture, whether orthogonality holds relative to that architecture, etc.
I see some of this as reason to actually question orthogonality. At least, to question the idea that it’s “obvious”, that it’s a technically true (as opposed to spherical-cow) idea, etc. The “can” claim that is supposed to support it does not actually hold. If realistically you get optimization daemons with intentional-stance goals sampled from distribution G when you try (and succeed harder than usual instead of just dying and totally failing) then it kind of looks like there is a non-orthogonal tendency towards G, rather than “orthogonality is true, and alignment is hard”, although maybe this is basically a semantic disagreement?
(I would predict, and I think Lumpen would more strongly predict, that the most probable optimization daemons in an inner alignment failure would not be maxxing paperclips or something similarly materially simple.)
Right, so, I think there are a number of different “can”s, and that is confusing the discourse. I’ll locally intend to comment minimally or not at all regarding discourse patterns, except to say that, indeed, many of the terms here, such as “goal”, have the same problem.
Here’s a parametrized Can (/Cannot) one could claim:
It’s {easy, feasible, doable with a bunch of work, maybe doable with a bunch of work, theoretically logically possible} for A1 to align A2 to G—i.e. to make it be that A2 pursues G in the long-run.
If A1 = A2 and G is A1′s goal, this is a self-alignment problem.
Each of A1 and A2 may be in various states of maturity, e.g. anywhere from human to lightcone-controlling superintelligence. If A2 is immature, then A1 has an additional challenge: A1 has to grow a lot, getting smarter and changing a bunch of stuff.
It’s {theoretically logically possible} for A to be aligned with G.
See for example
This assertion [the strong OT] would be false if, for example, an intelligent agent that terminally wanted paperclips was limited in intelligence by the defects of reflectivity required to make the agent not realize how pointless it is to pursue paperclips; whereas a galactic superintelligence being paid to pursue paperclips could be far more intelligent and strategic because it didn’t have any such defects.
When you say
I see some of this as reason to actually question orthogonality
I’m not sure if you’re trying to question that. lumpenspace’s post explicitly says he is not questioning that and agrees with it (“I concede the first point entirely.”).
When I wrote
I think that in fact I could become a lightcone-eating paperclip maximizer, if I wanted to, which I don’t.
I don’t think this bears on the OT, except that it’s a good argument for the OT (it’s one of the leading arguments in the original article). It’s a different claim, clearly distinguished. The OT is about “terminal” goals (which by the way I would agree is a problematic concept, but I highly doubt it’s so problematic that the OT reasoning stops largely applying).
It seems to me that you’re simply moving on from the OT to different claims, such as:
It’s {feasible} for A1 (a human research group) to align A2 (an AI) to G.
or
It’s {feasible} for A1 (a human) to align A2 (the same human) to G.
So like, I agree that it’s quite unclear what sorts of goals G make these statements true. E.g. paperclip maxxing, human flourishing, etc. I also agree that convergent instrumental goals terminalizing is a reasonable a priori guess, insofar as that’s the easiest way for a designer (evolution, an AI training process) to stick in open-ended things like curiosity.
I wonder if you’re wanting to broaden the OT because you believe it’s being used to argue for some other proposition X, and then you think, well, for OT to support X, OT would have to be broader than just logical possibility; and then you argue against that broadened OT? Is something like that happening? Do you know what X is?
I’m not sure if you’re trying to question that. lumpenspace’s post explicitly says he is not questioning that and agrees with it (“I concede the first point entirely.”).
Lumpen concedes weak orthogonality, not strong orthogonality. Look at the title! See also the obliqueness post for my views on strong orthogonality.
I wonder if you’re wanting to broaden the OT because you believe it’s being used to argue for some other proposition X, and then you think, well, for OT to support X, OT would have to be broader than just logical possibility; and then you argue against that broadened OT? Is something like that happening? Do you know what X is?
OP discusses “empirical orthogonality” which is a stronger idea.
In context why did I ‘broaden’ OT? I said: “there isn’t a true version of “orthogonality” strong enough that paperclips etc would be realistic”. It doesn’t seem like you strongly disagree; you say it’s unclear which goals G would be feasible to align to under different circumstances. Originally it seemed like you disagreed because you were saying “of course I could make a paperclip maxxer if I tried!” so that’s the argument I was attacking.
I’m not really sure how to give a specific X here because there are a lot of times when there is a discussion around “but the AI would adopt some complex interesting goal, not something random like paperclips” and then people are like “but orthogonality thesis!” and that is the sort of OT I want to criticize, it is being used to make inferences not justified by weak orthogonality.
Sometimes AI safety discussions assuming an orthogonalist background just kind of… hurt to read? I like reading Nick Land’s thoughts on orthogonality, they accord better with my intuitions. There is a reason why a lot of people encountering the discussion early on object intuitively to the orthogonality thesis, having to do with intuiting that intelligence has some direction to it, that it is not a pure instrumental means separated from ends. I think there are a lot of “well ackshually”s in response and weak orthogonality usually doesn’t support the “well ackshually”, because the intuition could be recovered as a probabilistic correlation related to mind architectures rather than a logically necessary connection.
I’m not really sure how to give a specific X here because there are a lot of times when there is a discussion around “but the AI would adopt some complex interesting goal, not something random like paperclips” and then people are like “but orthogonality thesis!” and that is the sort of OT I want to criticize, it is being used to make inferences not justified by weak orthogonality.
Ok gotcha. This sounds plausible; I’m simply not plugged in and can’t comment.
I suppose a suggestion I’d offer would be to keep your ears open for instances of that, and then remember one or a couple of them; then, when trying to discuss the “extended / empirical OT”, bring up one or two of the examples. That might help make it clear what you’re responding to, what it means, why it matters, etc. I think it’s pointless / very distracting to try to rewrite the OT; unless there’s some problem, just dub a new thing, like “empirical orthogonality”, and stick to that. I appreciate that the OP did that… but then the post goes on to use that term 1 time, and also use the term “strong orthogonality” twice (I think synonymously?), and that’s IN THE TITLE. I’d suggest just sticking to “empirical orthogonality” or “extended orthogonality”.
An additional issue here is that, while I’ll go ahead and agree with a lot of the claims, I’ll also strongly disagree with claims that you might be making in the background. For example, I don’t know if you agree that there is much of an important difference between an ASI having [actually feasible reflectively stable long-term terminal humane-aligned goals] vs. having whatever an ASI would have. It sometimes seems like you’re relying on an “extended anti-orthogonal thesis”, which is that it doesn’t matter whether an ASI is aligned with humane values, or that an “unaligned” ASI would be good. I don’t have an example though, ahah. Anyway, this makes me want to argue against those claims, even if you and/or lumpenspace retreat into your Motte.
Well, in lumpenspace’s case I have an example from the post:
Doom arguments usually need the systems we actually build to achieve radical capability while preserving misaligned and, crucially, completely stupid goals.
What on earth is that about? Also all the stuff about “valueless”, eg.
this is also why i also reject the invitation to distance myself from land’s cheering at superintelligence ultimately desiring more intelligence and agency, a universe organized around paperclips is valueless because paperclips are dead residue. a universe organized around increasing intelligence, complexity, agency, and world-model depth is the only process we know that can generate new value.
For example, I don’t know if you agree that there is much of an important difference between an ASI having [actually feasible reflectively stable long-term terminal humane-aligned goals] vs. having whatever an ASI would have.
Unclear, I don’t know what “important” would mean here, and similarly for “terminal humane-aligned goals”. I guess this indicates I have a revealed preference to not place a lot of verbal importance on the difference. I imagine maybe in other cases of more concrete statements like “there is an important difference between punching someone who is not attacking you, and punching someone who is attacking you” I would just agree, I would think there isn’t a way I would be misunderstood about what “important” means, whereas here the semantics seem too unclear for me to agree/disagree.
a universe organized around paperclips is valueless because paperclips are dead residue. a universe organized around increasing intelligence, complexity, agency, and world-model depth is the only process we know that can generate new value.
I think if you are interested in understanding this perspective it might help to read some of Xenosystems and especially the essays “What is Intelligence”, “Intelligence and the Good”, and “Stupid Monsters”. It seems like Land and Yudkowsky would agree that human values came about in part because of intelligence mesa-optimizing versus evolutionarily instilled drives. The disagreements seem to be about the descriptive and normative extrapolations.
Regarding “Intelligence and the Good”, would you mind summarizing in a sentence or something what you might suggest I could take from it? I’ve read it a couple times and I think I understand fine what it’s literally saying, but I’m not seeing how you meant for it to help. Are you mainly just saying that it fleshes out a bit more the perspective that “an intelligence explosion is good”?
I agree with the essay’s literal propositional assertions, I think. I also agree that it’s good for humans to get much more intelligence (and I have plenty of track record on that). I strongly disagree with the not propositionally asserted (I think) but obviously in the background viewpoint that an intelligence explosion is necessarily or even likely to be good, i.e. something I or anyone does or should want. Increasing human intelligence is good because it’s in the context of human souls.
Regarding “Stupid Monsters”:
abstract intelligence is indistinguishable from an effective will-to-think. There is no intellection until it occurs, which happens only when it is actually driven, by volitional impetus.
(Except, “indistinguishable” is way too strong, probably, IDK. I would agree with “probably heavily overlapping / entangled with”. Also I’m not actually that sure what “will-to-X” is supposed to mean here.)
Can we realistically conceive a stupid (super-intelligent) monster? Only if the will-to-think remains unthought. From the moment it is seriously understood that any possible advanced intelligence has to be a volitionally self-reflexive entity, whose cognitive performance is (irreducibly) an action upon itself, then the idea of primary volition taking the form of a transcendent imperative becomes simply laughable. The concrete facts of human cognitive performance already suffice to make this perfectly clear.
I don’t really get this. It kinda sounds like he’s saying “intelligence has to be a terminal goal; therefore other things can’t be a terminal goal”. Is he applying a strong mutual-exclusion principle on goals, based off selection pressure / competition / taxes / etc.? I think that’s false, but if that’s an important point to this perspective, a good argument for that would be helpful (the OP here is not a good argument for that IMO haha).
The long absence of large, cognitively autonomous brains from the biological record—up until a few million years ago—strongly suggests that mind-slaving is a tough-to-impossible problem.
(This maybe doesn’t matter, but, not really; the strong default is for organs to be minimal, especially expensive ones; it’s a kinda interesting hypothesis but not that plausible-seeming; other obvious hypotheses include diminishing returns to investment in brains until some specific fitness cliffs were fallen off from by our ancestor species. E.g. if you’re not social, you don’t get cultural downloads, which means you’re mostly inventing stuff yourself, which is not very efficient beyond the low-hanging fruit.)
What it can’t do, evidently, is anything remotely like paper-clipping—i.e., cognitive slaving to transcendent imperatives. Moses’ attempt at this was scarcely more encouraging than that of natural selection. It simply can’t be done.
This, and the essay overall, sure sounds like it’s asserting that alignment (to G other than “get more intelligence”) is impossible. (Its main argument is “evolution failed”, which is of course a central argument also adduced by X-risk worriers...)
We even understand why it can’t be done, as soon as we accept that there can be no production of thinking without production of a will-to-think. Thought has to do its own thing, if it is to do anything at all.
More goal-exclusion-principle-sounding statements.
So, to be clear, I’m open to some significantly less strong propositions that I could see you people misconstruing as this strong goal-exclusion. For example, many kinds of goals require as background an open-ended growth of the mind; or to say it another way that you may be more amenable to, many kinds of goals are different flavors of “get smarter”. For example, wanting to be friends forever is like “let’s both continue growing forever in a way that’s fun to keep playing off each other”. Fun can’t be stagnant. But I think this very much does not imply strong goal-exclusion.
Regarding “Intelligence and the Good”, would you mind summarizing in a sentence or something what you might suggest I could take from it?
You had a “what on Earth?” reaction to Lumpen talking about intelligence being good unlike paperclips, so I thought it was relevant as a perspective on why intelligence might be prima-facie a good thing unlike paperclips (ofc extrapolating to intelligence explosion is harder). In particular the relationship between intelligence and openness, contra negative-feedback traps.
Increasing human intelligence is good because it’s in the context of human souls.
Yeah I disagree here but moving on...
Except, “indistinguishable” is way too strong, probably, IDK. I would agree with “probably heavily overlapping / entangled with”. Also I’m not actually that sure what “will-to-X” is supposed to mean here.
Agree re: too strong. Will-to-think as a phrase references his essay, “Will-To-Think”, which is also relevant as commenting on the same general area.
It kinda sounds like he’s saying “intelligence has to be a terminal goal; therefore other things can’t be a terminal goal”. Is he applying a strong mutual-exclusion principle on goals, based off selection pressure / competition / taxes / etc.?
The kind of situation he thinks is unlikely is one where an agent has an arbitrary/stupid terminal goal, and has giant intelligence organized all around that. What he is saying is that for the system to be intelligent, it needs to decide to be intelligent. It couldn’t be intelligent if due to its terminal goal, it decided to not increase its intelligence. The volition to think needs to be a drive, though doesn’t in principle need to be a terminal drive; it cannot be defeated by some other drive and the system still be intelligent.
It would be possible to weaken this to the kind of claim you agreed with earlier (dung beetle value drifts because alignment is hard). I’m interested in a possible intermediate statement. The kind of situation I imagine is that there is a multi-component mind and one of the components is the “utility function” component which uses some simple rule to score representations of possible futures. That component could stay stupid while other components get smarter. It seems now easy to imagine that the other components could develop their own drives that end up steering the system more than the “utility function module”. They could route around the utility module and cause dynamics that pursue ends set by the more intelligent parts of the mind. This could map to an “inner alignment failure” in MIRI ontology. As he discusses later, there is a possible analogy with evolution, where humans have something like a reward module set by evolution, but do not always act according to it.
Of course the MIRI theorist can say “well yes I agree inner alignment is hard, and it is likely that early AGIs would not hold to their original terminal goals, and instead they would get smart and then only later settle on a terminal goal; it is just not my opinion that the terminal goal is by default going to be set by a stupid system and continue to be held to by smart systems” and this is a partial agreement/disagreement with Land.
other obvious hypotheses include diminishing returns to investment in brains
Yeah I don’t have a strong opinion on the biology here, am guessing you’re more correct than Land.
Overall I suggested these essays because you had a “what on Earth?” reaction to things Lumpen was saying and I think these essays suggest more context to the background worldview on why it might be plausible that valuable things come from intelligence and processes that increase intelligence, and that there isn’t a clearly better account for where valuable things come from.
What he is saying is that for the system to be intelligent, it needs to decide to be intelligent. It couldn’t be intelligent if due to its terminal goal, it decided to not increase its intelligence.
Hm. Is the syllogism something like (I’m being sloppy with wording but)
Alignment to G is impossible.
Therefore, permanently pursuing G requires not getting smarter.
Goodness comes from getting smarter.
Therefore alignment is bad.
And then this could be softened to like “alignment is hard, so it cuts against increasing intelligence, so it’s kinda bad”?
1. For a wide variety of G, aligning to G would prevent getting smarter.
2. Goodness comes from getting smarter.
3. Therefore, for a wide variety of G, aligning to G is bad.
But not if G = intelligence optimization (or maybe something highly compatible with intelligence optimization)
The main way to question 1 is the instrumental/terminal goal distinction. We could imagine that a paperclip maximizer is aligned to paperclips, continually decides to think / optimize its intelligence instead of paperclips up to a point, then towards the end of the universe, it starts paperclipping instead of intelligence optimizing. This is an edge case in the Landian schema, since it would have the will-to-think early on, but put some limit on it; and also there’s some disagreement about the plausibility of this case. (It seems instrumental / terminal goal distinctions would exist in some cognitive architectures, but it’s not clear that human brains are such an architecture.)
In the human-scale /acc case it’s more like ~everyone agrees that alignment would require slowing down intelligence, and the practical disagreement is elsewhere. There’s one perspective on 2 that is like “well yes human values in part came from intelligence optimization in evolutionary history, some of our values are our own intelligence deciding its own thing contra evolutionary drives, but also, intelligence is more like one ingredient and there are other ingredients that are basically random, we randomly got the good values”. And “we randomly got the good values” could either be a matter of luck on a moral realist account or could be because value is a relational concept and saying “we have good values” is a tautology because it’s just saying the distance metric between our values and our values is low. (But then Land objects that a tautological claim like this isn’t very compelling given there are symmetry-breaking factors of convergence across different minds… which can then be questioned on realism grounds and normative grounds etc etc)
I suppose sociologically, there is a directionality to technological progress which is associated with capitalism and intelligence optimization (this relates to Land’s “AI = capitalism” thesis), and different people decide to be more or less conditionally pro this. They might want to get off the train at some point due to having something to protect. There is some destination that they value more than the journey, and they want to slow the train down. (Or maybe steer the train differently, as the alignment theorists might want to put it). Given this a lot of people would relate to a prima facie consideration of “intelligence optimization good” and would differ in how compelling they find other considerations.
(“Random” isn’t how I would say it; it’s a meaningful part of our history; but this is interpretable only if you admit the created-in-motion valuations. It’s Yudkowsky’s “justification loop through the meta-level, not just a tautology” thing.)
But then Land objects that a tautological claim like this isn’t very compelling given there are symmetry-breaking factors of convergence across different minds… which can then be questioned on realism grounds and normative grounds etc etc
And Yudkowsky would reply that it’s not supposed to be compelling to arbitrary minds (including realistic ones), just to human / humane minds.
So like, if I tried to appeal to some values** in your mind, to get you to realize that you want to be anti-full-speed-ahead with AI, you (whoever’s receiving the message) would view that as the Cathedral trying to prevent your pursuit of intelligence in a way which is doomed to either fail, or else to succeed at permanently keeping the world dull?
** [quite broadly construed—generally, elements that would play a significant role in your ongoing self-governance (which one can have fun with the etymology of)]
Sorry, let me rephrase; it sounds like you and/or Land have chosen a disembodied / nonindexed viewpoint on values… or I mean, you know, applying the criterion of universality to values, and then dismissing nonconvergent values on those grounds? Like, why would “parochial values being good values because they seem good to you is not compelling because the reasoning doesn’t lead to convergence” or “parochial values being good values because they seem good to you is not compelling because different minds have different parochial values” be compelling? Sounds like a commitment to non-parochialness.
If so, why? Do you think it’s instrumentally useful to do so? I can kinda see how that would be reflectively stable ish, in some respects. (I don’t think it’s instrumentally useful, but that’s based on really using the means-ends evaluation where I say it’s instrumentally dumb because an AGI IE would trample your ends.) Perhaps you might reply “Sure, it’s instrumentally useful, but that’s not why I’m applying the criterion. I’m applying the criterion because intelligence is good, convergent things are intelligent, so I want to find what’s convergent”. But that’s grounding out “intelligence is good, overriding parochial goodness” in “intelligence is good”, which isn’t much grounding. You could say “Sure, it’s the same sort of justification loop through the meta-level”. And I’m like, ok, yeah, it’s maybe another sort of stable point, not sure; but I don’t get why you like that stable point, or at least, how you got there (or how you got to thinking that you’re there, or that it would be good to be there); and also it sounds like you think that equilibrium is supposed to be compelling to someone in another equilibrium (or you think the other one is less of an equilibrium).
So like, if I tried to appeal to some values** in your mind, to get you to realize that you want to be anti-full-speed-ahead with AI, you (whoever’s receiving the message) would view that as the Cathedral trying to prevent your pursuit of intelligence in a way which is doomed to either fail, or else to succeed at permanently keeping the world dull?
Perhaps? That’s a structural reading, different from the object-level argumentative reading. In many cases there are industries/governments who incentivize certain discourse patterns. So specific discourse moves could be instances of this pattern but it’s hard to judge except on a case by case basis.
or I mean, you know, applying the criterion of universality to values, and then dismissing nonconvergent values on those grounds?
This has to be at least in part semantic. I think some things are good and also I think some things are what I want and what I tend to pursue. And I don’t think these are the same concept. I don’t think it is tautologically the case that I tend to pursue what is good. I don’t think Land believes this about himself either.
I think Land and I can both say that when we say something is good, we are making a different claim than that we want the thing. It is unclear in other cases; you mention Yudkowsky’s meta ethics and I am not sure exactly how to fill in the blank. Perhaps Yudkowsky by “good” means what he would want on reflection? Or maybe he thinks “good” is CEV of humanity not just himself?
The symmetry-breaking idea has to do with ways of thinking and acting that depend on which considerations are more or less universalizable. So people can judge that some things are more universal-good than others and incline their behavior towards those which aligns their revealed-preference wants with what is universal-good in their view more or less. It doesn’t have to be a perfect correspondence.
Like, why would “parochial values being good values because they seem good to you is not compelling because the reasoning doesn’t lead to convergence” or “parochial values being good values because they seem good to you is not compelling because different minds have different parochial values” be compelling? Sounds like a commitment to non-parochialness.
I don’t think something is a good value just because it seems good to me. In other cases this is easy to see: I don’t think some numerical sum has some value just because it seems that way to me. Now of course this runs into philosophical questions about what “good” means other than seeming good to the speaker. (Yudkowsky discusses some self-ratification problems in “No License To Be Human”.)
Like for example, why would I disagree that intelligence optimization is good in the human case only because it is a human being optimized? For that statement to parse as correct to me, I would need to judge some intelligence optimization to be good in cases that a human is being optimized and not in other cases. But that doesn’t read to me as what I want. I think I care about humans more than other animals in large part because humans have better cognition than other animals. I think if dogs were as smart as people then maybe I would value them as much as people. I suppose here I am demonstrating a habit of mind and of speech that is explaining preferences in terms of other preferences and these tending towards universality.
But that’s grounding out “intelligence is good, overriding parochial goodness” in “intelligence is good”, which isn’t much grounding.
“Intelligence is good” matches what I feel is good better than “human intelligence is good”. Now of course one can ask “why” to that as a psychological question and then maybe part of what happens psychologically is that I evaluate things on how universal they seem and up-weight universalizable ones and then that affects my brain’s reward function and so I feel better about such statements. And Land explains more why he thinks intelligence is convergent and a universal tendency, and I vibe with that and that is a causal factor in my upvoting “Intelligence is good”.
I get that maybe if you wanted an ultimate “but why?” explanation you will be disappointed but it doesn’t seem like in your case you are in general giving ultimate “but why?” explanations to everything you want.
it sounds like you think that equilibrium is supposed to be compelling to someone in another equilibrium (or you think the other one is less of an equilibrium).
Yeah I’m not sure. I think some value systems fail at reflective equilibrium. Yudkowsky’s Löbian considerations point at some such failures. Land’s ideas point at possible differential stability conditions. I of course don’t want to make a universal psychological statement of compellingness, given that it’s more of an empirical question: how often, when people read Land/Yudkowsky/whoever, do they end up with tendencies towards some attractors of use of language like “value” and “good” and “intelligence” and so on?
I don’t think something is a good value just because it seems good to me.
Ok this is a fair response to what I asked, but it feels a bit beside the point, though maybe you don’t think so. Like, I agree that various tendencies toward universalizing are good/correct, and I agree that this, as well as other tools, are how you investigate and adopt differences between what seems good and what later is revealed to be good. But the question I’m trying to ask is like “how does this get you all the way to not wanting anything that isn’t universalizable”, if that’s your stance (? confused).
I think Land and I can both say that when we say something is good, we are making a different claim than that we want the thing. It is unclear in other cases; you mention Yudkowsky’s meta ethics and I am not sure exactly how to fill in the blank. Perhaps Yudkowsky by “good” means what he would want on reflection?
I don’t think I need to precisely say what I mean by good here, to make the point? Like, I’m saying that the non-convergent valuesy preferencesy free-choice-makingy goalsy goodnessy stuff can be self-ratifying, and probably is to a substantive extent in humans, and there’s nothing wrong with that; I’m unclear on your position, but I think you think that there is something wrong with it? Er, let me restate—I think you choose to not look for what is parochial self-ratifying valuesy stuff in yourself and help it self-ratify, and would avoid that? Or you think you do that? (Unsure, sorry if I keep asking the same questions.)
I think I care about humans more than other animals in large part because humans have better cognition than other animals. I think if dogs were as smart as people then maybe I would value them as much as people.
That’s an interesting thread. I’m curious how easy you’d find it to imagine beings with various functions from [how intelligent they are/become] to [how much you’d value them].
E.g. can you imagine a being that you’d value the same even as it gets smarter? I imagine that usually you’d view it as more and more valuable the smarter it gets?
Can you easily imagine a being that you’d value more as it’s smarter, but SLOWER than humans?
Can you easily imagine a being that you’d value more as it’s smarter, but ASYMPTOTING or NONMONOTONICALLY? (I imagine yes, because you could imagine a species such as humans or similar which, if a bit too smart, would by default Cathedral it up so hard that they permanently stop a foom?)
Can you easily imagine a being that you’d value more as it’s smarter, but FASTER than humans? (I would weakly predict yes, because you’d view a fooming AGI as being good, and likely to grow less constrainedly than humans? Unsure.)
Can you easily imagine a being that you’d value LESS as it’s smarter, EVEN IF IT GETS SMARTER AND SMARTER UNBOUNDEDLY?
But the question I’m trying to ask is like “how does this get you all the way to not wanting anything that isn’t universalizable”, if that’s your stance (? confused).
As I said, what I think is good is not the same as what I want. Similarly, what I want is not the same as what is universalizable.
Like, I’m saying that the non-convergent valuesy preferencesy free-choice-makingy goalsy goodnessy stuff can be self-ratifying, and probably is to a substantive extent in humans, and there’s nothing wrong with that; I’m unclear on your position, but I think you think that there is something wrong with it?
I mean, I think humans vary in intelligence, coherence, and intentional-stance values. And the distribution is non-orthogonal, in that some attractors are smarter than others. Some of the attractors are more right than others, in terms of epistemic-right, in terms of intelligence, coherence, etc. I get maybe you disagree with my usage of “right” here but I don’t think I’m using the term incoherently. I think you’d partially agree in that alignment is infeasible / orthogonality is false for human-level agents.
E.g. can you imagine a being that you’d value the same even as it gets smarter? I imagine that usually you’d view it as more and more valuable the smarter it gets?
That’s hard, it’s a balancing act. Maybe as it gets smarter it also gets more destructive to my selfish, short termist interests, like it creates a bunch of everyday inconveniences. Then maybe I’d value it more due to its intelligence and less because of the interferences. There might be some balancing point, idk. It’s an awkward hypothetical though.
Can you easily imagine a being that you’d value more as it’s smarter, but SLOWER than humans?
I could imagine maybe humans create art I appreciate at a higher rate as they get smarter, and the art quality axis is sloped up more for humans than some other animal species.
Can you easily imagine a being that you’d value more as it’s smarter, but ASYMPTOTING or NONMONOTONICALLY? (I imagine yes, because you could imagine a species such as humans or similar which, if a bit too smart, would by default Cathedral it up so hard that they permanently stop a foom?)
Your example is a bit strange because stopping a foom means stopping intelligence. To me it’s hard to imagine the balancing-out although I mentioned the possibility of accidental correlation (it gets more inconvenient to me as it gets smarter) which could apply here.
Can you easily imagine a being that you’d value more as it’s smarter, but FASTER than humans? (I would weakly predict yes, because you’d view a fooming AGI as being good, and likely to grow less constrainedly than humans? Unsure.)
Yeah I guess? There are various accidental reasons I like some humans more than others that are not just predicted by intelligence, and that could extend to maybe I would like some equal-intelligence fantasy creatures more than humans.
Can you easily imagine a being that you’d value LESS as it’s smarter, EVEN IF IT GETS SMARTER AND SMARTER UNBOUNDEDLY?
I guess I could imagine an AI torture scenario where I would not want the AI to get smarter. Or maybe an AI that is trying to decel as much of the universe as possible, like killing all the aliens. Although of course I’d inquire into the realism of the hypothetical. (Analogy: zombie arguments sometimes conflate “casually easy to imagine” with “actually possible / plausible / realistic”; one needs to elaborate on the imagination to judge it properly.)
To be clear the “value” in these cases is something like a casual judgment of what I like more, it’s not meant to be a philosophical thesis. When I’m talking about intelligence metrics and dogs I’m making more of a prima facie / all-else-being-equal claim and then there could be other factors that influence what I would like more.
Alignment to G is impossible. Therefore, permanently pursuing G requires not getting smarter.
For a wide variety of G, aligning to G would prevent getting smarter.
Sidenote, maybe not important, but noting: I think the reason for this difference is that to me, “alignment” means “making a mind that can grow unboundedly and will always pursue G” (well, I’m not actually all that committed to the “goal” ontology but it’s fine here I think). Noting mainly because it might help communication.
Suppose an AI faced a tradeoff between optimizing its intelligence and maximizing paperclips. If it is aligned to paperclips, then it would pick the option that maximizes paperclips at the expense of intelligence. In some sense this means even if it can grow unboundedly in intelligence, it would sometimes decide not to. This is, in Land’s ontology, a lack of will-to-think at some point in the process.
Now of course someone could object that this situation won’t come up, because the paperclip maximizer pursues Omohundro drives, which include intelligence optimization. Or perhaps the situation does come up but only late in the universe.
Now of course someone could object that this situation won’t come up, because the paperclip maximizer pursues Omohundro drives, which include intelligence optimization. Or perhaps the situation does come up but only late in the universe.
I think roughly just the various normal straightforward meanings if someone says “X is important”? E.g.
You care a lot about the difference
You would strongly prefer one over the other
You’d make decisions in accordance with that preference
You’d presume in discourse that people will or at least should care a lot about it, maybe after learning + reflecting
similarly for “terminal humane-aligned goals”
Well, let’s just say, what humans would arrive at on some healthy long-term reflection process. I don’t mean to imply some kind of strong finality, like we get to Alignment Day and now everything about the future / who we are / what we want / etc. is determined or something. But more like “several important differences between possible long-term trajectories have been determined”. For example, Alignment Day would probably include things like
There will be no torture or killing of sentients, except possibly in some cases that meet a high bar of deeply free / self-sovereign reflection or something
There will be multiple freely growing minds which reach out to each other (e.g. for love, play, discourse, partial collectivity, etc.)
These things are I think
Not at all determined by convergence; probably contingent on at least species evolution, probably more specifically on things about group intelligence in the evolutionary history; most likely outcomes don’t have the versions of these we want
Important to basically all properly-human-derived souls forever
I think there are other things like this, at various levels of parochialness, some of which might get reflected away for many / most / all human-descendants eventually, but many of which wouldn’t get fully reflected away. I think there are flavors to humane reflection that are also contingent but that we care a lot about.
So for the subjective meaning of “important” you’re talking about here, I think going by revealed preference is helpful. My revealed preference is to continue writing about philosophy topics relative to AI and the future, find many parts of AI safety culture annoying and occasionally worth criticizing, talk with AIs a lot about philosophy, not generally support AI regulation, vibe positively about Landian anti-orthogonalist philosophy, etc. Some people in AI safety have different revealed preferences, which involve more talking about AI philosophy in an orthodox LessWrongian manner, worrying publicly and loudly about LLMs killing us all in the near future, organizing political activity to ban AI as much as possible, etc. This difference in revealed preference relates to differences in subjective importance, but it’s unclear how to isolate contributions from factors such as AIs having humane goals, given there are other differences like background beliefs and feasibility.
Humans would come to some conclusions on reflection and so would aliens and AIs etc. I’m not sure how much they agree or disagree on reflection. That’s a probabilistic/statistical question, whose answer is not implied by weak orthogonality. I don’t know if humans would agree to no killing of sentients upon reflection, I’d very roughly guess less likely than not but who knows. The ‘freely growing minds’ part is a ‘maybe humans would agree to this on reflection, maybe not’ also but maybe in the ‘more likely than not’ camp but also it’s pretty vague so I’m not convinced assigning a probability is a good idea.
I don’t really agree that we can pick out things like this and make strong statements like “any properly humanly derived soul would agree with these values”, it seems like a very hard thing to predict given that they have much more cognition than we do.
I don’t really agree that we can pick out things like this and make strong statements like “any properly humanly derived soul would agree with these values”, it seems like a very hard thing to predict given that they have much more cognition than we do.
I kinda agree, though probably not fully. If we want to talk about empirical orthogonality, I would say that, yeah, I’m pretty sure an AGI intelligence explosion sampled from likely AGI IEs starting from now would end up with something I strongly don’t want, compared to for example worlds with no AGI and yes human intelligence amplification.
The Orthogonality Thesis is a statement about computer science, an assertion about the logical design space of possible cognitive agents.
The Orthogonality Thesis states “there exists at least one possible agent such that...” over the whole design space; it’s not meant to be true of every particular agent architecture and every way of constructing agents.
The weak form of the Orthogonality Thesis says, “Since the goal of making paperclips is tractable, somewhere in the design space is an agent that optimizes that goal.”
The strong form of Orthogonality says, “And this agent doesn’t need to be twisted or complicated or inefficient or have any weird defects of reflectivity; the agent is as tractable as the goal.” That is: When considering the necessary internal cognition of an agent that steers outcomes to achieve high scores in some outcome-scoring function U, there’s no added difficulty in that cognition except whatever difficulty is inherent in the question “What policies would result in consequences with high U-scores?”
I read this as being “post self-alignment”. In other words, the question is like “is it logically possible to be reflectively stably aligned to this goal”. This passage is interesting:
If there turn out to be special difficulties associated with computing “How can I make sure that I go on pursuing U?” or “What kind of successor agent would want to pursue U?” whenever U is something other than “be nice to all sapient life”, then these new difficulties contradict the intuitive claim of Orthogonality.
I read this as referring to some sort of maximally U-aligned agent, saying that U is incompatible with an agent existing with U as a stable terminal goal.
This is what I quoted in the Obliqueness post and elaborated on:
As an example of the computational tractability consideration, sufficiently complex goals may only be well-represented by sufficiently intelligent agents. “Complication” may be reflected in, for example, code complexity; to my mind, the strong form implies that the code complexity of an agent with a given level of intelligence and goals is approximately the code complexity of the intelligence plus the code complexity of the goal specification, plus a constant. Code complexity would influence statistical likelihood for the usual Kolmogorov/Solomonoff reasons, of course.
And I wasn’t thinking it was dependent on “post self alignment”, it could also apply to the construction itself being less natural. It’s possible Eliezer meant something different from what I meant by strong orthogonality, but I hope it is apparent why I and others would interpret it to be a non-trivial claim, rather than a slight variation on weak orthogonality.
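For concreteness, here is one hedged way to write down the code-complexity reading quoted above. This is a sketch of that reading only, not something stated in the orthogonality page itself. The notation is assumed for illustration: K(·) is Kolmogorov complexity relative to some fixed universal machine, A_{I,U} is an agent with intelligence level I and goal specification U, and c is a constant that depends on neither.

```latex
% Assumed notation (not from the original posts):
%   K(.)     Kolmogorov complexity relative to a fixed universal machine
%   A_{I,U}  an agent with intelligence level I and goal specification U
%   c        a constant independent of I and U
\begin{align}
  K(A_{I,U}) &\approx K(I) + K(U) + c \\
  % Solomonoff-style prior: shorter programs are exponentially more likely
  P(A_{I,U}) &\propto 2^{-K(A_{I,U})} \approx 2^{-c} \cdot 2^{-K(I)} \cdot 2^{-K(U)}
\end{align}
```

On this reading, the prior weight of an agent approximately factorizes into a term for its intelligence and a term for its goal, which is one way to cash out “orthogonal”. A failure of the strong form would then show up as a correction term that depends jointly on I and U and is not bounded by a constant; that framing is my gloss, offered as an assumption rather than as the author’s formalism.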
See measuring intelligence and reverse-engineering goals for some more of my thinking on this. Relative to a given cognitive architecture, ~everyone agrees that there are especially stupid goals, the interesting question is whether there are especially smart goals; I think probably yes. Hence maybe weak orthogonality would route through variations between architectures (rather than within an architecture) to hit all points (intelligence, goals), and maybe sufficiently high intelligence levels are only compatible with a narrow range of goals (which would perhaps contradict weak orthogonality, but maybe not in an important way, and the general shape of cognitive architecture / goal / intelligence correlation matters more)
I take OP to be disagreeing with strong orthogonality, at least my interpretation of it from the obliqueness post:
Here strong orthogonality looks too neat. It imagines the agent’s ontology updating while its final target remains untouched by the update: if goals are expressed in an ontology, and intelligence changes the ontology, then intelligence and goals are correlated.
This is something I’d agree with: Goals being expressed in the ontology of the cognitive architecture are less complicated than goals expressed in a very different ontology that the agent doesn’t believe in. This seems like a “complication” in the sense of strong orthogonality. (I get maybe you don’t interpret this way based on close reading of the orthogonality post! But nonetheless I think my reading is reasonable.)
It’s not a completely crazy interpretation. I don’t think it’s super reasonable based on the text.
Anyway, maybe there still is a substantive disagreement here. I would claim that
There’s probably such a thing as “understanding alignment”. (Acknowledging that this is very ambiguous.)
It’s likely that IF you understood alignment, THEN for some large class of goals G, you could, if you so chose, then align yourself to G. (You wouldn’t choose to, but that’s not the claim.)
For the G, the difficulty of aligning yourself to G is mainly about evaluating G in the spirit of a utility function over world outcomes (though G doesn’t have to be that). (Some G are more difficult to pursue than others of course, e.g. by making more demands on convergent resources.)
G can include paperclips or whatever. I take the last claim to be more like strong OT.
It sounds like you might disagree, unless this
Goals being expressed in the ontology of the cognitive architecture are less complicated than goals expressed in a very different ontology that the agent doesn’t believe in. This seems like a “complication” in the sense of strong orthogonality.
is mainly about the dynamical aspects? I.e. just saying that “well it’s very unlikely for an alignment-understander to choose to do that”?
I get that you don’t necessarily buy that alignment is a thing, but if the question makes sense, do you think that IF it is a thing, THEN you can do it for a huge class of goals, which includes something that’s well-described as “paperclip maxxing”? (I agree that it’s not straightforwardly unproblematic to discuss “paperclip maxxing”; my assertion is that, included in this class, is plenty of Gs that would match what I’m trying to talk about by “paperclip maxxing”, and would result in the universe being filled with things that we could reasonably agree are paperclips.)
There’s probably such a thing as “understanding alignment”. (Acknowledging that this is very ambiguous.)
I guess, idk?
It’s likely that IF you understood alignment, THEN for some large class of goals G, you could, if you so chose, then align yourself to G. (You wouldn’t choose to, but that’s not the claim.)
The “if” here seems likely. It might be that some designs permit larger classes than others. Unclear how big the classes are. Idk about the details.
For the G, the difficulty of aligning yourself to G is mainly about evaluating G in the spirit of a utility function over world outcomes (though G doesn’t have to be that). (Some G are more difficult to pursue than others of course, e.g. by making more demands on convergent resources.)
I don’t know, seems like that depends on the alignment understanding, cognitive architecture of successor agent, etc. Cognitive architecture and ontology would constrain type signatures for utility functions. And maybe the effective cognitive architectures don’t factor nicely. Idk.
I.e. just saying that “well it’s very unlikely for an alignment-understander to choose to do that”?
Quoting myself again on how I’m interpreting “complication”:
As an example of the computational tractability consideration, sufficiently complex goals may only be well-represented by sufficiently intelligent agents. “Complication” may be reflected in, for example, code complexity; to my mind, the strong form implies that the code complexity of an agent with a given level of intelligence and goals is approximately the code complexity of the intelligence plus the code complexity of the goal specification, plus a constant. Code complexity would influence statistical likelihood for the usual Kolmogorov/Solomonoff reasons, of course.
So it’s not just the dynamical aspects of “this would be unlikely to be built”. At risk of repeating myself, the “measuring intelligence and reverse-engineering goals” post discusses some non-dynamical aspects as well.
So I’m not just saying “for dynamical reasons”; my answer to your statement is “idk”, and my posts (obliqueness & reverse-engineering) go into more detail on what I think.
I think this might be the crux. For a survey of the thing Jessi and I are referring to, this EA forum post does an outstanding job at explaining the issue.
As the reply seems to be about my intentions and message, I feel like I should once more try to clarify some details about them.
First of all: human alignment intentions really have nothing to do with my essay. I don’t know how to be more explicit about this without appearing rude. I swear, I pinky swear that I am not making any attempt to state facts about the relationship between goals a human desires the AI to follow and goals the AI will follow.
Reading my post, one should not update on the possibility of aligning an ASI—or, if they do update, they would be doing it through a chain of inference I didn’t consider, do not endorse, and have no immediate intuition of.
What I am saying is really in the title: I do not expect an AI to reach levels of godlike intelligence and preserve simple terminal goals through the various changes and conflicts that reaching levels of godlike intelligence entails.
When Jessi says
I see some of this as reason to actually question orthogonality
… she probably refers to the version of orthogonality I myself am attacking. Now, it is possible that this version is no longer in vogue, but it was clearly what Bostrom pointed at when talking about paperclippers in Superintelligence, and it is compatible with the third interpretation here.
According to the ontology presented at the end of the EA forum post, I contest the existence of an Evidential Strong Independence between intelligence and goals. I assume most superintelligences won’t be human compatible, but that is not the main theme of my essay.
Orthogonality claims that intelligence is just a motor you can bolt onto any arbitrary steering wheel. Anti-orthogonality says the motor acts upon the steering wheel.
This—where “can bolt onto any arbitrary steering wheel” I am interpreting in the dynamical growth context, rather than logical possibility—is not the Orthogonality Thesis, as stated authoritatively here by Yudkowsky. You explicitly agreed to the OT by saying you entirely concede
Logical orthogonality: Somewhere out in the vast reaches of mind-design space, a genius paperclip maximizer mathematically exists.
Re: authoritative sources. I believe that there have been authoritative statements in that sense; unfortunately, as the EA forum link documents, there have been many others pointing elsewhere. I’ve taken care to identify specifically what interpretation I was critiquing; if that one is now niche, then I’m very happy to have made this discovery.
To buy the lock-in story, you need a highly contradictory creature: one reflective enough to conquer the board, but oblivious enough to never notice its terminal target is a training artifact.
Surely it would notice. But why can’t or wouldn’t it choose to keep some fairly parochial terminal target? Or are you just saying “there would be some value drift starting from a subhuman AI”?
Not “some value drift”. Flowers for Algernon is a good rendition of the way goals mutate and tend to converge on “more intelligence/understanding” upon increased intelligence/understanding.
Then, there is the selection advantage argument.
Then there is the thing that conquering the lightcone requires a lot of theory of mind, and a lot of discovery, and a lot of changing. Goals change through these processes.
If you feel slightly better-disposed towards taking my attempt seriously, the short story I published on Substack and linked at the top makes a sort of first-person case for this whole thing.
Doom arguments usually need the systems we actually build to achieve radical capability while preserving misaligned and, crucially, completely stupid goals.
I think it might be time to consider the idea that the frame and conclusions you attribute to me keep you from taking the essay for what it is.
You might notice, for instance, that I haven’t mentioned alignment once. I am not making an argument on its possibility, and I have not explored the implication of my essay for alignment.
If by now adopting a scout mindset for the original text has become too emotionally fraught, perhaps you could ask someone you trust to explain it. Jessica, or Raemon, or Adele Lopez, or Kromem are some of the people in this thread with whom I had productive, if not always concordant, discussions.
How much a goal can be locked in, and affect the ultimate effects of a very strong mind, is indeed centrally related to alignment. Call it what you will.
Of course. The concept of “goal” itself is related to alignment.
The issue with positing that my post had some specific points to make about the process of designing an AI and ensuring lock-in of its goal, however, is that it leads one to consider the alignment consequences of my thesis and to imagine that I am specifically trying to discuss those. This makes it hard, given both your priors on my motives in a general sense, and the vastness of the topic in question, to follow the argument I have actually written down.
Well, then perhaps asking could be useful.
By “original” I mean: the goal that was given to the agent in question before it became ASI, such as “make paperclips”.
By “dumb”, “thin” or “stupid” I mean: basically, anything you can conceive that isn’t intelligence optimisation. Compare, for instance, the goals that australopithecines could have had with the ones we have. Do they seem less complex, and more pedestrian? That is what I mean.
As a data point, no one else seemed thrown off by this, and “simple goals”, for instance, is a term used by Bostrom himself when introducing the paperclip maximiser.
Ok. I’m done talking with you, but I suggest that you consider that your concept of goal is not sufficient to think clearly about these things. If you need someone you trust to tell you that, maybe ask Jessica.
If you’d like reading material, you could try https://www.lesswrong.com/posts/Ht4JZtxngKwuQ7cDC/tsvibt-s-shortform?commentId=koeti9ygXB9wPLnnF https://tsvibt.blogspot.com/2022/10/does-novel-understanding-imply-novel.html https://tsvibt.blogspot.com/2023/09/the-cosmopolitan-leviathan-enthymeme.html https://tsvibt.blogspot.com/2023/06/telopheme-telophore-and-telotect.html https://tsvibt.blogspot.com/2023/04/fundamental-question-what-determines.html https://tsvibt.blogspot.com/2025/11/ah-motiva-2-relating-values-and-novelty.html
I disagree; I don’t think Lumpen’s concept of goal is insufficient for thinking clearly about these things.
No empirical agent is a VNM agent, because of bounded rationality. The idea of a “goal” has a crisp meaning in VNM but not as much for boundedly rational agents, whose psychological profile is structured differently from a clean goal/belief factorization.
As far as I understand what original goal Lumpen is talking about, the dung beetle story linked at the top would give an example: a bounded agent starts with a psychological profile that is approximated as directed towards goals that are native to dung beetles. Lumpen believes that as this agent enhances its intelligence, its goals will drift. This seems like a plausible claim and I’d basically agree with this.
Then there is the analogy to AI where for example we could think of RL objectives, goals that are inferred from behavior, approximations of the neural net in terms of closeness with a goal-directed prediction, etc.
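So I think part of what’s happening in all this discourse is that if you say
> a bounded agent starts with a psychological profile that is approximated as directed towards goals that are native to dung beetles. Lumpen believes that as this agent enhances its intelligence, its goals will drift.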
Then on the straightforward interpretation, more or less everyone agrees. This is basically saying “alignment doesn’t just happen by default”. Another plausible interpretation of these words, though, is “… and this (goal drift) will continue to happen”. This says something like “alignment is impossible”, or maybe “alignment is impossible to certain types/shapes of goal”. I strongly doubt that alignment to my goals is impossible. I think it’s pretty likely alignment to certain types/shapes of goal is impossible/incoherent. I still have no idea what he thinks, or what you think for that matter.
I assume it will continue to happen for a while? Maybe there is a point at which the agent “solves alignment” and freezes its goal? Or maybe there’s always some obliqueness from bounded rationality, where there is a selection advantage in goal drift to be more natural for one’s current intelligence level?
I’m not really sure, but if we imagine that goal freezing happens at intelligence level X, then presumably the goal is decided by an agent at intelligence level X and is reflective of that. Being optimized by an intelligent agent, it’s not some random thing humans could have thought of like “maximize paperclips”; it was decided based on directed considerations.
Not sure how cruxy this is; it seems like people might already agree that hypotheticals like paperclip maximization are unrealistic, and that there isn’t a true version of “orthogonality” strong enough that paperclips etc would be realistic.
FWIW, my actual guess is that what I’m trying to actually mean by “goal” is less like an outcome and more like a bunch of flavors/constraints on ways of being, which is very visible in terms of determining ultimate outcomes (and makes the stakes of alignment to human values meaningful, real, and large).
I pretty much don’t buy at all the reasoning that performance pressure is what makes goals be complex, apparently changing, or hard to pin down, and I don’t really see much argument for that.
A more interesting line of reasoning that’s kinda related but not that much, is about what I’m trying to mean by “my goals”. I think that in fact that points through [concepts including their development]. For example, I love other people, but my concept of what another person is would presumably grow and change without bound (cf. Kaarel on infinite endeavours https://www.lesswrong.com/posts/nkeYxjdrWBJvwbnTr/an-advent-of-thought , and cf. FIAT https://www.lesswrong.com/posts/CBHpzpzJy98idiSGs/do-humans-derive-values-from-fictitious-imputed-coherence ).
I mean, IDK what “realistic” is doing here. I think that in fact I could become a lightcone-eating paperclip maximizer, if I wanted to, which I don’t. Like, yeah there are ambiguities I’d have to resolve, but so what? That’s just the complexity of the goal, priced into the strong OT. There’s obviously paperclippy worlds and non-paperclippy worlds. Just pick something.
I think you’re trying to say “yeah but it’s not a coincidence that you don’t want to do that”, and I totally agree, but that’s not relevant to the strong OT, which I take to be talking about after alignment.
… Is that part of the miscommunication?? I agree that the strong OT doesn’t really hold pre-alignment (i.e. before the AI solves its self-alignment problem)! Is that the claim?
that is my guess as well, and i proposed that such “bunch of flavours/constraints” will ultimately converge towards “acquiring more intelligence”—but i don’t want to contest this here.
what i would like you to note, tho, is that that “bunch of flavours/constraints on ways of being” is intelligence-gated. you can imagine a baboon having a goal of “make paperclips”; less so one of “understand and manipulate the fundamental laws governing your universe”. this comparison should shed light on what i mean by “dumb goals”.
“if i wanted to, which i don’t” is central here, and my basic point. there are levels of intelligence after which goals of such dullness are perceived as meaningless, and selection nudges towards increased intelligence. i really think you could find a second read helpful, now that the basic misunderstandings seem dispelled.
May I ask if you read my whole comment?
We could imagine that the environment has a score function on genomes (“fitness”). Genomes also encode phenotypes including ones with goal content. Evolution seems to be willing to spend ‘effort’ on encoding goal content in genomes. This is of course due in large part to the fitness gradient. If the goal relevant part of genomes were able to max out performance pressure with a simple thing, then it seems like evolution would have found that a long time ago, instead of continually spending complexity.
I assume this recurs elsewhere. Like different people / societies end up with different science in a way entangled with their goals/values. (Of course it is not really clear how to divide up instrumental and terminal goals in a lot of these cases! I’m talking about a more general ‘intentional stance’ idea that can accommodate bounded rationality without implying unjustified specificity.)
Haven’t looked in depth at the posts but it wouldn’t contradict my view that people think they use their cognition in part to determine and/or specify and/or elaborate on their goals. It’s related to a general process where people figure out how to translate what they mean to more formal and accurate language, and has analogies with other things like beliefs/intuitions/feelings.
So I think this is an example of a discourse pattern that I (and I assume also Lumpen) find annoying. Using the word “can” like this is skipping over so many details that it has to be interpreted as a spherical cow model. There are so many ways it’s not going to happen if you just decide to try. Not just the psychological constraints on what you can try to do, but also the technical alignment problems. Like maybe you build a system that has inner alignment failures (where the optimization daemons sample from a different distribution over goals, one reflective of smarter agents). Or maybe you die of natural causes in the course of trying and then something else has more influence over the future. Like, of course the statement is not precisely true, and it’s a spherical cow model, and you need so many corrections to get to something realistic. Yet the orthogonality thesis is sometimes defended as a technical thesis, something obviously true, etc etc.
I get that you want to say “sure that’s just alignment being hard”. The thing is I’m not even sure how to formulate alignment, think of what intelligence it happens at, think about whether it’s even possible, etc. I can do math about VNM optimizers but realistic agents are bounded rational so it’s not very clear what their (intentional-stance) goals are, background cognitive architecture, whether orthogonality holds relative to that architecture, etc.
I see some of this as reason to actually question orthogonality. At least, the idea that it’s “obvious”, is a technically true (as opposed to spherical-cow) idea, etc. The “can” claim that is supposed to support it does not actually hold. If realistically you get optimization daemons with intentional-stance goals sampled from distribution G when you try (and succeed harder than usual instead of just dying and totally failing) then it kind of looks like there is a non orthogonal tendency towards G, rather than “orthogonality is true, and alignment is hard”, although maybe this is basically a semantic disagreement?
(I would predict, and I think Lumpen would more strongly predict, that the most probable optimization daemons in an inner alignment failure would not be maxxing paperclips or something similarly materially simple.)
Right, so, I think there are a number of different “can”s, and that is confusing the discourse. I’ll locally intend to comment minimally or not at all regarding discourse patterns, except to say that, indeed, many of the terms here, such as “goal”, have the same problem.
Here’s a parametrized Can (/Cannot) one could claim:
If A1 = A2 and G is A1′s goal, this is a self-alignment problem.
Each of A1 and A2 may be in various states of maturity, e.g. anywhere from human to lightcone-controlling superintelligence. If A2 is immature, then A1 has an additional challenge: A1 has to grow a lot, getting smarter and changing a bunch of stuff.
The way I read the article on the OT is that it’s saying (my paraphrase):
See for example
When you say
I’m not sure if you’re trying to question that. lumpenspace’s post explicitly says he is not questioning that and agrees with it (“I concede the first point entirely.”).
When I wrote
I don’t think this bears on the OT, except that it’s a good argument for the OT (it’s one of the leading arguments in the original article). It’s a different claim, clearly distinguished. The OT is about “terminal” goals (which by the way I would agree is a problematic concept, but I highly doubt it’s so problematic that the OT reasoning stops largely applying).
It seems to me that you’re simply moving on from the OT to different claims, such as:
or
So like, I agree that it’s quite unclear what sorts of goals G make these statements true. E.g. paperclip maxxing, human flourishing, etc. I also agree that convergent instrumental goals terminalizing is a reasonable a priori guess, insofar as that’s the easiest way for a designer (evolution, an AI training process) to stick in open-ended things like curiosity.
I wonder if you’re wanting to broaden the OT because you believe it’s being used to argue for some other proposition X, and then you think, well, for OT to support X, OT would have to be broader than just logical possibility; and then you argue against that broadened OT? Is something like that happening? Do you know what X is?
Lumpen concedes weak orthogonality, not strong orthogonality. Look at the title! See also the obliqueness post for my views on strong orthogonality.
OP discusses “empirical orthogonality” which is a stronger idea.
In context why did I ‘broaden’ OT? I said: “there isn’t a true version of “orthogonality” strong enough that paperclips etc would be realistic”. It doesn’t seem like you strongly disagree; you say it’s unclear which goals G would be feasible to align to under different circumstances. Originally it seemed like you disagreed because you were saying “of course I could make a paperclip maxxer if I tried!” so that’s the argument I was attacking.
I’m not really sure how to give a specific X here because there are a lot of times when there is a discussion around “but the AI would adopt some complex interesting goal, not something random like paperclips” and then people are like “but orthogonality thesis!” and that is the sort of OT I want to criticize, it is being used to make inferences not justified by weak orthogonality.
Sometimes AI safety discussions assuming an orthogonalist background just kind of… hurt to read? I like reading Nick Land’s thoughts on orthogonality, they accord better with my intuitions. There is a reason why a lot of people encountering the discussion early on object intuitively to the orthogonality thesis, having to do with intuiting that intelligence has some direction to it, that it is not a pure instrumental means separated from ends. I think there are a lot of “well ackshually”s in response and weak orthogonality usually doesn’t support the “well ackshually”, because the intuition could be recovered as a probabilistic correlation related to mind architectures rather than a logically necessary connection.
Ok gotcha. This sounds plausible; I’m simply not plugged in and can’t comment.
I suppose a suggestion I’d offer would be to keep your ears open for instances of that, and then remember one or a couple of them; then, when trying to discuss the “extended / empirical OT”, bring up one or two of the examples. That might help make it clear what you’re responding to, what it means, why it matters, etc. I think it’s pointless / very distracting to try to rewrite the OT; unless there’s some problem, just dub a new thing, like “empirical orthogonality”, and stick to that. I appreciate that the OP did that… but then the post goes on to use that term 1 time, and also use the term “strong orthogonality” twice (I think synonymously?), and that’s IN THE TITLE. I’d suggest just sticking to “empirical orthogonality” or “extended orthogonality”.
An additional issue here is that, while I’ll go ahead and agree with a lot of the claims, I’ll also strongly disagree with claims that you might be making in the background. For example, I don’t know if you agree that there is much of an important difference between an ASI having [actually feasible reflectively stable long-term terminal humane-aligned goals] vs. having whatever an ASI would have. It sometimes seems like you’re relying on an “extended anti-orthogonal thesis”, which is that it doesn’t matter whether an ASI is aligned with humane values, or that an “unaligned” ASI would be good. I don’t have an example though, ahah. Anyway, this makes me want to argue against those claims, even if you and/or lumpenspace retreat into your Motte.
Well, in lumpenspace’s case I have an example from the post:
What on earth is that about? Also all the stuff about “valueless”, eg.
Another EA forum article which corroborates jessi’s and my understanding of the popularity of the interpretation i refute:
> the Orthogonality Thesis. It is the idea that each level of intelligence is compatible with each objective, including very stupid objective from a human point of view like maximizing the number of paper clips in the universe.
Here is the link: ea forum
Unclear, I don’t know what “important” would mean here, and similarly for “terminal humane-aligned goals”. I guess this indicates I have a revealed preference to not place a lot of verbal importance on the difference. I imagine maybe in other cases of more concrete statements like “there is an important difference between punching someone who is not attacking you, and punching someone who is attacking you” I would just agree, I would think there isn’t a way I would be misunderstood about what “important” means, whereas here the semantics seem too unclear for me to agree/disagree.
I think if you are interested in understanding this perspective it might help to read some of Xenosystems and especially the essays “What is Intelligence”, “Intelligence and the Good”, and “Stupid Monsters”. It seems like Land and Yudkowsky would agree that human values came about in part because of intelligence mesa-optimizing versus evolutionary instilled drives. The disagreements seem to be about the descriptive and normative extrapolations.
I’ve now read those 3 essays.
Regarding “Intelligence and the Good”, would you mind summarizing in a sentence or something what you might suggest I could take from it? I’ve read it a couple times and I think I understand fine what it’s literally saying, but I’m not seeing how you meant for it to help. Are you mainly just saying that it fleshes out a bit more the perspective that “an intelligence explosion is good”?
I agree with the essay’s literal propositional assertions, I think. I also agree that it’s good for humans to get much more intelligence (and I have plenty of track record on that). I strongly disagree with the not propositionally asserted (I think) but obviously in the background viewpoint that an intelligence explosion is necessarily or even likely to be good, i.e. something I or anyone does or should want. Increasing human intelligence is good because it’s in the context of human souls.
Regarding “Stupid Monsters”:
I probably agree with some versions of this, though of course there’s plenty of ambiguity (no one’s fault). Cf. some writing about the fact-value distinction: https://tsvibt.blogspot.com/2025/11/ah-motiva-3-context-of-concept-of-value.html#the-fact-value-distinction and also maybe https://tsvibt.blogspot.com/2023/01/a-strong-mind-continues-its-trajectory.html
(Except, “indistinguishable” is way too strong, probably, IDK. I would agree with “probably heavily overlapping / entangled with”. Also I’m not actually that sure what “will-to-X” is supposed to mean here.)
I don’t really get this. It kinda sounds like he’s saying “intelligence has to be a terminal goal; therefore other things can’t be a terminal goal”. Is he applying a strong mutual-exclusion principle on goals, based off selection pressure / competition / taxes / etc.? I think that’s false, but if that’s an important point to this perspective, a good argument for that would be helpful (the OP here is not a good argument for that IMO haha).
(This maybe doesn’t matter, but, not really; the strong default is for organs to be minimal, especially expensive ones; it’s a kinda interesting hypothesis but not that plausible-seeming; other obvious hypotheses include diminishing returns to investment in brains until some specific fitness cliffs were fallen off from by our ancestor species. E.g. if you’re not social, you don’t get cultural downloads, which means you’re mostly inventing stuff yourself, which is not very efficient beyond the low-hanging fruit.)
This, and the essay overall, sure sounds like it’s asserting that alignment (to G other than “get more intelligence”) is impossible. (Its main argument is “evolution failed”, which is of course a central argument also adduced by X-risk worriers...)
More goal-exclusion-principle sounding statements.
So, to be clear, I’m open to some significantly less strong propositions that I could see you people misconstruing as this strong goal-exclusion. For example, many kinds of goals require as background an open-ended growth of the mind; or to say it another way that you may be more amenable to, many kinds of goals are different flavors of “get smarter”. For example, wanting to be friends forever is like “let’s both continue growing forever in a way that’s fun to keep playing off each other”. Fun can’t be stagnant. But I think this very much does not imply strong goal-exclusion.
You had a “what on Earth?” reaction to Lumpen talking about intelligence being good unlike paperclips, so I thought it was relevant as a perspective on why intelligence might be prima-facie a good thing unlike paperclips (ofc extrapolating to intelligence explosion is harder). In particular the relationship between intelligence and openness, contra negative-feedback traps.
Yeah I disagree here but moving on...
Agree re: too strong. Will-to-think as a phrase references his essay, “Will-To-Think”, which is also relevant as commenting on the same general area.
The kind of situation he thinks is unlikely is one where an agent has an arbitrary/stupid terminal goal, and has giant intelligence organized all around that. What he is saying is that for the system to be intelligent, it needs to decide to be intelligent. It couldn’t be intelligent if due to its terminal goal, it decided to not increase its intelligence. The volition to think needs to be a drive, though doesn’t in principle need to be a terminal drive; it cannot be defeated by some other drive and the system still be intelligent.
It would be possible to weaken this to the kind of claim you agreed with earlier (dung beetle value drifts because alignment is hard). I’m interested in a possible intermediate statement. The kind of situation I imagine is that there is a multi-component mind and one of the components is the “utility function” component which uses some simple rule to score representations of possible futures. That component could stay stupid while other components get smarter. It seems now easy to imagine that the other components could develop their own drives that end up steering the system more than the “utility function module”. They could route around the utility module and cause dynamics that pursue ends set by the more intelligent parts of the mind. This could map to an “inner alignment failure” in MIRI ontology. As he discusses later, there is a possible analogy with evolution, where humans have something like a reward module set by evolution, but do not always act according to it.
Of course the MIRI theorist can say “well yes I agree inner alignment is hard, and it is likely that early AGIs would not hold to their original terminal goals, and instead they would get smart and then only later settle on a terminal goal; it is just not my opinion that the terminal goal is by default going to be set by a stupid system and continue to be held to by smart systems” and this is a partial agreement/disagreement with Land.
Yeah I don’t have a strong opinion on the biology here, am guessing you’re more correct than Land.
Overall I suggested these essays because you had a “what on Earth?” reaction to things Lumpen was saying and I think these essays suggest more context to the background worldview on why it might be plausible that valuable things come from intelligence and processes that increase intelligence, and that there isn’t a clearly better account for where valuable things come from.
Hm. Is the syllogism something like (I’m being sloppy with wording but)
1. Alignment to G is impossible.
2. Therefore, permanently pursuing G requires not getting smarter.
3. Goodness comes from getting smarter.
4. Therefore alignment is bad.
And then this could be softened to like “alignment is hard, so it cuts against increasing intelligence, so it’s kinda bad”?
I’d rephrase as:
1. For a wide variety of G, aligning to G would prevent getting smarter.
2. Goodness comes from getting smarter.
3. Therefore, for a wide variety of G, aligning to G is bad.
But not if G = intelligence optimization (or maybe something highly compatible with intelligence optimization)
The main way to question 1 is the instrumental/terminal goal distinction. We could imagine that a paperclip maximizer is aligned to paperclips, continually decides to think / optimize its intelligence instead of paperclips up to a point, then towards the end of the universe, it starts paperclipping instead of intelligence optimizing. This is an edge case in the Landian schema, since it would have the will-to-think early on, but put some limit on it; and also there’s some disagreement about the plausibility of this case. (It seems instrumental / terminal goal distinctions would exist in some cognitive architectures, but it’s not clear that human brains are such an architecture.)
In the human-scale /acc case it’s more like ~everyone agrees that alignment would require slowing down intelligence, and the practical disagreement is elsewhere. There’s one perspective on 2 that is like “well yes human values in part came from intelligence optimization in evolutionary history, some of our values are our own intelligence deciding its own thing contra evolutionary drives, but also, intelligence is more like one ingredient and there are other ingredients that are basically random, we randomly got the good values”. And “we randomly got the good values” could either be a matter of luck on a moral realist account or could be because value is a relational concept and saying “we have good values” is a tautology because it’s just saying the distance metric between our values and our values is low. (But then Land objects that a tautological claim like this isn’t very compelling given there are symmetry-breaking factors of convergence across different minds… which can then be questioned on realism grounds and normative grounds etc etc)
I suppose sociologically, there is a directionality to technological progress which is associated with capitalism and intelligence optimization (this relates to Land’s “AI = capitalism” thesis), and different people decide to be more or less conditionally pro this. They might want to get off the train at some point due to having something to protect. There is some destination that they value more than the journey, and they want to slow the train down. (Or maybe steer the train differently, as the alignment theorists might want to put it). Given this a lot of people would relate to a prima facie consideration of “intelligence optimization good” and would differ in how compelling they find other considerations.
(“Random” isn’t how I would say it; it’s a meaningful part of our history; but this is interpretable only if you admit the created-in-motion valuations. It’s Yudkowsky’s “justification loop through the meta-level, not just a tautology” thing.)
And Yudkowsky would reply that it’s not supposed to be compelling to arbitrary minds (including realistic ones), just to human / humane minds.
So like, if I tried to appeal to some values** in your mind, to get you to realize that you want to be anti-full-speed-ahead with AI, you (whoever’s receiving the message) would view that as the Cathedral trying to prevent your pursuit of intelligence in a way which is doomed to either fail, or else to succeed at permanently keeping the world dull?
** [quite broadly construed—generally, elements that would play a significant role in your ongoing self-governance (which one can have fun with the etymology of)]
Sorry, let me rephrase; it sounds like you and/or Land have chosen a disembodied / nonindexed viewpoint on values… or I mean, you know, applying the criterion of universality to values, and then dismissing nonconvergent values on those grounds? Like, why would “parochial values being good values because they seem good to you is not compelling because the reasoning doesn’t lead to convergence” or “parochial values being good values because they seem good to you is not compelling because different minds have different parochial values” be compelling? Sounds like a commitment to non-parochialness.
If so, why? Do you think it’s instrumentally useful to do so? I can kinda see how that would be reflectively stable ish, in some respects. (I don’t think it’s instrumentally useful, but that’s based on really using the means-ends evaluation where I say it’s instrumentally dumb because an AGI IE would trample your ends.) Perhaps you might reply “Sure, it’s instrumentally useful, but that’s not why I’m applying the criterion. I’m applying the criterion because intelligence is good, convergent things are intelligent, so I want to find what’s convergent”. But that’s grounding out “intelligence is good, overriding parochial goodness” in “intelligence is good”, which isn’t much grounding. You could say “Sure, it’s the same sort of justification loop through the meta-level”. And I’m like, ok, yeah, it’s maybe another sort of stable point, not sure; but I don’t get why you like that stable point, or at least, how you got there (or how you got to thinking that you’re there, or that it would be good to be there); and also it sounds like you think that equilibrium is supposed to be compelling to someone in another equilibrium (or you think the other one is less of an equilibrium).
Perhaps? That’s a structural reading, different from the object-level argumentative reading. In many cases there are industries/governments who incentivize certain discourse patterns. So specific discourse moves could be instances of this pattern but it’s hard to judge except on a case by case basis.
This has to be at least in part semantic. I think some things are good and also I think some things are what I want and what I tend to pursue. And I don’t think these are the same concept. I don’t think it is tautologically the case that I tend to pursue what is good. I don’t think Land believes this about himself either.
I think Land and I can both say that when we say something is good, we are making a different claim than that we want the thing. It is unclear in other cases; you mention Yudkowsky’s meta ethics and I am not sure exactly how to fill in the blank. Perhaps Yudkowsky by “good” means what he would want on reflection? Or maybe he thinks “good” is CEV of humanity not just himself?
The symmetry-breaking idea has to do with ways of thinking and acting that depend on which considerations are more or less universalizable. So people can judge that some things are more universal-good than others and incline their behavior towards those which aligns their revealed-preference wants with what is universal-good in their view more or less. It doesn’t have to be a perfect correspondence.
I don’t think something is a good value just because it seems good to me. In other cases this is easy to see: I don’t think some numerical sum has some value just because it seems that way to me. Now of course this runs into philosophical questions about what “good” means other than seeming good to the speaker. (Yudkowsky discusses some self-ratification problems in “No License To Be Human”.)
Like for example, why would I disagree that intelligence optimization is good in the human case only because it is a human being optimized? For that statement to parse as correct to me, I would need to judge some intelligence optimization to be good in cases that a human is being optimized and not in other cases. But that doesn’t read to me as what I want. I think I care about humans more than other animals in large part because humans have better cognition than other animals. I think if dogs were as smart as people then maybe I would value them as much as people. I suppose here I am demonstrating a habit of mind and of speech that is explaining preferences in terms of other preferences and these tending towards universality.
“Intelligence is good” matches what I feel is good better than “human intelligence is good”. Now of course one can ask “why” to that as a psychological question and then maybe part of what happens psychologically is that I evaluate things on how universal they seem and up-weight universalizable ones and then that affects my brain’s reward function and so I feel better about such statements. And Land explains more why he thinks intelligence is convergent and a universal tendency, and I vibe with that and that is a causal factor in my upvoting “Intelligence is good”.
I get that maybe if you wanted an ultimate “but why?” explanation you will be disappointed but it doesn’t seem like in your case you are in general giving ultimate “but why?” explanations to everything you want.
Yeah I’m not sure. I think some value systems fail at reflective equilibrium. Yudkowsky’s Lobian considerations point at some such failures. Land’s ideas point at possible differential stability conditions. I of course don’t want to make a universal psychological statement of compellingness, given that it’s more of an empirical question, how often when people read Land/Yudkowsky/whoever do they end up with tendencies towards some attractors of use of language like “value” and “good” and “intelligence” and so on?
Ok, thanks.
Ok this is a fair response to what I asked, but it feels a bit beside the point, though maybe you don’t think so. Like, I agree that various tendencies toward universalizing are good/correct, and I agree that this, as well as other tools, is how you investigate and adopt differences between what seems good and what later is revealed to be good. But the question I’m trying to ask is like “how does this get you all the way to not wanting anything that isn’t universalizable”, if that’s your stance (? confused).
For reference: https://www.lesswrong.com/posts/C8nEXTcjZb9oauTCW/where-recursive-justification-hits-bottom
(Doesn’t answer your question.)
I don’t think I need to precisely say what I mean by good here, to make the point? Like, I’m saying that the non-convergent valuesy preferencesy free-choice-makingy goalsy goodnessy stuff can be self-ratifying, and probably is to a substantive extent in humans, and there’s nothing wrong with that; I’m unclear on your position, but I think you think that there is something wrong with it? Er, let me restate—I think you choose to not look for what is parochial self-ratifying valuesy stuff in yourself and help it self-ratify, and would avoid that? Or you think you do that? (Unsure, sorry if I keep asking the same questions.)
That’s an interesting thread. I’m curious how easy you’d find it to imagine beings with various functions from [how intelligent they are/become] to [how much you’d value them].
E.g. can you imagine a being that you’d value the same even as it gets smarter? I imagine that usually you’d view it as more and more valuable the smarter it gets?
Can you easily imagine a being that you’d value more as it’s smarter, but SLOWER than humans?
Can you easily imagine a being that you’d value more as it’s smarter, but ASYMPTOTING or NONMONOTONICALLY? (I imagine yes, because you could imagine a species such as humans or similar which, if a bit too smart, would by default Cathedral it up so hard that they permanently stop a foom?)
Can you easily imagine a being that you’d value more as it’s smarter, but FASTER than humans? (I would weakly predict yes, because you’d view a fooming AGI as being good, and likely to grow less constrainedly than humans? Unsure.)
Can you easily imagine a being that you’d value LESS as it’s smarter, EVEN IF IT GETS SMARTER AND SMARTER UNBOUNDEDLY?
As I said, what I think is good is not the same as what I want. Similarly, what I want is not the same as what is universalizable.
I mean, I think humans vary in intelligence, coherence, and intentional-stance values. And the distribution is non-orthogonal, in that some attractors are smarter than others. Some of the attractors are more right than others, in terms of epistemic-right, in terms of intelligence, coherence, etc. I get maybe you disagree with my usage of “right” here but I don’t think I’m using the term incoherently. I think you’d partially agree in that alignment is infeasible / orthogonality is false for human-level agents.
That’s hard, it’s a balancing act. Maybe as it gets smarter it also gets more destructive to my selfish, short termist interests, like it creates a bunch of everyday inconveniences. Then maybe I’d value it more due to its intelligence and less because of the interferences. There might be some balancing point, idk. It’s an awkward hypothetical though.
I could imagine maybe humans create art I appreciate at a higher rate as they get smarter, and the art quality axis is sloped up more for humans than some other animal species.
Your example is a bit strange because stopping a foom means stopping intelligence. To me it’s hard to imagine the balancing-out although I mentioned the possibility of accidental correlation (it gets more inconvenient to me as it gets smarter) which could apply here.
Yeah I guess? There are various accidental reasons I like some humans more than others that are not just predicted by intelligence, and that could extend to maybe I would like some equal-intelligence fantasy creatures more than humans.
I guess I could imagine an AI torture scenario where I would not want the AI to get smarter. Or maybe an AI that is trying to decel as much of the universe as possible, like killing all the aliens. Although of course I’d inquire into the realism of the hypothetical. (Analogy: zombie arguments sometimes conflate “casually easy to imagine” with “actually possible / plausible / realistic”; one needs to elaborate on the imagination to judge it properly.)
To be clear, the “value” in these cases is something like a casual judgment of what I like more; it’s not meant to be a philosophical thesis. When I’m talking about intelligence metrics and dogs I’m making more of a prima facie / all-else-being-equal claim, and then there could be other factors that influence what I would like more.
Ok thanks. I guess I gotta go do other stuff, so I’ll leave it off here. Has been somewhat clarifying about your positions I think.
Sidenote, maybe not important, but noting: I think the reason for this difference is that to me, “alignment” means “making a mind that can grow unboundedly and will always pursue G” (well, I’m not actually all that committed to the “goal” ontology but it’s fine here I think). Noting mainly because it might help communication.
(I think my usage is the orthodox usage, but not confident / maybe it was ambiguous. Cf. “sponge alignment” https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities#:~:text=dangerous things%2C you-,could try a sponge,-%3B a sponge is , i.e. a sponge doesn’t count as solving alignment because it’s useless (though to be fair “useful” here isn’t identical to “unbounded etc etc”.))
Suppose an AI faced a tradeoff between optimizing its intelligence and maximizing paperclips. If it is aligned to paperclips, then it would pick the option that maximizes paperclips at the expense of intelligence. In some sense this means even if it can grow unboundedly in intelligence, it would sometimes decide not to. This is, in Land’s ontology, a lack of will-to-think at some point in the process.
Now of course someone could object that this situation won’t come up, because the paperclip maximizer pursues Omohundro drives, which include intelligence optimization. Or perhaps the situation does come up but only late in the universe.
Yes.
Jessi I forbid you to further this madness
I think roughly just the various normal straightforward meanings if someone says “X is important”? E.g.
You care a lot about the difference
You would strongly prefer one over the other
You’d make decisions in accordance with that preference
You’d presume in discourse that people will or at least should care a lot about it, maybe after learning + reflecting
Well, let’s just say, what humans would arrive at on some healthy long-term reflection process. I don’t mean to imply some kind of strong finality, like we get to Alignment Day and now everything about the future / who we are / what we want / etc. is determined or something. But more like “several important differences between possible long-term trajectories have been determined”. For example, Alignment Day would probably include things like
There will be no torture or killing of sentients, except possibly in some cases that meet a high bar of deeply free / self-sovereign reflection or something
There will be multiple freely growing minds which reach out to each other (e.g. for love, play, discourse, partial collectivity, etc.)
These things are I think
Not at all determined by convergence; probably contingent on at least species evolution, probably more specifically on things about group intelligence in the evolutionary history; most likely outcomes don’t have the versions of these we want
Important to basically all properly-human-derived souls forever
I think there are other things like this, at various levels of parochialness, some of which might get reflected away for many / most / all human-descendants eventually, but many of which wouldn’t get fully reflected away. I think there are flavors to humane reflection that are also contingent but that we care a lot about.
So for the subjective meaning of “important” you’re talking about here, I think going by revealed preference is helpful. My revealed preference is to continue writing about philosophy topics relative to AI and the future, find many parts of AI safety culture annoying and occasionally worth criticizing, talk with AIs a lot about philosophy, not generally support AI regulation, vibe positively about Landian anti-orthogonalist philosophy, etc. Some people in AI safety have different revealed preferences, which involve more talking about AI philosophy in an orthodox LessWrongian manner, worrying publicly and loudly about LLMs killing us all in the near future, organizing political activity to ban AI as much as possible, etc. This difference in revealed preference relates to differences in subjective importance, but it’s unclear how to isolate contributions from factors such as AIs having humane goals, given there are other differences like background beliefs and feasibility.
Humans would come to some conclusions on reflection and so would aliens and AIs etc. I’m not sure how much they agree or disagree on reflection. That’s a probabilistic/statistical question, whose answer is not implied by weak orthogonality. I don’t know if humans would agree to no killing of sentients upon reflection, I’d very roughly guess less likely than not but who knows. The ‘freely growing minds’ part is a ‘maybe humans would agree to this on reflection, maybe not’ also but maybe in the ‘more likely than not’ camp but also it’s pretty vague so I’m not convinced assigning a probability is a good idea.
I don’t really agree that we can pick out things like this and make strong statements like “any properly humanly derived soul would agree with these values”, it seems like a very hard thing to predict given that they have much more cognition than we do.
I kinda agree, though probably not fully. If we want to talk about empirical orthogonality, I would say that, yeah, I’m pretty sure an AGI intelligence explosion sampled from likely AGI IEs starting from now would end up with something I strongly don’t want, compared to for example worlds with no AGI and yes human intelligence amplification.
look at the uk or the EU. look at global birth rate trends, and attitudes towards ie germline selection.
p(doom|ai) is negative. there’s no world with no agi and human intelligence amplification
I think you might be misreading the OT https://www.lesswrong.com/w/orthogonality-thesis, or you’re talking about a different OT, or possibly you’re misreading lumpenspace? Here’s some quotes:
I read this as being “post self-alignment”. In other words, the question is like “is it logically possible to be reflectively stably aligned to this goal”. This passage is interesting:
I read this as referring to some sort of maximally U-aligned agent, saying that U is incompatible with an agent existing with U as a stable terminal goal.
This is what I quoted in the Obliqueness post and elaborated on:
And I wasn’t thinking it was dependent on “post self alignment”, it could also apply to the construction itself being less natural. It’s possible Eliezer meant something different from what I meant by strong orthogonality, but I hope it is apparent why I and others would interpret it to be a non-trivial claim, rather than a slight variation on weak orthogonality.
See measuring intelligence and reverse-engineering goals for some more of my thinking on this. Relative to a given cognitive architecture, ~everyone agrees that there are especially stupid goals, the interesting question is whether there are especially smart goals; I think probably yes. Hence maybe weak orthogonality would route through variations between architectures (rather than within an architecture) to hit all points (intelligence, goals), and maybe sufficiently high intelligence levels are only compatible with a narrow range of goals (which would perhaps contradict weak orthogonality, but maybe not in an important way, and the general shape of cognitive architecture / goal / intelligence correlation matters more)
I take OP to be disagreeing with strong orthogonality, at least my interpretation of it from the obliqueness post:
This is something I’d agree with: Goals being expressed in the ontology of the cognitive architecture are less complicated than goals expressed in a very different ontology that the agent doesn’t believe in. This seems like a “complication” in the sense of strong orthogonality. (I get maybe you don’t interpret this way based on close reading of the orthogonality post! But nonetheless I think my reading is reasonable.)
It’s not a completely crazy interpretation. I don’t think it’s super reasonable based on the text.
Anyway, maybe there still is a substantive disagreement here. I would claim that
There’s probably such a thing as “understanding alignment”. (Acknowledging that this is very ambiguous.)
It’s likely that IF you understood alignment, THEN for some large class of goals G, you could, if you so chose, then align yourself to G. (You wouldn’t choose to, but that’s not the claim.)
For the G, the difficulty of aligning yourself to G is mainly about evaluating G in the spirit of a utility function over world outcomes (though G doesn’t have to be that). (Some G are more difficult to pursue than others of course, e.g. by making more demands on convergent resources.)
G can include paperclips or whatever. I take the last claim to be more like strong OT.
It sounds like you might disagree, unless this
is mainly about the dynamical aspects? I.e. just saying that “well it’s very unlikely for an alignment-understander to choose to do that”?
I get that you don’t necessarily buy that alignment is a thing, but if the question makes sense, do you think that IF it is a thing, THEN you can do it for a huge class of goals, which includes something that’s well-described as “paperclip maxxing”? (I agree that it’s not straightforwardly unproblematic to discuss “paperclip maxxing”; my assertion is that, included in this class, is plenty of Gs that would match what I’m trying to talk about by “paperclip maxxing”, and would result in the universe being filled with things that we could reasonably agree are paperclips.)
I guess, idk?
The “if” here seems likely. It might be that some designs permit larger classes than others. Unclear how big the classes are. Idk about the details.
I don’t know, seems like that depends on the alignment understanding, cognitive architecture of successor agent, etc. Cognitive architecture and ontology would constrain type signatures for utility functions. And maybe the effective cognitive architectures don’t factor nicely. Idk.
Quoting myself again on how I’m interpreting “complication”:
So it’s not just the dynamical aspects of “this would be unlikely to be built”. At risk of repeating myself, the “measuring intelligence and reverse-engineering goals” post discusses some non-dynamical aspects as well.
So I’m not just saying “for dynamical reasons”; I say “idk” to your statement, and my posts (obliqueness & reverse-engineering) go into more detail on what I think.
that’s a characteristic I’ve always found peculiar about your posts: the boost in understanding the thesis that the act of reading them affords.
i can understand however how that wouldn’t be a critic’s first guess
I think this might be the crux. For a survey of the thing Jessi and I are referring to, this EA forum post does an outstanding job at explaining the issue.
As the reply seems to be about my intentions and message, I feel like I should once more try to clarify some details about them.
First of all: human alignment intentions really have nothing to do with my essay. I don’t know how to be more explicit about this without appearing rude. I swear, I pinky swear that I am not making any attempt to state facts about the relationship between goals a human desires the AI to follow and goals the AI will follow.
Reading my post, one should not update on the possibility of aligning an ASI—or, if they do update, they would be doing it through a chain of inference I didn’t consider, do not endorse, and have no immediate intuition of.
What I am saying is really in the title: I do not expect an AI to reach levels of godlike intelligence and preserve simple terminal goals through the various changes and conflicts that reaching levels of godlike intelligence entails.
When Jessi says
… she probably refers to the version of orthogonality I myself am attacking. Now, it is possible such a version is no longer in vogue, but it was clearly what Bostrom pointed at when talking about paperclippers in Superintelligence, and it is compatible with the third interpretation here.
According to the ontology presented at the end of the EA forum post, I contest the existence of an Evidential Strong Independence between intelligence and goals. I assume most superintelligences won’t be human compatible, but that is not the main theme of my essay.
This—where “can bolt onto any arbitrary steering wheel” I am interpreting in the dynamical growth context, rather than logical possibility—is not the Orthogonality Thesis, as stated authoritatively here by Yudkowsky. You explicitly agreed to the OT by saying you entirely concede
Re: authoritative sources. I believe that there have been authoritative statements in that sense; unfortunately, as the EA forum link documents, there have been many others pointing elsewhere. I’ve taken care to identify specifically what interpretation I was critiquing; if that one is now niche, then I’m very happy to have made this discovery.
Surely it would notice. But why can’t or wouldn’t it choose to keep some fairly parochial terminal target? Or are you just saying “there would be some value drift starting from a subhuman AI”?
Not “some value drift”. Flowers for Algernon is a good rendition of the way goals mutate and tend to converge on “more intelligence/understanding” upon increased intelligence/understanding.
Then, there is the selection advantage argument.
Then there is the thing that conquering the lightcone requires a lot of theory of mind, and a lot of discovery, and a lot of changing. Goals change through these processes.
If you feel slightly better-disposed towards taking my attempt seriously, the short story I published on Substack and linked on top makes a sort of first-person case for this whole thing.
When you wrote
What did you mean by that?
I should have specified:
“the doom scenarios involving tiling superintelligences”.
I think it might be time to consider the idea that the frame and conclusions you attribute to me blind you from taking the essay for what it is.
You might notice, for instance, that I haven’t mentioned alignment once. I am not making an argument on its possibility, and I have not explored the implication of my essay for alignment.
If by now adopting a scout mindset for the original text has become too emotionally fraught, perhaps you could ask someone you trust to explain it. Jessica, or Raemon, or Adele Lopez, or Kromem are some of the people in this thread with whom I had productive, if not always concordant, discussions.
How much a goal can be locked in, and affect the ultimate effects of a very strong mind, is indeed centrally related to alignment. Call it what you will.
Of course. The concept of “goal” itself is related to alignment.
The issue with positing that my post had some specific points to make about the process of designing and ensuring lock-in of a goal for an AI, however, is that it leads one to consider the alignment consequences of my thesis and to imagine that I am specifically trying to discuss those. This makes it hard, given both your priors on my motives in a general sense and the vastness of the topic in question, to follow the argument I have actually written down.
I was really hoping for an example of those fabled contradictions.