This is IMO the one serious problem with using (Humanity’s) Coherent Extrapolated Volition as an AI alignment target: only humans get to be a source of values. Sure animals/aliens/posthumans/AIs are included to the extent humans care about them, but this doesn’t seem quite just.[1]
On the other hand, not very many humans want their values to be given equal weight to those of a mollusk. Hypothetically you could ask the AI to do some kind of sentience-weighting...? Or possibly humanity ought to be given the option to elevate sapient peers to be primary sources of values alongside humans via a consensus mechanism. It’s a tough moral problem, especially if you don’t assume the EA stance that animals have considerable moral value.[2]
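As a purely illustrative sketch of what “sentience-weighting” could operationally mean, here is a toy aggregation rule in Python where each being counts in proportion to an assumed sentience score. The groups, scores, populations, and preference values are invented placeholders; nothing here claims to measure sentience or represent anyone’s actual proposal.

```python
# Toy illustration of "sentience-weighted" preference aggregation.
# The sentience scores, populations, and preference values below are invented
# placeholders; nothing here claims to measure sentience or real preferences.

from typing import Dict

# Hypothetical sentience weights on an arbitrary 0..1 scale.
SENTIENCE_WEIGHT: Dict[str, float] = {
    "human": 1.0,
    "chimpanzee": 0.7,
    "chicken": 0.2,
    "mollusk": 0.05,
}

def aggregate_value(outcome_value_by_group: Dict[str, float],
                    population_by_group: Dict[str, float]) -> float:
    """Score an outcome as a sentience- and population-weighted sum of how much
    each group (by assumption) values that outcome."""
    total = 0.0
    for group, value in outcome_value_by_group.items():
        weight = SENTIENCE_WEIGHT.get(group, 0.0)
        total += weight * population_by_group.get(group, 0.0) * value
    return total

# Example: an outcome mildly good for humans but quite bad for chickens scores
# negatively once the weighted populations are counted.
populations = {"human": 8e9, "chicken": 25e9}
print(aggregate_value({"human": 0.1, "chicken": -0.5}, populations))  # about -1.7e9
```

Of course, every contentious question just reappears as the question of who sets the weights, which is exactly the kind of dispute the rest of this thread is about.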
Consider a scenario where we have a society of thinking, feeling beings that’s only 1/4th “human”—it would be clearly morally wrong for the other 3/4ths to not be a primary consideration of whatever AI singleton is managing things. Now, arguably CEV should solve this automatically—if we think some scenario caused by CEV is morally wrong, surely the AI wouldn’t implement that scenario since it doesn’t actually implement Humanity’s values? But that’s only true if some significant portion of idealized Humanity actually thinks there’s a moral problem with the scenario. I’m not sure that even an idealized version of Humanity agrees with your classic shrimp-loving EA about the moral value of animals, for example.
Maybe this is just a function of the fact that any AI built on general human values is naturally going to trample any small minority’s values that are incompatible with majority values (in this case hunting/fishing/eating meat). Obviously we can’t let every minority with totalizing views control the world. But creating a singleton AI potentially limits the chance for minorities to shape the future, which is pretty scary. (I don’t think a CEV AI would totally prevent a minority’s ability to shape the future/total value lock-in; if you as a minority opinion group could convince the rest of humanity to morally evolve in some way, it should update the AI’s behavior.)
What’s tough about giving moral status to animals? The issue here is that there’s massive incentive for minority opinion groups to force their values on the rest of humanity/the world by trying to control the alignment target for AI. Obviously everyone is going to say their minority values must be enforced upon the world in order to prevent moral catastrophe, and obviously a lot of these values are mutually exclusive—probably every possible alignment target is a moral catastrophe according to someone.
Man, whenever someone says this they sound to me like they are really confused between morality and game theory.
The reason you include only humans[1] in our collective Coherent Extrapolated Volition is that humans are a natural coalition that is ultimately in control of what any future AI systems care about. It’s a question of power, and the associated need to coordinate, not of caring.
You, personally, of course want exactly one, much narrower, set of values to make up the whole of CEV. Which is your own set of values! The same is true for every other human. If you care about other people, that will be reflected in your own CEV! If you care about animals, that will be reflected in your own CEV!
Having someone participate in the CEV of an extrapolated AI is not about “moral status”. It’s about who you have to coordinate with to get a thing built that cares about both of your values. Animals do not get included in the CEV because we have no need to coordinate with animals about the future of AI. Animals will very likely be considered moral patients by at least one human who will be included in the CEV, and so they will get their share of the future, if the people in control of it want that to happen.
[1] Or maybe powerful AI systems that you are cooperating with.
I am sympathetic on the object level to the kind of perspective you’re describing here, where you say we should do something like the extrapolated preferences of some set of bargainers. Two problems:
I think that when people talk about CEV, they’re normally not defining it in terms of humanity because humans are who you pragmatically have to coordinate with. E.g. I don’t see anything like that mentioned in the wiki page or in the original paper on a quick skim; I interpret Eliezer as referencing humanity because that’s who he actually cares about the values of. (I could be wrong about what Eliezer thinks here.)
I think it’s important to note that if you settle on CEV as a bargaining solution, this probably ends up with powerful people (AI company employees, heads of state) drastically overrepresented in the bargain, which is both unattractive and doesn’t seem to be what people are usually imagining when they talk about CEV.
I think this aligns straightforwardly with what Eliezer intended. See this section of the Arbital (now imported to LW!) CEV page (emphasis added):
Because maybe not everyone on Earth cares* about animals even if your EV would in fact care* about them, and to avoid a slap-fight over who gets to rule the world, we’re going to settle this by e.g. a parliamentary-style model in which you get to expend your share of Earth’s destiny-determination on protecting animals.
To expand on this last consideration, we can reply: “Even if you would regard it as more just to have the right animal-protecting outcome baked into the future immediately, so that your EV didn’t need to expend some of its voting strength on assuring it, not everyone else might regard that as just. From our perspective as programmers we have no particular reason to listen to you rather than Alice. We’re not arguing about whether animals will be protected if a minority vegan-type subpopulation strongly want* that and the rest of humanity doesn’t care*. We’re arguing about whether, if you want* that but a majority doesn’t, your EV should justly need to expend some negotiating strength in order to make sure animals are protected. This seems pretty reasonable to us as programmers from our standpoint of wanting to be fair, not be jerks, and not start any slap-fights over world domination.”
This third reply is particularly important because taken in isolation, the first two replies of “You could be wrong about that being a good idea” and “Even if you care about their welfare, maybe you wouldn’t like their EVs” could equally apply to argue that contributors to the CEV project ought to extrapolate only their own volitions and not the rest of humanity:
We could be wrong about it being a good idea, by our own lights, to extrapolate the volitions of everyone else; including this into the CEV project bakes this consideration into stone; if we were right about running an Everyone CEV, if we would predictably arrive at that conclusion after thinking about it for a while, our EVs could do that for us.
Not extrapolating other people’s volitions isn’t the same as saying we shouldn’t care. We could be right to care about the welfare of others, but there could be some spectacular horror built into their EVs.
The proposed way of addressing this was to run a composite CEV with a contributor-CEV check and a Fallback-CEV fallback. But then why not run an Animal-CEV with a Contributor-CEV check before trying the Everyone-CEV?
One answer would go back to the third reply above: Nonhuman mammals aren’t sponsoring the CEV project, allowing it to pass, or potentially getting angry at people who want to take over the world with no seeming concern for fairness. So they aren’t part of the Schelling Point for “everyone gets an extrapolated vote”.
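To make the quoted “parliamentary-style model” a bit more concrete, here is a minimal toy sketch in Python of the idea that every participant gets the same budget of voting strength and chooses which issues to spend it on. The voters, issues, and numbers are all invented for illustration; this is not Eliezer’s actual proposal in any detail.

```python
# Toy "parliamentary" allocation: each participant gets an equal budget of
# voting strength and decides how to spend it across the issues they care about.
# Purely illustrative; voters, issues, and numbers are made up.

from collections import defaultdict

VOTING_BUDGET = 1.0  # every participant gets the same total strength

# Hypothetical allocations: fraction of each voter's budget per issue.
allocations = {
    "vegan_alice": {"protect_animals": 0.8, "space_settlement": 0.2},
    "bob":         {"space_settlement": 0.9, "protect_animals": 0.1},
    "carol":       {"digital_minds_rights": 1.0},
}

issue_support = defaultdict(float)
for voter, spend in allocations.items():
    assert abs(sum(spend.values()) - VOTING_BUDGET) < 1e-9  # nobody can overspend
    for issue, fraction in spend.items():
        issue_support[issue] += fraction * VOTING_BUDGET

# A minority that cares a lot about animals "buys" protection by spending most
# of its strength there, rather than having that outcome baked in for free.
print(dict(issue_support))
```

The point of the quoted passage is precisely this: the animal-protecting outcome is something a minority pays for out of its equal share of voting strength, rather than something hard-coded into the target before anyone votes.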
Responding to the other part:
I think it’s important to note that if you settle on CEV as a bargaining solution, this probably ends up with powerful people (AI company employees, heads of state) drastically overrepresented in the bargain, which is both unattractive and doesn’t seem to be what people are usually imagining when they talk about CEV.
Ultimately you have a hard bargaining problem here, but I don’t see a way around it. One of the central motivations for CEV has always been that it is a Schelling proposal that avoids accidentally destroying the future because we fail to coordinate, and “all of humanity equally” is, at least in current society, the most Schelling coordination point, I think (and e.g. also kind of one of the central constitutional principles under which things like the US are organized, though it’s not a perfect match).
Thanks heaps for pointing out the Eliezer content!
I am very skeptical that you’ll get “all of humanity equally” as the bargaining solution, as opposed to some ad hoc thing that weighs powerful people more. I’m not aware of any case where the solution to a bargaining problem was “weigh the preference of everyone in the world equally”. (This isn’t even how most democracies work internally!)
I think it’s the option I would throw my weight behind, largely because the difference between (as Eliezer says) “starting a slap fight over world domination” and “having any kind of reasonable weight allocation” is so enormously big by my lights, that I really wouldn’t want to quibble over the details.
If there is another more Schelling option I would also be up for that, but I do have a feeling that as the details get more complicated, the ability to actually coordinate on any specific option, as opposed to fighting over which option it should be by racing towards getting the god-machine first, gets a lot worse. The Schellingness really weighs heavily here for me, and “each living human gets one vote” seems like the most Schelling to me, though IDK, maybe someone can propose something even better and then I would also be happy to back that.
I think it’s very unlikely that (conditioned on no AI takeover) something similar to “all humans get equal weight in deciding what happens next” happens; I think that a negotiation between a small number of powerful people (some of whom represent larger groups, e.g. nations) that ends with an ad hoc distribution seems drastically more likely. The bargaining solution of “weight everyone equally” seems basically so implausible that it seems pointless to even discuss it as a pragmatic solution.
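One way to see the intuition that bargaining rarely lands on equal weights: in a textbook Nash bargaining setup, whoever is better off if negotiations break down captures more of the surplus. Below is a toy two-party example with made-up numbers (it is not something from the comment above, just a standard-textbook sketch).

```python
# Toy Nash bargaining over splitting one unit of "influence over the future".
# If party A (say, a powerful lab or state) has a better fallback than party B,
# the Nash solution gives A more than half. All numbers are invented.

def nash_split(disagreement_a: float, disagreement_b: float,
               total: float = 1.0) -> tuple[float, float]:
    """Maximize (x_a - d_a) * (x_b - d_b) subject to x_a + x_b = total.
    Closed form: each side gets its fallback plus half the remaining surplus."""
    surplus = total - disagreement_a - disagreement_b
    return disagreement_a + surplus / 2, disagreement_b + surplus / 2

print(nash_split(0.0, 0.0))    # equal fallbacks   -> (0.5, 0.5)
print(nash_split(0.4, 0.05))   # unequal fallbacks -> roughly (0.675, 0.325)
```

Equal weights fall out only when the fallback positions are symmetric, which is part of why the disagreement here is about whether “one person, one vote” can be made the Schelling point despite asymmetric power.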
I feel like there is a very natural bottleneck for auditing here, namely the relevant AI instructions, and I think this heavily pushes towards simple principles.
I find the alternative, where human values end up successfully represented but highly unequally, without a bunch of people freaking out and racing and then ultimately sacrificing the future, also pretty implausible. I think the default outcome in most of those worlds is that you don’t get any good agreement and consequently mostly just lose the future in collateral damage.
I think there is some chance you end up with lots of intermediate bargaining happening, facilitated by semi-empowered AI systems, though my guess is those would also, for alignment reasons, in good worlds, favor extremely Schelling solutions over the other options. Like, I don’t think Claude’s present personality is that much evidence about what will happen after a lot more RL and RSI, but it seems clear to me that Claude would end up choosing some set of instructions that is roughly that cosmopolitan.
I also don’t really get it. Direct democracies exist. We have actually ended up in a situation where “one person one vote” is really surprisingly close to the reality of how we govern humanity. Why such complete dismissal of the idea of extending it one (relatively small) step further?
Direct democracies exist. We have actually ended up in a situation where “one person one vote” is really surprisingly close to the reality of how we govern humanity.
Not by the benevolence of the butcher, but because of the self-interest of liberal and (mostly) Western governments. In our current regime, human labor and intellectual output are simply too economically valuable to waste, meaning types of government that maximally allow them to flourish (liberal, constitutional, broadly capitalistic) get an edge, small at first but compounding over time to become decisive. But it’s not logically required for this to continue into the future.[1]
I don’t claim to have a complete model here, of course. “Where do (did?) stable, cooperative institutions come from?” seems relevant, to an extent.
But consider this as an illustrative example: the US famously granted China PNTR in 2000 and supported its accession to the WTO the following year. Beyond economic matters and the benefits of greater abundance and lower prices, proponents of these moves, such as President Clinton and House Speaker Hastert, argued increased trade and development would expose China to the wealth and prosperity of the West. When confronted with Western culture and the superiority of its living standards, China’s population would demand genuine democracy alongside “decent labor standards, a cleaner environment, human rights and the rule of law.”
And people mostly believed Clinton and Hastert! Their arguments really caught on. Indeed, people at the time looked at Japan and (especially) South Korea as examples of their thesis being proven correct. But as Matt Yglesias ably explained:
This idea that trade, development, and democratization would all move together was always controversial. But from what I can remember of the debates at the time, even the sharpest critics of trade with China underestimated exactly how wrong Clinton would be about this.
For starters, it proved much easier on a technical level to censor the internet than I think non-technical people realized 20 to 25 years ago. But what’s worse is that modern technology, especially since the growth of the smartphone industry, is basically a huge surveillance machine. In the west, that machine is basically used for targeted advertising, which can sometimes feel “creepy” but that I don’t think has a ton of real downsides. But in the People’s Republic of China, it’s been used to craft a more intrusive authoritarian state than the worst dictators of the 20th century could have dreamed of.
It was precisely the rise of technology that empowered the few at the expense of the many, by breaking the feedback loop of reality → citizens’ beliefs → citizens’ actions → reality that had made “empowering the public” part of the government’s self-interest if it wanted economic growth. In the past, China had had neither public empowerment nor economic prosperity.[2] Around the early 2000s, it was able to move towards the latter without needing the former.
[1] Also, there are historical counterexamples, a la Singapore under Lee Kuan Yew.
[2] This cursory analysis skips over the changes under Deng’s regime, for purposes of time.
Empirical datapoint: We don’t run referendums on whether to fire nukes.
Feels very different, since MAD means you really need the authority to launch nukes on a faster turnaround than a global referendum. But deciding what values you want to give an AI seems like it would require inherently much less time pressure (there might be human-created reasons for time pressure, like arms race dynamics, but I expect that in the worlds where you are rushing forward so quickly that you have to make decisions about your AI’s values remotely at the speed at which you have to make the decisions to launch nukes, you have basically no shot at surviving and propagating human values anyways).
We don’t have a referendum on any country’s first or second strike policies either.
I’m basically saying in practice we rarely have referendums on anything, and getting one to happen requires an unusual amount of coordinated rebellion against whoever the current leader is.
It’s usually a handful of elites who get votes or money and then do whatever. Selecting a leader is already the result of this whole power struggle.
A leader will just say that if you don’t like their values then you shouldn’t have voted for them.
Another datapoint: how social media gets governed under the Trump or Biden admins.
How do you know any of this to any degree of certainty?
Has anyone even demonstrated a semi-rigorous 50/50 argument for why “racing” would lead to “ultimately sacrificing the future”? If not, then clearly anything more contentious or claimed to be more certain would have an even higher bar to clear.
And that’s already pretty generous, probably below what it would take to publish into even third tier journals in many fields.
Ah, if your position is “we should only have humans as primary sources of values in the CEV because that is the only workable schelling point”, then I think that’s very reasonable. My position is simply that, morally, I think that schelling point is not what I’d want. I’d want human-like sapients to be included. (rough proxy: beings that would fit well in Star Trek’s Federation ought to qualify)
But of course you’d say it doesn’t matter what I (or vegan EAs) want because that’s not the schelling point and we don’t have a right to impose our values, which is a fair argument.
I think the point of the “weigh the preference of everyone in the world equally” position here is not in spite of, but because of, the existence of powerful actors who will try to skew the decision such that they or their group have maximal power. We (you and me) would rather this not happen, and I at least would like to team up with others who would rather this not happen, and we and those others have the greatest chance of slapping down those trying to take over the world by advocating for the obvious. That is, by advocating that we should all be equal.
If the vegans among us argue that animals’ preferences should be added to the pool, and the Mormons argue that God’s should be taken into account infinitely, and the tree-huggers that we should CEV the trees, and the Gaia lovers that we should CEV the earth, and the e/accs that we should CEV entropy, and the longtermists that future people should be added, and the near-termists that present people’s influence should be x times bigger than the future people’s, and the ancestor worshippers want to CEV their dead great-great-great-great-great-...-great grandfathers, and none will join unless their requirements are met, then we no longer have any hope of coordinating. We get the default outcome, and you are right: the default outcome is the powerful stomp on the weak.
My guess is that neither of us will hear about any of these discussions until after they’re finalized.
It sounds like there’s an implied “and therefore we have no influence over such discussions”. If so, then what are we arguing for? What does it matter if Julian Bradshaw and others think animals being left out of the CEV makes it a bad alignment target?
In either case, I don’t think we will only hear about these discussions after they’re finalized. The AI labs are currently aligning and deploying (internally and externally) their AI models through what is likely to be the same process they use for ‘the big one’. Those discussions are these discussions, and we are hearing about them!
What does it matter if Julian Bradshaw and others think animals being left out of the CEV makes it a bad alignment target?
I wasn’t arguing about this because I care what Julian advocates for in a hypothetical global referendum on CEV; I was just arguing for the usual reason of wanting to understand things better and cause others to understand them better, under the model that it’s good for LWers (including me) to have better models of important topics.
In either case, I don’t think we will only hear about these discussions after they’re finalized. The AI labs are currently aligning and deploying (internally and externally) their AI models through what is likely to be the same process they use for ‘the big one’. Those discussions are these discussions, and we are hearing about them!
My guess is that the situation around negotiations for control of the long run future will be different.
Habryka, do you at least agree that the majority of LWers who would be happy to define CEV if asked would not (if prompted) make the argument that the set of people included is intended as a compromise to make the bargaining easier?
Depends on the definition of “majority of LWers”. LW has tens of thousands of users. My guess is if you limit to the people who have written about CEV themselves you would get the right answer, and if you include people who have thought about it for like 10 minutes while reading all the other stuff you would get the wrong answer. If you take an expansive definition I doubt you would get an accurate answer for almost anything one could ask about.
Given that like half of the CEV article on Arbital makes approximately this point over and over again, my guess is most people who read that article would easily get it right.
I agree that in terms of game theory you’re right, no need to include non-humans as primary sources of values for the CEV. (barring some scenarios where we have powerful AIs that aren’t part of the eventual singleton/swarm implementing the CEV)
But I think the moral question is still worthwhile?
But I think the moral question is still worthwhile?
It’s definitely a very worthwhile question, and also probably a quite difficult one, which is why I would like to bring a superintelligence running CEV to bear on the question.
Less flippantly: I agree the question of how to treat animals and their values and preferences is important, but it does to me seem like the kind of question you can punt on until you are much smarter and in a much better position to answer it. The universe is long and I don’t see a need to rush this question.
No, I’m saying it might be too late at that point. The moral question is “who gets to have their CEV implemented?” OP is saying it shouldn’t be only humans, it should be “all beings everywhere”. If we implement an AI on Humanity’s CEV, then the only way that other sapient beings would get primary consideration for their values (not secondary consideration, where they’re considered only because Humanity has decided to care about their values) would be if Humanity’s CEV allows other beings to be elevated to primary value sources alongside Humanity. That’s possible, I think, but not guaranteed, and EAs concerned with e.g. factory farming are well within their rights to be concerned that those animals are not going to be saved any time soon under a Humanity’s CEV-implementing AI.
Now, arguably they don’t have a right as a minority viewpoint to control the value sources for the one CEV the world gets, but obviously from their perspective they want to prevent a moral catastrophe by including animals as primary sources of CEV values from the start.
Edit: confusion clarified in comment chain here.
I… don’t understand? I only care about my own values being included in the CEV. You only care about your own values (and you know, other sources of value correlated with your own) being included in the CEV. Why do I care if we include animals? They are not me. I very likely care about them and will want to help them, but I see absolutely no reason to make that decision right now in a completely irreversible way.
I do not want anyone else to get primary consideration for their values. Ideally it would all be my own! That’s literally what it means to care about something.
I don’t know what you are talking about with “they”. You, just as much as me, just want to have your own values included in the CEV.
I seem to have had essentially this exact conversation in a different comment thread on this post with the OP.
As a quick check, do you believe that a CEV that is 50% humans and 50% spiders is preferable to a CEV that is 100% humans? (A future with a lot of juicy beings wrapped in webs while the acid dissolves them from inside—seems to be something that spiders value a lot.)
No, although if the “juicy beings” are only unfeeling bugs, that might not be as bad as it intuitively sounds.
There’s a wrinkle to my posts here where partly I’m expressing my own position (which I stated elsewhere as “I’d want human-like sapients to be included. (rough proxy: beings that would fit well in Star Trek’s Federation ought to qualify)”) and partly I’m steelmanning the OP’s position, which I’ve interpreted as “all beings are primary sources of values for the CEV”.
In terms of how various preferences involving harming other beings could be reconciled into a CEV: yeah it might not be possible. Maybe the harmed beings are simulated/fake somehow? Maybe animals don’t really have preferences about reality vs. VR, and every species ends up in their own VR world...
I expect that the CEV of human values would indeed accord moral status to animals. But including humans-but-not-animals in the CEV still seems about as silly to me as including Americans-but-not-foreigners and then hoping that the CEV ends up caring about foreigners anyway.
I think you’ve misunderstood what I said? I agree that a human CEV would accord some moral status to animals, maybe even a lot of moral status. What I’m talking about is “primary sources of values” for the CEV, or rather, what population is the AI implementing the Coherent Extrapolated Volition of? Normally we assume it’s humanity, but OP is essentially proposing that the CEV be for “all beings everywhere”, including animals/aliens/AIs/plants/whatever.
I think we are on the same page, I was trying to agree with what you said and add commentary on why I’m concerned about “CEV with humans as the primary source of values”. Although I was only responding to your first paragraph not your second paragraph. I think your second paragraph also raises fair concerns about what a “CEV for all sentient beings” looks like.
It seems likely enough to me (for a ton of reasons, most of them enunciated here) that “the CEV of an individual human” doesn’t really make sense as a concept, let alone “the CEV of humanity” or even more broadly “the CEV of all beings everywhere.”
More directly though, the Orthogonality Thesis alone is sufficient to make “the CEV of all beings everywhere” a complete non-starter unless there are so few other kinds of beings out there that “the CEV of humanity” would likely be a good enough approximation of it anyway (if it actually existed, which I think it doesn’t).
I admit:
Human preferences don’t fully cohere, especially when extrapolated
There are many ways in which “Humanity’s CEV” is fuzzy or potentially even impossible to fully specify
But I think the concept has staying power because it points to a practical idea of “the AI acts in a way such that most humans think it mostly shares their core values”.[1] LLMs already aren’t far from this bar with their day-to-day behavior, so it doesn’t seem obviously impossible.
To go back to agreeing with you, yes, adding new types of beings as primary sources of values to the CEV would introduce far more conflicting sets of preferences, maybe to the point that trying to combine them would be totally incoherent. (predator vs. prey examples, parasites, species competing for the same niche, etc etc.) That’s a strong objection to the “all beings everywhere” idea. It’d certainly be simpler to enforce human preferences on animals.
I think of this as meaning the AI isn’t enforcing niche values (“everyone now has to wear Mormon undergarments in order to save their eternal soul”), is not taking obviously horrible actions (“time to unleash the Terminators!”), and is taking some obviously good actions (“I will save the life of this 3-year-old with cancer”). Obviously it would have to be neutral on a lot of things, but there’s quite a lot most humans have in common.