Buck comments on Moral Alignment: An Idea I’m Embarrassed I Didn’t Think of Myself

Buck 18 Jun 2025 20:49 UTC
15 points
11
Thanks heaps for pointing out the Eliezer content!
I am very skeptical that you’ll get “all of humanity equally” as the bargaining solution, as opposed to some ad hoc thing that weighs powerful people more. I’m not aware of any case where the solution to a bargaining problem was “weigh the preference of everyone in the world equally”. (This isn’t even how most democracies work internally!)
- habryka 18 Jun 2025 20:59 UTC
  24 points
  16
  Parent
  I think it’s the option I would throw my weight behind, largely because the difference between (as Eliezer says) “starting a slap fight over world domination” and “having any kind of reasonable weight allocation” is so enormously big by my lights, that I really wouldn’t want to quibble over the details.
  If there is another more Schelling option I would also be up for that, but I do have a feeling that as the details get more complicated, the ability to actually coordinate on any specific option, as opposed to fighting over which option it should be, by racing towards getting the god-machine first, gets a lot worse. The schellingness really weighs heavily here for me, and “each alive human one vote” seems like the most Schelling to me, though IDK, maybe someone can propose something even better and then I would also be happy to back that.
  - Buck 19 Jun 2025 1:05 UTC
    10 points
    0
    Parent
    I think it’s very unlikely that (conditioned on no AI takeover) something similar to “all humans get equal weight in deciding what happens next” happens; I think that a negotiation between a small number of powerful people (some of whom represent larger groups, e.g. nations) that ends with an ad hoc distribution seems drastically more likely. The bargaining solution of “weight everyone equally” seems basically so implausible that it seems pointless to even discuss it as a pragmatic solution.
    - habryka 19 Jun 2025 1:28 UTC
      11 points
      6
      Parent
      I feel like there is a very natural bottleneck for auditing here, which are the relevant AI instructions, and I think this heavily pushes towards simple principles.
      I find the alternative, that you would end up in a world where human values are successfully represented, but highly unequally, without a bunch of people freaking out and racing and then ultimately sacrificing the future, also pretty implausible. I think the default outcome in most of those worlds is that you don’t get any good agreement and consequently mostly just lose it in collateral damage.
      I think there is some chance you end up with lots of intermediate bargaining happening facilitated by semi-empowered AI systems, though my guess is those would also for alignment reasons, in good worlds, favor extremely schelling solutions above all the other options. Like, I don’t think Claude’s present personality is that much evidence about what will happen after a lot more RL and RSI, but it seems clear to me Claude would end up choosing some set of instructions that is that cosmopolitan.
      I also don’t really get it. Direct democracies exist. We have actually ended up in a situation where “one person one vote” is really surprisingly close to the reality of how we govern humanity. Why such complete dismissal of the idea of extending it one (relatively small) step further?
      - sunwillrise 19 Jun 2025 6:50 UTC
        9 points
        −4
        Parent
        Direct democracies exist. We have actually ended up in a situation where “one person one vote” is really surprisingly close to the reality of how we govern humanity.
        Not by the benevolence of the butcher, but because of the self-interest of liberal and (mostly) Western governments. In our current regime, human labor and intellectual output are simply too economically valuable to waste, meaning types of government that maximally allow them to flourish (liberal, constitutional, broadly capitalistic) get an edge, small at first but compounding over time to become decisive. But it’s not logically required for this to continue into the future.^[1]
        I don’t claim to have a complete model here, of course. “Where do (did?) stable, cooperative institutions come from?” seems relevant, to an extent.
        But consider this as an illustrative example: the US famously implemented PNTR with China in 1999 and supported China’s accession into the WTO a couple of years later. Beyond economic matters and the benefits of greater abundance and lower prices, proponents of these moves, such as President Clinton and House Speaker Hastert, argued increased trade and development would expose China to the wealth and prosperity of the West. When confronted with Western culture and the superiority of its living standards, China’s population would demand genuine democracy alongside “decent labor standards, a cleaner environment, human rights and the rule of law.”
        And people mostly believed Clinton and Hastert! Their arguments really caught on. Indeed, people at the time looked at Japan and (especially) South Korea as examples of their thesis being proven correct. But as Matt Yglesias ably explained:
        This idea that trade, development, and democratization would all move together was always controversial. But from what I can remember of the debates at the time, even the sharpest critics of trade with China underestimated exactly how wrong Clinton would be about this.
        For starters, it proved much easier on a technical level to censor the internet than I think non-technical people realized 20 to 25 years ago. But what’s worse is that modern technology, especially since the growth of the smartphone industry, is basically a huge surveillance machine. In the west, that machine is basically used for targeted advertising, which can sometimes feel “creepy” but that I don’t think has a ton of real downsides. But in the People’s Republic of China, it’s been used to craft a more intrusive authoritarian state than the worst dictators of the 20th century could have dreamed of.
        It was precisely the rise of technology that empowered the few at the expense of the many, by breaking the feedback loop of reality → citizens’ beliefs → citizens’ actions → reality that had made “empowering the public” part of the government’s self-interest if it wanted economic growth. In the past, China had had neither public empowerment nor economic prosperity.^[2] Around the early 2000s, it was able to move towards the latter without needing the former.
        ^
        Also, there are historical counterexamples, a la Singapore under Lee Kwan Yew
        ^
        This cursory analysis skips over the changes under Deng’s regime, for purposes of time
        What links here?
        sunwillrise's comment on A Conservative Vision For AI Alignment by Davidmanheim (21 Aug 2025 19:00 UTC; 32 points)
      - samuelshadrach 19 Jun 2025 6:43 UTC
        1 point
        0
        Parent
        Empirical datapoint: We don’t run referendums on whether to fire nukes.
        habryka 19 Jun 2025 6:54 UTC
        4 points
        2
        Parent
        Feels very different, since MAD means you really need the authority to launch nukes on a faster turnaround than a global referendum. But deciding what values you want to give an AI seems like it would require inherently much less time pressure (there might be human-created reasons for time pressure, like arms race dynamics, but I expect in the worlds where you are rushing forward so quickly that you have to make decisions about your AIs values remotely at the speed at which you have to make the decisions to launch nukes, you have basically no shot at surviving and propagating human values anyways).
        samuelshadrach 20 Jun 2025 12:35 UTC
        1 point
        0
        Parent
        We don’t have a referendum on any country’s first or second strike policies either.
        
        I’m basically saying in practice we rarely have referendums on anything, and getting one to happen requires an unusual amount of coordinated rebellion against whoever the current leader is.
        
        It’s usually a handful of elites who get votes or money and then do whatever. Selecting a leader is already the result of this whole power strugggle.
        
        A leader will just say that if you don’t like their values then you shouldn’t have voted for them.
        
        Another datapoint: how social media gets governed under Trump or Biden admin.
      - M. Y. Zuo 19 Jun 2025 2:36 UTC
        −7 points
        −3
        Parent
        How do you know any of this to any degree of certainty?
        Has anyone even demonstrated a semi-rigorous ⁵⁰⁄₅₀ argument for why “racing” would lead to “ultimately sacrificing the future”? If not, then clearly anything more contentious or claimed to be more certain would have an even higher bar to clear.
        And that’s already pretty generous, probably below what it would take to publish into even third tier journals in many fields.
  - Julian Bradshaw 18 Jun 2025 21:52 UTC
    4 points
    0
    Parent
    Ah, if your position is “we should only have humans as primary sources of values in the CEV because that is the only workable schelling point”, then I think that’s very reasonable. My position is simply that, morally, I think that schelling point is not what I’d want. I’d want human-like sapients to be included. (rough proxy: beings that would fit well in Star Trek’s Federation ought to qualify)
    But of course you’d say it doesn’t matter what I (or vegan EAs) want because that’s not the schelling point and we don’t have a right to impose our values, which is a fair argument.
    What links here?
    Julian Bradshaw's comment on Moral Alignment: An Idea I’m Embarrassed I Didn’t Think of Myself by Gordon Seidoh Worley (18 Jun 2025 19:31 UTC; 4 points)
- Garrett Baker 18 Jun 2025 21:15 UTC
  17 points
  19
  Parent
  I think the point of the “weigh the preference of everyone in the world equally” position here is not in spite of, but because of the existence of powerful actors who will try to skew the decision such that they or their group have maximal power. We (you and me) would rather this not happen, and I at least would like to team up with others who would rather this not happen, and we others can have the greatest chance of slapping down those trying to take over the world by advocating for the obvious. That is, by advocating that we should all be equal.
  
  If the vegans among us argue that animals’ preferences should be added to the pool, and the mormons argue that God’s should be taken into account infinitely, and the tree-hugggers that we should CEV the trees, and the Gaia lovers that we should CEV the earth, and the e/accs that we should CEV entropy, and the longtermists that future people should be added, and the near-termists that present people’s influence should be x-times bigger than the future peoples, and the ancestor worshippers want to CEV their dead great-great-great-great-great-...-great grandfathers, and none will join unless their requirements are met, then now we no longer have any hope of coordinating. We get the default outcome, and you are right, the default outcome is the powerful stomp on the weak.
  - Buck 19 Jun 2025 2:53 UTC
    3 points
    −1
    Parent
    My guess is that neither of us will hear about any of these discussions until after they’re finalized.
    - Garrett Baker 19 Jun 2025 15:30 UTC
      5 points
      0
      Parent
      It sounds like there’s an implied “and therefore we have no influence over such discussions”. If so, then what are we arguing for? What does it matter if Julian Bradshaw and others think animals being left out of the CEV makes it a bad alignment target?
      
      In either case, I don’t think we will only hear about any of these discussions until after they’re finalized. The AI labs are currently aligning and deploying (internally and externally) their AI models through what is likely to be the same process they use for ‘the big one’. Those discussions are these discussions, and we are hearing about them!
      - Buck 19 Jun 2025 15:47 UTC
        3 points
        0
        Parent
        What does it matter if Julian Bradshaw and others think animals being left out of the CEV makes it a bad alignment target?
        I wasn’t arguing about this because I care what Julian advocates for in a hypothetical global referendum on CEV, I was just arguing for the usual reason of wanting to understand things better and cause others to understand them better, under the model that it’s good for LWers (including me) to have better models of important topics.
        In either case, I don’t think we will only hear about any of these discussions until after they’re finalized. The AI labs are currently aligning and deploying (internally and externally) their AI models through what is likely to be the same process they use for ‘the big one’. Those discussions are these discussions, and we are hearing about them!
        My guess is that the situation around negotiations for control of the long run future will be different.