habryka comments on Moral Alignment: An Idea I’m Embarrassed I Didn’t Think of Myself

habryka 19 Jun 2025 1:28 UTC
11 points
6
I feel like there is a very natural bottleneck for auditing here, namely the relevant AI instructions, and I think this heavily pushes towards simple principles.
I find the alternative, that you would end up in a world where human values are successfully represented, but highly unequally, without a bunch of people freaking out and racing and then ultimately sacrificing the future, also pretty implausible. I think the default outcome in most of those worlds is that you don’t get any good agreement and consequently mostly just lose it in collateral damage.
I think there is some chance you end up with lots of intermediate bargaining happening, facilitated by semi-empowered AI systems, though my guess is those would also, for alignment reasons, in good worlds, favor extremely Schelling solutions above all the other options. Like, I don’t think Claude’s present personality is that much evidence about what will happen after a lot more RL and RSI, but it seems clear to me Claude would end up choosing some set of instructions that is that cosmopolitan.
I also don’t really get it. Direct democracies exist. We have actually ended up in a situation where “one person one vote” is really surprisingly close to the reality of how we govern humanity. Why such complete dismissal of the idea of extending it one (relatively small) step further?
- sunwillrise 19 Jun 2025 6:50 UTC
  9 points
  −4
  Parent
  Direct democracies exist. We have actually ended up in a situation where “one person one vote” is really surprisingly close to the reality of how we govern humanity.
  Not by the benevolence of the butcher, but because of the self-interest of liberal and (mostly) Western governments. In our current regime, human labor and intellectual output are simply too economically valuable to waste, meaning types of government that maximally allow them to flourish (liberal, constitutional, broadly capitalistic) get an edge, small at first but compounding over time to become decisive. But it’s not logically required for this to continue into the future.^[1]
  I don’t claim to have a complete model here, of course. “Where do (did?) stable, cooperative institutions come from?” seems relevant, to an extent.
  But consider this as an illustrative example: the US famously granted PNTR to China in 2000 and supported China’s accession to the WTO the following year. Beyond economic matters and the benefits of greater abundance and lower prices, proponents of these moves, such as President Clinton and House Speaker Hastert, argued increased trade and development would expose China to the wealth and prosperity of the West. When confronted with Western culture and the superiority of its living standards, China’s population would demand genuine democracy alongside “decent labor standards, a cleaner environment, human rights and the rule of law.”
  And people mostly believed Clinton and Hastert! Their arguments really caught on. Indeed, people at the time looked at Japan and (especially) South Korea as examples of their thesis being proven correct. But as Matt Yglesias ably explained:
  This idea that trade, development, and democratization would all move together was always controversial. But from what I can remember of the debates at the time, even the sharpest critics of trade with China underestimated exactly how wrong Clinton would be about this.
  For starters, it proved much easier on a technical level to censor the internet than I think non-technical people realized 20 to 25 years ago. But what’s worse is that modern technology, especially since the growth of the smartphone industry, is basically a huge surveillance machine. In the west, that machine is basically used for targeted advertising, which can sometimes feel “creepy” but that I don’t think has a ton of real downsides. But in the People’s Republic of China, it’s been used to craft a more intrusive authoritarian state than the worst dictators of the 20th century could have dreamed of.
  It was precisely the rise of technology that empowered the few at the expense of the many, by breaking the feedback loop of reality → citizens’ beliefs → citizens’ actions → reality that had made “empowering the public” part of the government’s self-interest if it wanted economic growth. In the past, China had had neither public empowerment nor economic prosperity.^[2] Around the early 2000s, it was able to move towards the latter without needing the former.
  1. ^
    Also, there are historical counterexamples, à la Singapore under Lee Kuan Yew
  2. ^
    This cursory analysis skips over the changes under Deng’s regime, for purposes of time
- samuelshadrach 19 Jun 2025 6:43 UTC
  1 point
  0
  Parent
  Empirical datapoint: We don’t run referendums on whether to fire nukes.
  - habryka 19 Jun 2025 6:54 UTC
    4 points
    2
    Parent
    Feels very different, since MAD means you really need the authority to launch nukes on a faster turnaround than a global referendum. But deciding what values you want to give an AI seems like it would inherently involve much less time pressure (there might be human-created reasons for time pressure, like arms race dynamics, but I expect in the worlds where you are rushing forward so quickly that you have to make decisions about your AI’s values remotely at the speed at which you have to make the decisions to launch nukes, you have basically no shot at surviving and propagating human values anyways).
    - samuelshadrach 20 Jun 2025 12:35 UTC
      1 point
      0
      Parent
      We don’t have a referendum on any country’s first or second strike policies either.
      
      I’m basically saying in practice we rarely have referendums on anything, and getting one to happen requires an unusual amount of coordinated rebellion against whoever the current leader is.
      
      It’s usually a handful of elites who get votes or money and then do whatever. Selecting a leader is already the result of this whole power struggle.
      
      A leader will just say that if you don’t like their values then you shouldn’t have voted for them.
      
      Another datapoint: how social media gets governed under Trump or Biden admin.
- M. Y. Zuo 19 Jun 2025 2:36 UTC
  −7 points
  −3
  Parent
  How do you know any of this to any degree of certainty?
  Has anyone even demonstrated a semi-rigorous 50/50 argument for why “racing” would lead to “ultimately sacrificing the future”? If not, then clearly anything more contentious or claimed to be more certain would have an even higher bar to clear.
  And that’s already pretty generous, probably below what it would take to publish in even third-tier journals in many fields.