I also think there is a genuine alternative in which power never concentrates to such an extreme degree.
I don’t see it.
The distribution of power post-ASI depends on the constraint/goal structures instilled into the (presumed-aligned) ASI. That means the entity in whose hands all power is concentrated is the group of people who decide what goals/constraints to instill into the ASI, in the time prior to the ASI’s existence. What people could those be?
By default, it’s the ASI’s developers, e. g., the leadership of the AGI labs. “They will be nice and put in goals/constraints that make the ASI loyal to humanity, not to them personally” is more or less isomorphic to “they will make the ASI loyal to them personally, but they’re nice and loyal to humanity”; in both cases, they have all the power.[1]
If the ASI’s developers go and inform the US President about it in a faithful way[2], the overwhelming power will end up concentrated in the hands of the President/the extant powers that be. Either by way of ham-fisted nationalization (with something isomorphic to putting guns to the developers’ (families’) heads), or by subtler manipulation where, e. g., everyone is forced to LARP believing in the US’s extant democratic processes (which the President would be actively subverting, especially if that’s still Trump), with this LARP being carried far enough to end up in the ASI’s goal structure.
The stories in which the resultant power struggles shake out in a way that leads to humanity-as-a-whole being given true, meaningful input into the process (e. g., the slowdown ending of AI-2027) seem incredibly fantastical to me. (Again, especially given the current US administration.)
Yes, acting in ham-fisted ways would be precarious and have various costs. But I expect the USG to be able to play it well enough to avoid actual armed insurrection (especially given that the AGI concerns are currently not very legible to the public), and inasmuch as they actually “feel the AGI”, they’d know that nothing less than that would ultimately matter.
If the ASI’s developers somehow go public with the whole thing and attempt to unilaterally set up some actually-democratic process for negotiating the ASI’s goal/constraint structures, then either (1) the US government notices it, realizes what’s happening, takes control, and subverts the process, or (2) they set up some very broken process – as broken as the US electoral procedures that end up with Biden and Trump as the top two choices for president – and that process outputs some basically random, potentially actively harmful result (again, something as bad as Biden vs. Trump).
Fundamentally, the problem is that there’s currently no faithful mechanism of human preference agglomeration that works at scale. That means both that (1) it’s currently impossible to let humanity-as-a-whole actually weigh in on the process, and that (2) there are no extant outputs of such a mechanism around: none of the people and systems that currently hold power are aligned to humanity in a way that generalizes to out-of-distribution events (such as being given godlike power).
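As a toy illustration of why even small-scale preference aggregation is a genuinely hard problem (this is just the classic Condorcet paradox, sketched in a few lines of Python with made-up options A/B/C): with three voters holding cyclic preferences, pairwise majority voting produces no consistent group ranking at all.

```python
from itertools import permutations

# Classic Condorcet paradox: three voters with cyclic preferences.
# Each ballot ranks the options from most to least preferred.
ballots = [
    ("A", "B", "C"),
    ("B", "C", "A"),
    ("C", "A", "B"),
]

def majority_prefers(x, y):
    """True iff a strict majority of ballots rank x above y."""
    wins = sum(1 for b in ballots if b.index(x) < b.index(y))
    return wins > len(ballots) / 2

# Pairwise majority says A > B, B > C, and C > A -- a cycle,
# so "what the group prefers" isn't even well-defined here.
for x, y in permutations("ABC", 2):
    if majority_prefers(x, y):
        print(f"majority prefers {x} over {y}")
```

And that’s before getting into strategic voting, scale, or the question of who administers the vote.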
Thus, I can see only three options:
1. Power is concentrated in some small group’s hands, with everyone then banking on that group acting in a prosocial way, perhaps by asking the ASI to develop a faithful scalable preference-agglomeration process. (I. e., we use a faithful but small-scale human-preference-agglomeration process.)
2. Power is handed off to some random, unstable process. (Either a preference-agglomeration system as unfaithful as the US’s voting systems, or “open-source the AGI and let everyone in the world fight it out”, or “sample a random goal system and let it probably tile the universe with paperclips”.)
3. ASI development is stopped and some different avenue of intelligence enhancement (e. g., superbabies) is pursued; one that’s more gradual and inherently more decentralized.
[1] A group of humans that compromises on making the ASI loyal to humanity is likely more realistic than a group of humans that is actually loyal to humanity. E. g., because the group has some psychopaths and some idealists, and all the psychopaths have to individually LARP being prosocial in order to not end up with the idealists ganging up against them, with this LARP then being carried far enough to end up in the ASI’s goals. But this still involves that small group having ultimate power; it still involves the future being determined by how the dynamics within that small group shake out.
[2] Rather than keeping him in the dark or playing him, which reduces to Scenario 1.
the entity in whose hands all power is concentrated is the group of people who decide what goals/constraints to instill into the ASI
Its goals could also end up mostly forming on their own, regardless of the intent of those attempting to instill them, with indirect influence from all the voices in the pretraining dataset.
Consider what it means for power to “never concentrate to such an extreme degree”, as a property of the civilization as a whole. This might also end up being a property of an ASI as a whole.
I think there is a fourth option (although it’s not likely to happen):
1. Indefinitely pause AI development.
2. Figure out a robust way to do preference agglomeration.
3. Encode #2 into law.
4. Resume AI development (after solving all other safety problems too, of course).
I was going to say step 2 is “draw the rest of the owl” but really this plan has multiple “draw the rest of the owl” steps.
Mm, yeah, maybe. The key part here is, as usual, “who is implementing this plan”? Specifically, even if someone solves the preference-agglomeration problem (which may be possible to do for a small group of researchers), why would we expect it to end up implemented at scale? There are tons of great-on-paper governance ideas which governments around the world are busy ignoring.
For things like superbabies (or brain-computer interfaces, or uploads), there’s at least a more plausible pathway to wide adoption, driven by similar motives for maximizing profit/geopolitical power as with AGI.
I can see that if Moloch is a force of nature, any wannabe singleton would collapse under internal struggles… but it’s not like that would show me any lever AI safety can pull; it would be dumb luck if we live in a universe where the ratio of instrumentally convergent power concentration to its inevitable schism is less than 1 ¯\_(ツ)_/¯