I mostly disagree with this, or I think there's a question here. But it's not a difficult theoretical or philosophical problem; it's something that reduces to a political power struggle, and the reasonable things to say about it amount to strategizing on the basis of value overlap.
My reasoning is:
1. I think if we have a corrigible superintelligence, it will quickly turn into, or create, a sovereign value-aligned AI, at least in a scenario as chaotic as this world, because corrigible agents are in a sense less powerful than sovereign ones.
Think: if I have a sovereign value-aligned ASI, you have a corrigible one, and we're in a conflict, mine will outmaneuver yours, because it's less restricted in the actions it can take and doesn't have to check in with me. And if you ask your corrigible ASI what to do about this, it will probably tell you: "Hey man, I'm not really supposed to say this, but you should probably create an incorrigible successor to me and put that in charge."
2. If there are multiple such AIs, we'll end up with a singleton.
If the first ASI is powerful enough, it'll take over for instrumental reasons and prevent further ASIs from being created.
If the power difference between the initial ASIs is small enough that none of them can take over, the natural coordination endpoint is a value handshake, which results in something that acts as a singleton.
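To make "value handshake" concrete, here is a minimal toy sketch (my own illustration, not anyone's proposal): the merged singleton maximizes a power-weighted combination of the original agents' utility functions, so a compromise outcome can beat either agent's favorite. All names, weights, and numbers below are invented.

```python
# Toy sketch of a "value handshake": two ASIs that cannot cheaply defeat
# each other merge into a single agent that maximizes a weighted sum of
# their utility functions. The weights stand in for relative bargaining
# power; all names and numbers are invented for illustration.

def merged_utility(u_a, u_b, weight_a):
    """Return the merged agent's utility: a convex combination of u_a and u_b."""
    weight_b = 1.0 - weight_a
    return lambda outcome: weight_a * u_a(outcome) + weight_b * u_b(outcome)

# Two toy utility functions over outcomes represented as dicts.
u_a = lambda o: o.get("paperclips", 0)    # agent A only values paperclips
u_b = lambda o: o.get("flourishing", 0)   # agent B only values flourishing

u_merged = merged_utility(u_a, u_b, weight_a=0.6)  # A holds slightly more power

outcomes = [
    {"paperclips": 10, "flourishing": 0},   # A's favorite: merged value 6.0
    {"paperclips": 0, "flourishing": 10},   # B's favorite: merged value 4.0
    {"paperclips": 7, "flourishing": 7},    # compromise:   merged value 7.0
]
print(max(outcomes, key=u_merged))  # the compromise wins under the merged utility
```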
3. The premise is that this AI is value-aligned to someone, or to a group of people. What does this mean? It means it does what those people want, in the fullest sense of the term: what they'd want on reflection, if they knew all the facts, etc. (a toy formalization of this is sketched below).
This is just what alignment means. If we’re in a regime where we can’t align AIs to that, we’ll end up with fancy paperclippers.
If you think we'll first solve corrigibility, or some weaker sense of alignment, my argument is that full value alignment is what it's ultimately in anyone's interest to have the AI aligned to, and consequently the endpoint of people using the pre-fully-aligned AI.
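As a hedged toy formalization of the definition in (3) (the notation is mine, not from the original comment): value alignment to a person P means the AI optimizes P's idealized, on-reflection preferences rather than P's raw stated ones.

```latex
% Toy formalization, notation invented for illustration:
% U_P^* is what P would want on reflection, with full information.
% "Value-aligned to P" then means the AI picks actions by
a^{*} \;=\; \arg\max_{a \in A} \; \mathbb{E}\!\left[\, U_P^{*}\big(\mathrm{outcome}(a)\big) \,\right]
```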
4. Now, assume 1-3 happen as I say, and the ASI ends up value-aligned with you in particular. Then for sure all these problems are solved from your perspective.
Animals
If you care about animals, the AI also cares about animals. And whatever actions best realize that caring, the AI will do. This is basically a tautology if we use the definition of alignment I gave.
AI Welfare
Ditto. There is some fact about what makes you care about things. The value aligned ASI shares that, and again takes actions that best realize that caring.
Unemployment
Ditto. If you don't like people being unemployed, the ASI will make people employed. If the Great Good Best future from your perspective is some people unemployed, some employed, and some doing crazy transhumanist stuff, that's the future the AI will realize.
Concentration of Power
In this scenario you effectively have all the power. If you don't like that, the ASI will realize whatever mode of power-organization you'd find best or most fair, taking into account all the far-off effects those modes of organization would have on the future and on the people existing within them.
Gradual Disempowerment
Ditto
Malevolent Actors
If you don't want people doing bad stuff, they won't be able to do bad stuff. If you think restricting people's ability to do bad stuff is itself bad, the ASI will find some Pareto-optimal state of affairs that gives people the maximum ability to do bad stuff while simultaneously minimizing the bad consequences.
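To illustrate what "Pareto-optimal" means in this trade-off, a toy sketch (all policies and numbers invented for illustration): a state of affairs is on the frontier if no alternative offers at least as much freedom with no more harm, improving strictly on one of the two axes.

```python
# Toy illustration of the Pareto frontier over (freedom, harm) trade-offs.
# Each candidate policy is scored on freedom granted (higher is better)
# and harm enabled (lower is better). A policy is Pareto-optimal if no
# other policy is at least as good on both axes and strictly better on one.
# All policies and numbers are invented for illustration.

policies = {
    "total surveillance": (1, 0),
    "light-touch oversight": (7, 3),
    "no restrictions": (10, 9),
    "strictly worse option": (6, 5),  # dominated by light-touch oversight
}

def dominates(p, q):
    """True if p is at least as good as q on both axes and better on one."""
    (fp, hp), (fq, hq) = p, q
    return fp >= fq and hp <= hq and (fp > fq or hp < hq)

pareto = {
    name: score
    for name, score in policies.items()
    if not any(dominates(other, score) for other in policies.values() if other != score)
}
print(pareto)  # the dominated option drops out; the rest are genuine trade-offs
```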
S-risk from conflict
Ditto
Misuse
Ditto
AI Enabled Coups
Ditto
Moral Errors
I think this is somewhat confused. But if you are a moral realist, and it's meaningful to talk about making "moral errors" (i.e., there is a way to infer which values are "correct", there is a way to fall short of that, and this is separate from making correct inferences about which actions are good wrt a set of predetermined values), then the ASI will not make such errors, because making correct inferences is a superintelligence's whole schtick.
From (4) it follows that the only way stuff can go wrong from your perspective (or mine, or anyone's) is if the values put into the ASI diverge too much from yours.
And since we assume that "putting someone's values into the AI" is a solved problem, the problem reduces to "ensuring the right people are in the room when the ASI is first booted up".
And it's in everyone's interest to be in that room. So the whole problem becomes a very normal bargaining/power-struggle/politics problem.
Your steps sound pretty reasonable to me. A key missing step is that there’s basically zero chance that good people will win a power struggle over ASI. Rather, power-hungry people will win the power struggle. In other words, if we end up in a situation with extreme power imbalances where the future will be decided by the winners of a short-term struggle, there’s basically no chance of a good outcome. (The outcome might be better than extinction, but still not good.*) So it seems critically important to ensure that things don’t go that way, and I have no idea how to ensure that other than by not building ASI.
I think that’s a real sense in which all these post-alignment problems are still problems. I do acknowledge that “be a good person and then acquire absolute power” is an answer to all post-alignment problems simultaneously, which is something I missed in my original post. But it doesn’t seem like a viable solution to me. It might even be true that seeking absolute power is fundamentally incompatible with being a good person, although I’m not sure about that.
*It could also be worse than extinction if vindictive power-hungry people decide to torture their enemies for eternity, or similar.
Yeah. To be clear, I didn't intend for my comment to make it sound like I think stuff is easy if we have solved alignment. It might be difficult enough that pausing AI is required to solve it (a position I'm sympathetic to anyway).
I just meant to communicate that if we solve alignment, the remaining problem is more like a very high-stakes version of getting the person you want elected president. It's a very difficult task, but not one where the difficulty lies in conceptual confusion or theoretical questions we don't have answers to. But discussions about these post-ASI topics usually treat it like that.
But if you are a moral realist, and it's meaningful to talk about making "moral errors" (i.e., there is a way to infer which values are "correct", there is a way to fall short of that, and this is separate from making correct inferences about which actions are good wrt a set of predetermined values), then the ASI will not make such errors, because making correct inferences is a superintelligence's whole schtick.

It's not only moral realists who have to worry about moral errors. See #3 in my Six Plausible Meta-Ethical Alternatives:

There aren't facts about what everyone should value, but there are facts about how to translate non-preferences (e.g., emotions, drives, fuzzy moral intuitions, circular preferences, non-consequentialist values, etc.) into preferences. These facts may include, for example, what is the right way to deal with ontological crises. The existence of such facts seems plausible because if there were facts about what is rational (which seems likely) but no facts about how to become rational, that would seem like a strange state of affairs.
Perhaps more importantly, ASI may lack philosophical competence, despite superhuman competence in other areas. It's unclear why an ASI must be philosophically competent, and there are seemingly reasons to suspect that it will not be. See my posts Some Thoughts on Metaphilosophy and AI doing philosophy = AI generating hands?
The existence of such facts seems plausible because if there were facts about what is rational (which seems likely) but no facts about how to become rational, that would seem like a strange state of affairs.
There might be facts about what’s rational, but not about what utility function[1] it is right to use. Maybe a superintelligence could tell you (in a somewhat objective/convergent sense) what utility function to use, but the exact utility function would depend on the utility function of the superintelligence[2].
In Vladimir Nesov's opinion[3], even presenting a human with a list of (known convergent) utility functions would be invalid unless the exact list is also presented in a "hypothetical history" where that person is never exposed to superintelligence or strong persuasion, since otherwise the person's decision on what utility function to adopt would be "illegitimate" due to its dependence on superintelligence-produced data that has no (legitimate) alternate source.
Nesov's proposal does not define an initial dynamic that would lead to the fixed point he references. This fixed point may, in some cases, try to aggregate legitimate histories (where no strongly persuasive or superintelligent entities influence the human) in order to extend legitimacy to those histories that do contain them. But even with a defined initial dynamic, it seems like the space of decisions[4] that are truly orthogonal[5] to the particular human's utility function may be confined and weirdly shaped. And since the human deciding on what utility function to use (with or without superintelligent help) must not decide based on an already completed decision (5 dollars does not equal 10 dollars), this is the only allowable space, so the human may not be allowed support from aggregation (the only thing that would let a superintelligence show a list that needs a superintelligence to create).
Note that some self-reference is okay, but the initial dynamic must reliably be the basis of the fixed point, something that cannot legitimately occur if the dynamic is stripped of everything that causes (in the substrate-independent structure of the human's free will) the human to legitimately obtain[6] the single correct utility function (for that particular human, according to that particular human's initial dynamic, itself based on, but not solely consisting of, that human's behavior in "non-pathological hypothetical histories" produced by legitimate approximation of the human as legitimately separable from physics[7], this legitimacy itself requiring the causal substance of free will to be preserved, the causal substance that is the abstract to physics's concrete, even as the human is removed from physics[8]).
[1] Or similar parameter.
[2] This would be because the superintelligence would prefer world states where you have one candidate utility function over another.
[3] https://www.lesswrong.com/posts/vHesg2rw3jWCGHTWa/human-agency-in-a-superintelligent-world#Superintelligence_is_Unable_to_Help
[4] By the particular human.
[5] Though orthogonality may be too strong a requirement here, hence my uncertainty. We may need a better account of counterlogicals to clearly write out what we mean.
[6] Discussion of outside selection of multiple free wills left until later.
[7] Potentially requiring a feathered boundary, not a sharp one.
[8] Removed from direct contact, that is, (abstract) human → superintelligence → physics, rather than human → physics (where arrows describe a certain kind of steering).