The existence of such facts seems plausible: if there were facts about what is rational (which seems likely) but no facts about how to become rational, that would be a strange state of affairs.
There might be facts about what’s rational, but not about what utility function[1] it is right to use. Maybe a superintelligence could tell you (in a somewhat objective/convergent sense) what utility function to use, but the exact answer would depend on the utility function of the superintelligence[2].
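As a toy illustration of this dependence (the names and structure here are mine, not anything from the literature): a superintelligence that filters candidate utility functions down to those the human would accept, and then breaks ties by its own preferences, gives different "objective" recommendations depending on its own values.

```python
def recommend(candidates, human_acceptable, si_utility):
    """Toy model of footnote [2]: from the utility functions the human
    would accept, the superintelligence recommends the one *it* most
    prefers the human to adopt."""
    acceptable = [u for u in candidates if human_acceptable(u)]
    return max(acceptable, key=si_utility)

# Two superintelligences with different values can give different
# recommendations over the same candidates and the same human filter:
candidates = ["total_utilitarian", "average_utilitarian", "egoist"]
accept = lambda u: u != "egoist"  # stand-in for the human's filter
print(recommend(candidates, accept, {"total_utilitarian": 1, "average_utilitarian": 2}.get))
print(recommend(candidates, accept, {"total_utilitarian": 2, "average_utilitarian": 1}.get))
```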
In Vladimir Nesov’s opinion[3], even presenting a human with a list of (known convergent) utility functions would be invalid unless the exact same list is also presented in a “hypothetical history” in which that person is never exposed to superintelligence or strong persuasion; otherwise the person’s decision about which utility function to adopt would be “illegitimate,” since it would depend on superintelligence-produced data that has no (legitimate) alternate source.
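A minimal sketch of that condition, under my own toy assumptions (`Event`, `legitimate`, and `list_is_admissible` are my names, not Nesov’s): a superintelligence-produced list is admissible only if the exact same list also shows up in at least one superintelligence-free hypothetical history.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    data: str
    superintelligent: bool = False
    strongly_persuasive: bool = False

def legitimate(history):
    """Toy predicate: the human is never exposed to superintelligence
    or strong persuasion anywhere in this history."""
    return not any(e.superintelligent or e.strongly_persuasive for e in history)

def list_is_admissible(shown_list, hypothetical_histories):
    """The exact list shown by the superintelligence must have a
    legitimate alternate source: some superintelligence-free history
    in which the same list is presented."""
    return any(legitimate(h) and any(e.data == shown_list for e in h)
               for h in hypothetical_histories)
```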
Nesov’s proposal does not define an initial dynamic that would lead to the fixed point he references. This fixed point may, in some cases, try to aggregate legitimate histories (those in which no strongly persuasive or superintelligent entities influence the human) in order to extend legitimacy to histories that do contain such entities. But even with a defined initial dynamic, the space of decisions[4] that are truly orthogonal[5] to the particular human’s utility function may be confined and weirdly shaped. Since the human deciding on a utility function (with or without superintelligent help) must not decide based on an already completed decision (5 dollars does not equal 10 dollars), this orthogonal space is the only allowable one, and so the human may not be allowed support from aggregation (the only thing that would let a superintelligence show a list that needs a superintelligence to create).
Note that some self-reference is okay, but the initial dynamic must reliably be the basis of the fixed point. This cannot legitimately occur if the dynamic is stripped of everything that causes (in the substrate-independent structure of the human’s free will) the human to legitimately obtain[6] the single correct utility function for that particular human, according to that particular human’s initial dynamic. That dynamic is itself based on (but does not solely consist of) the human’s behavior in “non-pathological hypothetical histories” produced by legitimately approximating the human as separable from physics[7]; this legitimacy itself requires that the causal substance of free will be preserved, the causal substance that is the abstract to physics’s concrete, even as the human is removed from physics[8].
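To make the grounding requirement concrete (a sketch under my own assumptions; Nesov’s proposal specifies neither the dynamic nor the step function): a state only counts as the legitimate fixed point if it is actually reached by iterating from the initial dynamic, not merely because it happens to satisfy the fixed-point equation.

```python
def reflective_fixed_point(initial_state, step, max_iters=10_000):
    """Iterate `step` from the initial dynamic until nothing changes.

    The legitimacy condition in the text: the fixed point must be
    *based on* the initial dynamic. A state x with step(x) == x that
    was never reached from the initial dynamic does not qualify, so
    we only ever return states on the iteration's own trajectory.
    """
    state = initial_state
    for _ in range(max_iters):
        nxt = step(state)
        if nxt == state:
            return state  # reached from the initial dynamic
        state = nxt
    raise RuntimeError("no fixed point reached within the budget")
```

On this toy picture, stripping the dynamic of the causally relevant structure changes `step`, and the trajectory from the initial state no longer lands on (or grounds) the same fixed point.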
[1] Or similar parameter.

[2] This would be because the superintelligence would prefer world states where you have one candidate utility function over another.

[3] https://www.lesswrong.com/posts/vHesg2rw3jWCGHTWa/human-agency-in-a-superintelligent-world#Superintelligence_is_Unable_to_Help

[4] By the particular human.

[5] Though orthogonality may be too strong a requirement here, hence my uncertainty. We may need a better account of counterlogicals to clearly write out what we mean.

[6] Discussion of outside selection of multiple free wills is left until later.

[7] Potentially requiring a feathered boundary, not a sharp one.

[8] Removed from direct contact, that is, (abstract) human → superintelligence → physics, rather than human → physics (where the arrows describe a certain kind of steering).