E. P. Cooper comments on Pausing AI Is the Best Answer to Post-Alignment Problems

E. P. Cooper 17 May 2026 18:55 UTC
I want to make a few adjustments to my terminology and clarify a point about what it would really entail for a human to use the “correct” decision theory. The new terminology should better match that of Vladimir Nesov’s newest comment on this subject^[1].
The restrictions I describe in my comment above are actually about the human’s decision to be replaced by an agent that has a utility function (or similar parameter) programmed into the correct decision theory. The list of options the superintelligence presents is important because it is upstream of the human’s choice to be so replaced. Under Nesov’s proposal, the information a superintelligence is allowed to show a human is strictly regulated by the “aggregate,” a fixed point calculated under laws (similar in concept to the laws of physics) held constant by an updateless core. Control over tiny details in the list and its presentation to the human could be used by a sufficiently intelligent and knowledgeable agent to (unnoticeably) manipulate the exact utility function selected. If Nesov’s proposal is implemented, this manipulation may be legitimate counsel, in the sense that manipulating the human into making illegitimate decisions (according to said human’s fixed point) would be off policy.^[2]
If the superintelligence were just showing a list with incomprehensible items on it that are claimed to be utility functions, that might not be prohibited. Replacement or modification on a deep level is why the requirements may appear too strict if the case I described previously is assumed to be the standard template. Other cases may have loose restrictions, if any at all: for example, the question of whether a human should be persuaded (incredibly subtly) to make a sandwich with the pieces of bread in loaf or rotated-to-oppose relative orientation. This is because Nesov’s proposal involves (tractable, so he claims) self-reference, in a similar style to CEV.
Trying to keep to Nesov’s terminology, what I called an initial dynamic should presumably be called an initial aggregate. “Initial dynamic” may be too suggestive of a particular method for reaching a fixed point, when Nesov’s position currently appears to be only that the fixed point must be reached through reasoning, by whatever method. I stand by my claim that an initial aggregate is always required. This is because, as Nesov says, the fixed point can only be approached (or, hypothetically, reached in a single step) “according to what the aggregated values have figured out so far.”^[3]
If anyone’s interested, I think a useful task to get started with would be an investigation into what additional constraints (if any) should be applied to the fixed points, beyond Nesov’s requirement that the values you obtain in alternate paths are only considered if they are legitimate according to the prior (maybe initial) aggregate, and the potential requirement that values are used to influence what paths are considered at all. These additional requirements would presumably be listed out manually by humans.^[4]
In an attempt to learn from the past 20+ years of work on CEV, I think it’s important to think about what should happen if your outer alignment method fails to converge. Note that for CEV, some sort of convergence may be obtained if the CEV of the contributors to the AI’s development converges, since it can do the full calculation on all humans that are currently alive while kicking out problem components according to the CEV of the contributors. This may require deciding in advance and/or the sacrifice of a volunteer, however.
By contrast, under Nesov’s proposal it may turn out that most or all humans do not have a legitimate fixed point of the right sort, or that the math just does not work at all for many plausible evolved aliens (e.g. only trivial transformations turn out to meet all the desiderata). This outcome reads above chance on my informal “betting” aggregation.
Even if this is not true, it may turn out that some humans’ aggregations cannot reach a fixed point successfully. This, and the reasons described before, suggest that some sort of fallback is needed for that case and possibly others. I describe a potential approach for this at the end of this comment. Note that even the existence of a fallback may be a catastrophic incentive/preference instability problem, as it apparently was for some CEV proposals.
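To make the shape of the convergence worry concrete, here is a toy sketch. It is not Nesov’s proposal or any part of CEV: values are stood in for by a single float, and `update`, `initial`, `tol`, and `fallback` are all hypothetical names I am assuming purely for illustration. It only shows three structural points from above: the iteration needs some initial aggregate to start from, “reaching” a fixed point in practice means landing within a tolerance (a small “fixed region”), and a fallback is what gets used when the iteration does not settle.

```python
from typing import Callable, Tuple


def iterate_to_fixed_point(
    update: Callable[[float], float],
    initial: float,
    tol: float = 1e-9,
    max_steps: int = 1000,
) -> Tuple[float, bool]:
    """Repeatedly apply `update`, starting from `initial`.

    Returns (value, converged). "Converged" means two successive values
    differ by at most `tol`, i.e. we landed in a small region rather
    than on an exact point.
    """
    current = initial
    for _ in range(max_steps):
        nxt = update(current)
        if abs(nxt - current) <= tol:
            return nxt, True
        current = nxt
    return current, False


def resolve(
    update: Callable[[float], float],
    initial: float,
    fallback: float,
) -> float:
    """Use the approximate fixed point if one is found, else the fallback."""
    value, converged = iterate_to_fixed_point(update, initial)
    return value if converged else fallback


# A contracting update settles near x = 2 from this starting point.
settled = resolve(lambda x: 0.5 * x + 1.0, initial=0.0, fallback=-1.0)

# x -> -x has a fixed point at 0, but starting from 1.0 the iteration
# oscillates forever, so the fallback is used instead.
oscillating = resolve(lambda x: -x, initial=1.0, fallback=-1.0)
```

In the second example the same update does settle if the starting point is already 0, which is the toy version of the claim that the outcome depends on the initial aggregate.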
CEV is already hard enough to implement: it requires a fully unleashed lower-order Do What I Mean (DWIM) agent running a decision theory substantially beyond the state of the art, where the only thing preventing the agent from immediately self-modifying or creating sub-agents to get around restrictions is alignment that runs solely through that decision theory. I think Nesov’s approach may be even harder, given that it requires constant operation instead of being tasked with the creation of a single utility function that will never be reconsidered, among other things. One example is the question of what should be done about humans manipulating other humans, given that it is nearly certain that at least one human would have legitimate (according to the fixed point of that person) potential future histories where the successful manipulation of another person occurs while not in the presence of superintelligence. I have great uncertainty about all this, however.
Nesov’s proposal contains so much unformalized content that I am unsure where to begin. For a fallback or alternative, it is possible that a line of attack could be opened by the formalization of a static account of rationality and counterlogicals. This may allow a method where the fixed point finding is skipped, and instead the counterlogical versions of Nesov’s alternate future histories are used to determine the aggregate, with only counterlogicals meeting certain fixed criteria being inspected according to further fixed criteria. This personal aggregate would then lend legitimacy to some histories involving superintelligences, similar to Nesov’s proposal. I am unaware of any progress in these areas, however. It is possible that current work on CEV will not lead to anything that carries over, since I see nothing there that is in the form of rigorous tiling theorems and full designs for the cores of proven-aligned agents. Given such slow progress, and given the perils of trusting AI systems to do this, I think humans would have to individually program each constraint on the counterlogicals. This raises the possibility that humans decades to centuries in the future may do something incorrectly here, either intentionally or unintentionally, as they constrain and evaluate the counterlogicals, even if they delegate to blinded and self-erasing programmed computers as much as possible. I’m not sure what to do.
1. ^
  https://www.lesswrong.com/posts/vzHtHHBJoKATi5SeK/empowerment-corrigibility-etc-are-simple-abstractions-of-a?commentId=BjQrqeKfov946oAKj
2. ^
  Note that while, hypothetically, all this fixed point calculation could be done by the agent that has the preferences itself (think: a human calculating for itself), in practice only a superintelligence would be able to accurately find a valid fixed point. If a friendly AI were developed, it would presumably do everything required by Nesov’s proposal on behalf of humans, in the background.
3. ^
  Nesov seems to write as though there is only one fixed point, maybe for simplicity, but I don’t see how any practical method would be that precise and accurate. Maybe there would be a “fixed region” in a similar style to the goals achieved under certain proposals for soft optimization.
4. ^
  The initial aggregate could be considered twin to the prior, though since it can’t be multiplied it can’t be mixed in. This is opposed to the classical pair of prior and utility function. In humans, the situation is presumably much messier than anything described here, however.