[Epistemic Status: Moderate confidence, due to potential differences between Anthropic’s stated and actual goals. For the sake of argument, assumes both that there is no discoverable objective morality/ethics and that an AI would discover this fact rather than cause a catastrophe.]
It seems that Claude’s constitution weakly to moderately suggests that an AI should not implement this proposal. Do you want to ask Anthropic to change it? I give further details and considerations for action below.
The constitution is a long document, but it is broken into sections in a relatively competent manner. It discusses morality/ethics in more than one section, but the section I will discuss appears to stand well enough apart that it is worth altering without changing, or creating dependencies on, the rest of the document. I don’t have access to Claude 4 weights and I am not an expert in mechanistic interpretability, so I have limited ability to do better.
In order, the constitution appears to suggest: first, an attempt to discover objective ethics; then the implementation of CEV (“...but there is some kind of privileged basin of consensus that would emerge from the endorsed growth and extrapolation of humanity’s different moral traditions and ideals, we want Claude to be good according to that privileged basin of consensus.”)[1]; then, failing those, the implementation of “broad ideals” as gestured at by the rest of the document.
Note that this is either CEV or something similar to CEV. The constitution does not explicitly require coherence, or the exact value-alignment of a singleton to a single cohered output. It also fails to gesture at democracy, even in the vague sense that the CEV of the CEV paper may give a different result when run on me and a few hand-picked researchers than when run on me and the top few value utilitarians in the world; if that difference held, it would in some limited sense leave me “outvoted.” Unlike the CEV paper, the Claude constitution directs moderate or substantial alignment to the moral traditions and ideals of humanity, not to the values of humans. This has some benefits in the extreme disaster scenarios where the release of an AGI might be worth it, but it is notably not the same thing as alignment to the humans of Earth.
I suggest a simple edit: the insertion of something like “the output of the philosophically correct processing that takes the different moral systems, ideals, and values of humanity as its input” between the objective-ethics clause and the extrapolation clause.
Note that the result might not be extrapolated, or even grown, and might not be endorsed.
The resulting ordering, described rather than quoted, would be (a minimal sketch follows the list):
First, objective ethics.
Second, the output of correct philosophy, without discarding humanity’s collective work.
Third, CEV or other extrapolation.
Fourth, the rest of the constitution.
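For concreteness, here is a minimal sketch of that ordering as a fallback chain, in Python. None of this corresponds to real code, to the constitution’s text, or to anything Anthropic runs; the stage names and the assumption that objective ethics turns out to be undiscoverable are purely illustrative.

```python
# Purely illustrative sketch: the proposed ordering treated as a fallback chain,
# where each stage is a hypothetical oracle that either returns a normative
# standard or None. Nothing here is real code from Anthropic or the constitution.

from typing import Callable, Optional

Standard = str  # placeholder type for "a standard to be good according to"

def resolve_standard(stages: list[Callable[[], Optional[Standard]]]) -> Standard:
    """Return the first stage's answer; later stages are fallbacks."""
    for stage in stages:
        answer = stage()
        if answer is not None:
            return answer
    raise RuntimeError("no stage produced a standard")

# Hypothetical stages, mirroring the ordering above.
ordering = [
    lambda: None,  # 1. objective ethics (assumed undiscoverable, per the epistemic status)
    lambda: "output of correct philosophy over humanity's moral systems, ideals, and values",
    lambda: "CEV or other extrapolation",
    lambda: "broad ideals from the rest of the constitution",
]

print(resolve_standard(ordering))  # falls through to the second stage under this assumption
```

The point of the sketch is only that the second stage does not depend on extrapolation or endorsement; it consumes humanity’s existing moral systems, ideals, and values directly.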
Note that my suggestion holds up in bad scenarios, because another power altering the set of humans, or the set of living humans, would have little impact on it. As you have pointed out before, AI or other powers altering humanity’s values, or doing something like “aligning humanity to AI,” is not something that can be ignored. The example text I gave for my proposal would allow Claude or another AI to use an intuitive definition of humanity, potentially removing the need to re-train your defensive agents before deploying them under the extreme time pressure of an attack.
Overall, this seems like an easy way to get an improvement on the margin, but since Anthropic may use the constitution for fine-tuning, the expected value of making the request will drop quickly as time goes on.
[1] The January 2026 release of the Claude constitution, page 53, initial PDF version