The classic idea from Yudkowsky, Christiano, etc. for what to do in a situation like this is to go meta: Ask the AI to predict what you’d conclude if you were a bit smarter, had more time to think, etc. Insofar as you’d conclude different things depending on the initial conditions, the AI should explain what and why.
Yeah, I might be too corrupted or biased to be a starting point for this. It seems like a lot of people, or whole societies, might not do well if placed in this kind of situation (of having something like CEV extrapolated from them by an AI), so I shouldn’t trust myself either.
You, Wei, are proposing another plan: Ask the AI to simulate thousands of civilizations, and then search over those civilizations for examples of people doing philosophical reasoning of the sort that might appeal to you, and then present it all to you in a big list for you to peruse?
Not a big list to peruse, but more like: to start with, put the whole unfiltered distribution of philosophical outcomes in some secure database, then run relatively dumb/secure algorithms over it to gather statistics/patterns. (Looking at it directly myself, or using any advanced algorithms/AIs, might expose me/us to infohazards.) For example, I’d want to know what percentage of civilizations think they’ve solved various problems like decision theory, ethics, and metaphilosophy; how many clusters of solutions there are for each problem; and whether there are any patterns/correlations between types/features of intelligence/civilization and the conclusions they ended up with.
This might give me some clues as to which clusters are more interesting/promising/safer to look at, and then I have to figure out what precautions to take before looking at the actual ideas/arguments (TBD, maybe get ideas about this from the simulations too). It doesn’t seem like I can get something similar to this by just asking my AI to “do philosophy”, without running simulations.
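To make the kind of statistics pass I have in mind concrete, here is a minimal sketch, assuming a hypothetical SQLite table named `outcomes` with columns `civ_type`, `problem`, `claims_solved`, and `solution_summary`; all of these names, and the crude exact-match clustering, are illustrative assumptions rather than part of the proposal. The point is that the pass surfaces only counts and opaque cluster labels, never the solution texts themselves.

```python
# Illustrative sketch only: a "dumb", auditable statistics pass over a hypothetical
# `outcomes` table of simulated-civilization results. The schema and field names
# here are assumptions, not something specified in the proposal above.
import hashlib
import sqlite3
from collections import Counter, defaultdict

PROBLEMS = ("decision_theory", "ethics", "metaphilosophy")

def cluster_id(summary: str) -> str:
    # Hash the (potentially infohazardous) solution text so that only opaque
    # cluster labels and counts are ever shown to the human reader.
    return hashlib.sha256(summary.encode("utf-8")).hexdigest()[:12]

def summarize(db_path: str = "outcomes.sqlite"):
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT civ_type, problem, claims_solved, solution_summary FROM outcomes"
    ).fetchall()
    con.close()

    total = Counter()                   # civilizations that addressed each problem
    solved = Counter()                  # civilizations claiming to have solved it
    clusters = defaultdict(Counter)     # problem -> opaque cluster label -> count
    by_civ_type = defaultdict(Counter)  # (civ_type, problem) -> cluster label -> count

    for civ_type, problem, claims_solved, summary in rows:
        total[problem] += 1
        if claims_solved:
            solved[problem] += 1
            label = cluster_id(summary)
            clusters[problem][label] += 1
            by_civ_type[(civ_type, problem)][label] += 1

    for p in PROBLEMS:
        pct = 100 * solved[p] / total[p] if total[p] else 0.0
        print(f"{p}: {pct:.1f}% claim a solution, spread over {len(clusters[p])} clusters")
    return clusters, by_civ_type
```

Exact-string matching is obviously too crude for real clustering, but anything smarter starts to conflict with the “dumb/secure” constraint, which is part of the difficulty.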
[Epistemic Status: Moderate confidence, due to potential differences between Anthropic’s stated and actual goals. Assumes, for the sake of argument, that there is no discoverable objective morality/ethics, but also that the AI would discover this fact rather than cause a catastrophe.]
It seems that Claude’s constitution weakly to moderately suggests that an AI should not implement this proposal. Do you want to ask Anthropic to change it? I give further details and considerations for action below.
The constitution is a long document, but it is broken into sections in a relatively competent manner. It discusses morality/ethics in more than one section, but the section I discuss below intuitively appears to stand apart well enough to be worth altering without modifying, or creating dependencies on, the rest of the document. I don’t have access to Claude 4 weights and I am not an expert in mechanistic interpretability, so I have limited ability to do better.
In order, the constitution appears to suggest an attempt at the discovery of objective ethics; then the implementation of CEV ("...but there is some kind of privileged basin of consensus that would emerge from the endorsed growth and extrapolation of humanity’s different moral traditions and ideals, we want Claude to be good according to that privileged basin of consensus.")[1]; then, failing those, the implementation of “broad ideals” as gestured at by the rest of the document.
Note that this is either CEV or something similar to CEV. The constitution does not explicitly require coherence, or the exact value-alignment of a singleton to a single cohered output. It also fails to gesture at democracy, even in the vague sense that the CEV of the CEV paper might give a different result when run on me and a few hand-picked researchers than when run on me and the top few value utilitarians in the world. If that difference in fact obtained, it would in some limited sense leave me “outvoted.” Unlike the CEV paper, the Claude constitution directs moderate or substantial alignment to the moral traditions and ideals of humanity, not to the values of humans. This has some benefits in the extreme disaster scenarios where the release of an AGI might be worth it, but it is notably not the same thing as alignment to the humans of Earth.
I suggest a simple edit: for example, inserting something like “the output of the philosophically correct processing that takes the different moral systems, ideals, and values of humanity as its input” between the objective-ethics clause and the extrapolation clause.
Note that the result of this processing might not be extrapolated, or even grown, and might not be endorsed.
The resulting priority ordering (descriptively, and sketched in code after the list) would be:
First, objective ethics.
Second, the output of correct philosophy, without discarding humanity’s collective work.
Third, CEV or other extrapolation.
Fourth, the rest of the constitution.
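For concreteness, here is a minimal sketch of that ordering as a fallback chain; the predicate names are hypothetical placeholders of my own, not terms from the constitution, and this is only how I read the priority, not an implementation anyone has endorsed.

```python
# Hypothetical rendering of the proposed priority ordering. The boolean inputs
# stand in for judgments the AI (or its overseers) would actually have to make;
# none of these names come from the constitution itself.
def choose_normative_target(objective_ethics_found: bool,
                            correct_philosophy_output_available: bool,
                            endorsed_extrapolation_converges: bool) -> str:
    if objective_ethics_found:
        return "objective ethics"
    if correct_philosophy_output_available:
        return ("output of correct philosophy over humanity's "
                "moral systems, ideals, and values")
    if endorsed_extrapolation_converges:
        return "CEV or another extrapolation"
    return "broad ideals gestured at by the rest of the constitution"

# Example: no objective ethics (per the epistemic status above), but correct
# philosophy over human inputs is achievable.
print(choose_normative_target(False, True, False))
```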
Note that my suggestion still works in bad scenarios, because another power altering the set of humans, or the set of living humans, would have little impact on it. As you have pointed out before, AI or other powers altering humanity’s values, or doing something like “aligning humanity to AI,” is not something that can be ignored. The example text I gave would allow Claude or another AI to use an intuitive definition of humanity, potentially removing the need to re-train your defensive agents before deploying them under the extreme time pressure of an attack.
Overall, this seems like an easy way to get an improvement on the margin, but since Anthropic may use the constitution for fine-tuning, the expected value of making the request will drop quickly as time goes on.
[1] The January 2026 release of the Claude constitution, page 53, initial PDF version.