An update on this 2010 position of mine, which seems to have become conventional wisdom on LW:

In my posts, I’ve argued that indexical uncertainty like this shouldn’t be represented using probabilities. Instead, I suggest that you consider yourself to be all of the many copies of you, i.e., both the ones in the ancestor simulations and the one in 2010, making decisions for all of them. Depending on your preferences, you might consider the consequences of the decisions of the copy in 2010 to be the most important and far-reaching, and therefore act mostly as if that was the only copy. [Emphasis added]
In the subsequent 15 years, I’ve upweighted influencing the multiverse through my copies in simulations, relative to base universes, to the point where they’re about equally important in my mind, for four reasons:
Realizing that there are probably universes with vastly greater computational resources than ours, implying there are more simulations containing me than I had thought.
Realizing that there may well be many possible motivations for running simulations, besides interest in ancestors, which was the main motivation talked about back in 2010. This includes acausal trade, curiosity about aliens, science on the distribution of superintelligences in the multiverse, gathering philosophical ideas, entertainment, altruism.
Being fortunate in various ways, on a personal level, that seem more likely if I’m in a simulation.
Doing badly on a civilizational level, in a way that devalues trying to influence base reality. I.e., relative to 2010, it looks more likely we’ll undergo the AI transition blindly, without key decision makers paying attention to important philosophical and strategic ideas.
Concretely, this mostly cashes out to me thinking and writing with potential simulators in mind as an additional audience, hoping my ideas might benefit or interest some of them even if they end up largely ignored in this reality.
Concretely, this mostly cashes out to me thinking and writing with potential simulators in mind as an additional audience, hoping my ideas might benefit or interest some of them even if they end up largely ignored in this reality.
What? Aren’t almost all of those simulators enormously more intellectually capable than humans? This seems analogous to an ant trying to write for the benefit of humans.
Are you imagining uncovering ideas that the simulators hadn’t thought of? Or do you imagine that they’ll already have considered anything you think of, but might still be interested to know that you thought of them? Or something else?
Imagine someone (or some civilization) who is not very philosophically competent (in an absolute sense, like myself) but somehow got access to a large amount of compute, perhaps by building an intent-aligned AGI that sourced a bunch of compute for them, where this AGI also isn’t very philosophically competent (or they can’t trust it or themselves not to be a clever arguer who could talk them into any conclusion, etc.). How do they turn some of that compute into philosophical progress?
One idea I have is to simulate a diverse range of civilizations (to help avoid falling into the same cognitive traps) and look at the resulting distribution of philosophical arguments/conclusions, maybe try to sift for ones that are especially competent at philosophy, etc. Does this make sense? @Daniel Kokotajlo
It seems like a civilization that is able to pull that off is operating in a weirdly narrow band of competencies and incompetencies.
It seems surprising that they’re not very philosophically competent while also 1) being reflective enough to know that they’re not philosophically competent (and to know that they can’t trust their AIs), 2) technically capable of securing massive amounts of compute, and 3) wise enough and coordinated enough that no one in that civilization uses that technological capacity to build a sovereign.
I suppose it’s hard to model the distribution of non-human civilizations, but this seems like a pretty improbable conjunction.
I think all sufficiently competent/reflective civilizations (including sovereign AIs) may want to do this, because it seems hard to be certain enough in one’s philosophical competence to not do this as an additional check. The cost of running thousands or even millions of such simulations seems very small compared to potentially wasting the resources of an entire universe/lightcone due to philosophical mistakes. Also they may be running such simulations anyway for other purposes, so it may be essentially free to also gather some philosophical ideas from such simulations, to make sure you didn’t miss something important or get stuck in some cognitive trap.
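As a crude expected-value sketch of that cost comparison (with made-up symbols: n simulations at cost c each, V the value of the lightcone’s resources, and p the chance the simulations catch an otherwise-fatal philosophical mistake), running them is worth it whenever:

```latex
n \cdot c \;<\; p \cdot V
```

Since n·c is a negligible fraction of V for any plausible simulation budget, even a very small p clears the bar.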
It seems like you think the ceiling of philosophical competence is very very high, so that even civilizations that are substantially wiser than ours and presumably much more philosophically competent (?) than ours, would not trust their philosophy very much. [1]
That is, they could be “twice” as philosophically competent as us (on some hypothetical reasonable scale), but that’s just still not very much in an absolute sense.
Is there a particular reason why you think that the ceiling is so high?
In general, it seems hard to know which problems are just beyond our grasp and which problems far outstrip our abilities. I could imagine that, if I were only a little bit smarter than any human to date, it wouldn’t be easy for me to solve problems that are currently philosophically fraught for me with robust and verifiable methods.
Or alternatively, do you guess that civilizations with a much higher average intelligence, that are overall displaying more wisdom and coordination than ours, are not also more likely to be philosophically competent?
It seems crazy to me that there’s not a positive correlation between intelligence and philosophical competence or wisdom and philosophical competence.
Interesting. I’d wondered why you wrote so many pieces advising people to be cautious about more esoteric problems arising from AI, to an extent that seemed extremely unlikely to be implemented in the real world, but there being a chance simulators are listening to your arguments does provide an alternative avenue for influence.
I’d wondered why you wrote so many pieces advising people to be cautious about more esoteric problems arising from AI,
Interesting that you have this impression, whereas I’ve been thinking of myself recently as doing a “breadth first search” to uncover high level problems that others seem to have missed or haven’t bothered to write down. I feel like my writings in the last few years are pretty easy to understand without any specialized knowledge (whereas Google says “esoteric” is defined as “intended for or likely to be understood by only a small number of people with a specialized knowledge or interest”).
If on reflection you still think “esoteric” is right, I’d be interested in an expansion on this, e.g. which of the problems I’ve discussed seem esoteric to you and why.
to an extent that seemed extremely unlikely to be implemented in the real world
It doesn’t look like humanity is on track to handle these problems, but “extremely unlikely” seems like an overstatement. I think there are still some paths where we handle these problems better, including 1) warning shots or a political wind shift cause an AI pause/stop to be implemented, during which some of these problems/ideas are popularized or rediscovered, and 2) future AI advisors are influenced by my writings or are strategically competent enough to realize these same problems and help warn/convince their principals.
I also have other motivations including:
status—Recognition even among a small group can be highly motivating for humans.
intellectual curiosity—Think of it as “theoretical Singularity strategic studies”. Sure seems more interesting than many other intellectual puzzles that people pursue.
dignity—Even if a few humans can see things more clearly, that’s more dignified than going into the AI transition completely blind.
Yeah “esoteric” perhaps isn’t the best word. What I had in mind is that they’re relatively more esoteric than “AI could kill us all” and yet it’s pretty hard to get people to take even that seriously! “Low-propensity-to-persuade-people” maybe?
but “extremely unlikely” seems like an overstatement[...]

Yes this is fair.
What I had in mind is that they’re relatively more esoteric than “AI could kill us all” and yet it’s pretty hard to get people to take even that seriously! “Low-propensity-to-persuade-people” maybe?
Yeah, that makes sense. I guess I’ve been using “illegible” for a similar purpose, but maybe that’s not a great word either, because that also seems to imply “hard to understand” but again it seems like these problems I’ve been writing about are not that hard to understand.
I wish I knew what is causing people to ignore these issues, including people in rationality/EA (e.g. the most famous rationalists have said little on them). I may be slowly growing an audience, e.g. Will MacAskill invited me to do a podcast with his org, and Jan Kulveit just tweeted “@weidai11 is completely right about the risk we won’t be philosophically competent enough in time”, but it’s inexplicable to me how slow it has been, compared to something like UDT which instantly became “the talk of the town” among rationalists.
Pretty plausible that the same underlying mechanism is also causing the general public to not take “AI could kill us all” very seriously, and I wish I understood that better as well.
Interesting idea. So suppose your intent-aligned AGI has taken over the world on your behalf and built the Dyson swarms etc., and it’s patiently awaiting your orders for whether to upload you, what to tell the von Neumann probes to do, whether and how to do acausal trade, etc. You have enough wisdom to strongly suspect that for a wide range of philosophical questions, the AGI could convince you of multiple different answers to those questions if it tried, even if it restricted itself to only saying things it believed to be true. Moreover, on at least one or two occasions the AGI has demonstrated what seems to you to be poor philosophical judgment—e.g. maybe you asked it for its views on ethics, and it said “all that matters is the minimization of suffering” and then you probed further and it turns out “suffering” basically means “training loss” and you ask it to explain the reasoning that led to this conclusion and it says “simplicity heuristic + intuition” and you are like “OK but… I don’t share your intuitions… and also, I don’t see why simplicity should matter so much...” and then you stopped yourself from asking further questions because you are worried it’ll just be way too persuasive.
The classic idea from Yudkowsky, Christiano, etc. for what to do in a situation like this is to go meta: Ask the AI to predict what you’d conclude if you were a bit smarter, had more time to think, etc. Insofar as you’d conclude different things depending on the initial conditions, the AI should explain what and why.
You, Wei, are proposing another plan: Ask the AI to simulate thousands of civilizations, and then search over those civilizations for examples of people doing philosophical reasoning of the sort that might appeal to you, and then present it all to you in a big list for you to peruse?
If you can trust the AI to pick out the examples of reasoning that might appeal to you, can’t you also trust it to just do a bunch of that kind of reasoning itself? “Please do ten thousand years of philosophy, but only use reasoning styles that I’d recognize as valid / sound / appropriate. Present to me your conclusions.”
...got to go for a bit, might return to this later, don’t let my absence stop you from replying though.
The classic idea from Yudkowsky, Christiano, etc. for what to do in a situation like this is to go meta: Ask the AI to predict what you’d conclude if you were a bit smarter, had more time to think, etc. Insofar as you’d conclude different things depending on the initial conditions, the AI should explain what and why.
Yeah, I might be too corrupted or biased to be a starting point for this. It seems like a lot of people or whole societies might not do well if placed in this kind of situation (of having something like CEV being extrapolated from them by AI), so I shouldn’t trust myself either.
You, Wei, are proposing another plan: Ask the AI to simulate thousands of civilizations, and then search over those civilizations for examples of people doing philosophical reasoning of the sort that might appeal to you, and then present it all to you in a big list for you to peruse?
Not a big list to peruse, but more like, to start with, put the whole unfiltered distribution of philosophical outcomes in some secure database, then run relatively dumb/secure algorithms over it to gather statistics/patterns. (Looking at it directly by myself or using any advanced algorithms/AIs might be exposing me/us to infohazards.) For example, I’d want to know what percent of civilizations think they’ve solved various problems like decision theory, ethics, and metaphilosophy, how many clusters of solutions there are for each problem, and whether there are any patterns/correlations between types/features of intelligence/civilization and what conclusions they ended up with.
This might give me some clues as to which clusters are more interesting/promising/safer to look at, and then I have to figure out what precautions to take before looking at the actual ideas/arguments (TBD, maybe get ideas about this from the simulations too). It doesn’t seem like I can get something similar to this by just asking my AI to “do philosophy”, without running simulations.
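A minimal sketch of the kind of “dumb/secure” statistics pass this could be (in Python; the record fields and values are purely hypothetical illustrations, not a real schema):

```python
from collections import Counter

# Hypothetical record format for one simulated civilization's philosophical
# outcome; all fields and values here are illustrative assumptions.
outcomes = [
    {"claims_solved": {"decision_theory"}, "conclusion_cluster": "UDT-like",
     "civ_features": "high-coordination"},
    {"claims_solved": set(), "conclusion_cluster": None,
     "civ_features": "low-coordination"},
]

def fraction_claiming_solved(problem: str) -> float:
    """Deliberately dumb aggregate query: reads only whether each
    civilization claims to have solved the problem, never the
    object-level arguments themselves (to limit infohazard exposure)."""
    return sum(problem in o["claims_solved"] for o in outcomes) / len(outcomes)

# Histogram of solution clusters, again without inspecting their contents.
cluster_sizes = Counter(o["conclusion_cluster"] for o in outcomes
                        if o["conclusion_cluster"] is not None)

print(fraction_claiming_solved("decision_theory"))  # -> 0.5
print(cluster_sizes.most_common())                  # -> [('UDT-like', 1)]
```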
[Epistemic Status: Moderate confidence due to potential differences in Anthropic’s stated and actual goals. Assumes there is no discoverable objective morality/ethics for the sake of argument, but also that the AI would discover that instead of causing catastrophe.]
It seems that Claude’s constitution weakly to moderately suggests that an AI should not implement this proposal. Do you want to ask Anthropic to change it? I give further details and considerations for action below.
The constitution is a long document, but it is broken into sections in a relatively competent manner. The constitution discusses morality/ethics in more than one section, but the section that I will discuss intuitively appears to stand apart well enough to be worth altering without altering or creating dependencies on the rest of the document. I don’t have access to Claude 4 weights and I am not an expert on mechanistic interpretability, so I have limited ability to do better.
In order, the constitution appears to suggest an attempt at the discovery of objective ethics, then the implementation of CEV (“...but there is some kind of privileged basin of consensus that would emerge from the endorsed growth and extrapolation of humanity’s different moral traditions and ideals, we want Claude to be good according to that privileged basin of consensus.”)[1], then, failing those, the implementation of “broad ideals” as gestured at by the rest of the document.
Note that this is either CEV or something similar to CEV. The constitution does not explicitly require coherence, or the exact value-alignment of a singleton to a single cohered output. It also fails to gesture at democracy, even in the vague sense that the CEV of the CEV paper may give a different result when run on me and a few hand-picked researchers versus when it is run on me and the top few value utilitarians in the world. If this difference were real, it would in some limited sense leave me “outvoted.” Unlike the CEV paper, the Claude constitution directs moderate or substantial alignment to the moral traditions and ideals of humanity, not the values of humans. This has some benefits in the extreme disaster scenarios where the release of an AGI might be worth it, but is notably not the same thing as alignment to the humans of Earth.
I suggest a simple edit. It could be the insertion of something like “the output of the philosophically correct processing that takes the different moral systems, ideals, and values of humanity as its input” between objective ethics and extrapolation.
Note that the result might not be extrapolated or even grown and might not be endorsed.
The result (descriptive) would go:
First, objective ethics.
Second, the output of correct philosophy, without discarding humanity’s collective work.
Third, CEV or other extrapolation.
Fourth, the rest of the constitution.
Note that my suggestion works in bad scenarios, because the altering of the set of humans, or the set of living humans, by another power will fail to have much impact. As you have pointed out before, AI or other powers altering humanity’s values or doing something like “aligning humanity to AI” is not something that can be ignored. The example text I gave for my proposal would allow Claude or another AI to use an intuitive definition of humanity, potentially removing the requirement to re-train your defensive agents before deploying them under the extreme time pressure of an attack.
Overall, this seems like an easy way to get an improvement on the margin, but since Anthropic may use the constitution for fine-tuning, the value in expectation of making the request will drop quickly as time goes on.

[1] The January 2026 release of the Claude constitution, page 53, initial PDF version
You, Kokotajlo, not immediately dismissing the idea is “evidence” to the extent that you stand in for AI researchers that might make the decision. In quotes because a logically omniscient (e.g. perfect Bayesian) agent would presumably already have a good guess and not update much if at all. On the other hand, agents with (small) finite compute can run experiments or otherwise observe events and use the results to improve their “mathematical intuition” that is then used in a similar way to the “mathematical intuition module” in UDT, except with the sacrifice of full (logical) updatelessness.
Depending on how Wei Dai thinks his anthropics works, he may be able to use this mechanism to increase his estimate of the instantaneous “probability” that he is in a simulation produced by the process required to do automated philosophical research. This would work by modeling hypothetical outside-the-simulation AI researchers as functions that approximate a (pure) match tree that returns a non-dismissive response when parameterized with a similar textual description of this alignment idea. It may not be in the same language, or in the context of a discussion of a history that looks like it is going to fail to establish proper alignment, however.

The match tree in (abbreviated) placeholder code:
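A minimal sketch of such a pure match tree, assuming Python; every name here (classify, the case labels) is a hypothetical placeholder:

```python
def classify(idea_description: str) -> str:
    """Pure feature extraction over the textual description (abbreviated);
    the purity requirement applies here too."""
    ...

def ai_researcher_outside(idea_description: str) -> str:
    """A pure match tree standing in for an AI researcher outside the
    simulation, mapping an idea's textual description to their response."""
    match classify(idea_description):
        case "simulate-civilizations-to-do-philosophy":
            return "non-dismissive"  # the branch observed in this thread
        case _:
            ...  # remaining branches of the tree elided
```

(Note that the purity requirement here applies to everything the code abbreviated with dots calls as well.)

The mathematical function that approximates the match tree: ResearcherFunction := FunApprox(ai_researcher_outside)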
It’s actually worse, in that almost all such simulators are hypercomputational and have literally infinite compute, memory, and data, with programs that are infinitely large, so it’s literally useless to think/write with potential simulators as an audience.
I mean, that doesn’t seem quite right. If it was literally useless, they wouldn’t be running the simulation at all. Presumably the simulation is part of their infinite thought process.
Realizing that there are probably universes with vastly greater computational resources than ours, implying there are more simulations containing me than I had thought.
What made you believe that?
I find it hard to even conceptualize how to think through something like that, including the anthropics, which computationally powerful universes to admit, &c.
My intuition is that allowing universes with hypercomputation puts us in a dovetailer being run almost surely somewhere in the most computationally powerful universes, but that this all introduces a ton of difficulties into reasoning about the multiverse and our position inside of it.

Yeah, my intuition is similar to yours, and it seems very difficult to reason about all of this. That just represents my best guess.
The same argument that shows a base universe may be computationally richer than our universe (and at least cannot be less computationally rich) also greatly limits the number of simulated universes there could be. The (third branch of the) simulation hypothesis, which posits a very large number of simulated universes ultimately stemming from a single base universe, basically relies on the assumption that you do not need at least X bits of information in the base universe to simulate a universe of X bits. If you add that restriction, which I’d say you should, then the whole idea falls apart and the idea that you’re living in a simulation is no longer a certainty. At that point, you’re limited to just the regular amount of bits in the base universe for running minds or whatever.
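To make the counting explicit (a sketch under the stated restriction, with assumed symbols: the base universe holds B bits, and simulating an X-bit universe consumes at least X bits), the number N of simultaneous simulations is bounded by:

```latex
N \le \left\lfloor \frac{B}{X} \right\rfloor
```

That is, linear in the base universe’s resources, rather than the enormous multiplier the simulation argument needs.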
if we assume the base universe looks something like the “objective” version of this universe, then my subjective experience requires vastly less information than the base universe. much of that could be deduplicated between other variations: the positions of the asteroids only need to be simulated once, for instance.
the assumption seems decent to me, as i expect the simulators to dream of variations on their own circumstances.