Many individual CEVs are probably quite bad

Viliam6 May 2026 20:18 UTC

97 points

I was thinking about Habryka’s article on Putin’s CEV, but I am posting my response here, because the original article is already 3 weeks old.

I am not sure how exactly a person’s CEV is defined. “If we knew everything and could self-modify” seems potentially sensitive to the precise chronological order of “realizing things” and “self-modification”.

Like, imagine Hitler getting the godlike powers of knowledge and self-control. If he gets the perfect knowledge of economy, sociology and psychology first, he could go like: “Oh, now I realize that the things I blamed on the Jews are actually caused by something else. How embarrassing. No more anti-semitism, but I better erase everyone’s memory first.”

But it is also possible that he gets the self-control first, and he realizes that there is such a thing as value drift, and thinks: “Oh my, this could accidentally make me more similar to the Jews. I better hardcode the Nazi ideals in myself immediately, and also give myself blond hair and blue eyes.” And using the superior knowledge, he hardcodes the Nazi values in himself so that they are reflectively stable and survive all updates.

So, Hitler’s CEV seems to depend on the technical details, in which order he gets the new knowledge and the new skills. He could end up either CEV-nice or a CEV-monster.

Seems like “knowledge first, self-modification next” is the preferred order, but that kinda assumes perfect rationality at the beginning. I mean, perfect knowledge without perfect rationality would probably be prone to confirmation bias and other biases. So we might want perfect rationality (or merely improved rationality) first, but making ourselves more rational is already in the realm of self-modification.

Second, it seems to me that Habryka chooses between two models in the article: Either everyone is CEV-nice, or almost everyone is CEV-nice but a few people such as Putin are rare CEV-monsters. Then he concludes, in my opinion correctly given the premises, that Putin doesn’t seem to be that exceptional. Therefore, he is probably CEV-nice.

(By “CEV-nice” I mean: given godlike powers of knowledge and self-modification, he will ultimately become a benevolent God. There may be a few atrocities in the process, but at some moment he will realize that there are no more threats, and therefore no strategic reasons to treat other people badly. And, getting the strategic reasons out of the way, there are basically no other reasons to hurt people. And by a “CEV-monster” I mean someome who will end up hurting people for non-strategic reasons even after feeling perfectly secure in their godlike powers.)

If this is correct, then I simply reject the premise. I think that although most people are probably “CEV-good”, there are also quite many “CEV-monsters”, i.e. people who value suffering for the sake of suffering (of others). I don’t know how many, but as a very rough estimate, let’s say between 5% and 50%? Under that premise, it doesn’t seem that unlikely that Putin would happen to be one of them. (Or Trump, etc.) I would assume that monsters are over-represented in the positions of power, simply because on the way to the top there are many situations where people have to choose between hurting someone and losing an opportunity to gain more power, so the intrinsically nice are at a disadvantage.

I would also add a third category that I will call “CEV-insane”, someone who after obtaining godlike powers would destroy everything that we value, even for themselves. For example, someone who believes that death gives meaning to life, and the intelligence is a source of misfortune, so he magically establishes a law that everyone will be mortal and of average intelligence. Or a Buddhist who believes that life is suffering, and there is no such thing as “self” anyway, and decides that the omnicide is the right way to go, no more dukkha. Or some eco-fanatic who decides that the homo sapiens or the intelligence itself is a problem, and must be eradicated. Or simply a person who does self-modification wrong, and destroys some of their essential human qualities, while retaining the ability to decide the fate of the universe. I think there are also enough people like this.

I admit that these are very dark ideas, but looking around me, it seems like the world we live in is indeed quite dark. It’s not like people are born good or bad (although, as far as I know, e.g. psychopathy is hereditary to some degree), but more like we move towards some attractors that are self-reinforcing enough to keep us there even after the original forces are gone. Good people keep wishing they remained good, and might even self-modify towards more good, given the tools. But assholes who don’t give a fuck will see no reason to self-modify into someone who gives a fuck. (This is why a group CEV seems like a safer option, if there are some good people in the group, because the good people might choose the good for everyone else, while the assholes might decide that they don’t care either way as long as they are left alone.)

Viliam6 May 2026 20:18 UTC

97 points

30 comments3 min readLW link

Seth Herd 6 May 2026 20:54 UTC
37 points
20
I’ve also been meaning to write about this. I think it’s a surprisingly load-bearing piece of whether to have optimism about AI outcomes on the current path; if we solve alignment, it’s likely to be instruction-following, and human rivalry being what it is, it will probably leave someone fairly power-hungry in charge of the future.

I don’t have time to do it justice now, but a few thoughts:
Locking in your current values so you can’t change your mind later seems like a very weird move. It doesn’t seem like most people would do that.

I certainly agree that “many” people would have negative CEV. Hitler would be a good candidate.

But I’d guess that around 1% of the population has more sadism than empathy in their makeup, once they’re in a zero-pressure situation. There’s a rough estimate of around 10% of the population being “psychopaths” who lack empathy; but like all other neural phenotypes, this probably exists on a spectrum, with zero empathy not really being a thing but people approaching it. See @Dawn Drescher’s impressive work on understanding psychopathy in depth.

I think empathy-minus-sadism (on whatever scales) is the right way to think about people’s CEV. Of course things like locking in current values to prevent drift are possible, but they seem like dumb moves.

Dumb moves seem unlikely when you’ve got an ASI on your side. Surely you’d at least ask it for some thoughts before permanently modifying yourself? You’re already trusting it to do what you say.

Assholes who don’t give a fuck will not want to start giving a fuck, but they will want the freedom to go on living and thinking new thoughts. Those new thoughts may very well cause them to start giving a fuck. I’d expect this process to be cumulative one of having less angry/fearful/defensive thoughts as the habits from a life of pressure and scarcity fade.

I tried looking at historical examples of hereditary rulers who lived pretty pressure-free lives. It seemed like about half of them turned out to be pretty good rulers. The others were captured by the nobles around them and ignored the commoners in favor of their “friends” selfish concerns. I’d find this pretty likely for a while, but having an ASI friend who’s unbiased, and just having more time to look outside of that box, seems likely to break that situation, at least some of the time.

The counterbalance to my relative optimism is the extent to which psychopaths seek and find power. I think they’re overrepresented, but that would only raise my baseline risks to about 10% of getting a god-emperor with more sadism than empathy.

There’s a ton more to say, but I’ll leave it at that for now. This is an important topic that deserves careful analysis. That doesn’t seem to have been done, despite all the philsophy and humanities work on human nature in a scarcity context. So I’m glad to see that started here.
- Viliam 6 May 2026 21:15 UTC
  13 points
  4
  Parent
  Locking in your current values so you can’t change your mind later seems like a very weird move. It doesn’t seem like most people would do that.
  I could imagine e.g. religious people doing that, to make sure they won’t apostatize.
  Heck, I would probably do something similar myself—not exactly locking my current values, but more like setting up some safeguard against a possibility of doing something that my current self would consider abhorrent. Something like “hey, I don’t want to micromanage my future self, but if the future becomes a universe literally full of paperclips, or torture chambers, then something went horribly wrong, and maybe my mind should be reset to its current state and given a chance to reflect on how that possibly happened”.
  It is hard to say what is an acceptable change and what is not. If I was an ancient Roman, I would like to keep myself an option to abolish slavery, no matter how weird it would feel at the start. So I would be like “if for some reason the entire humanity decides to transform themselves into intelligent bunnies, well it seems weird to my current self, my maybe from the perspective of my wiser self it somehow makes perfect sense”. I wouldn’t want to limit my future self at deciding e.g. whether to colonize the galaxies or just build a huge Dyson sphere with a Matrix at the center of our galaxy. I have a preference for letting people freely do whatever they want, but I think it would make sense to limit some harmful options, and I would let my wiser future self to decide on the exact rules. I realize that both making and not making safeguards is a very dangerous option. Would probably err on the side of making the safeguards but making them updatable, like “the future me has to convince the current me to agree on removing the safeguard”, and of course even that is dangerous, if the future me happens to be much smarter than me but also evil. Absolute immovable locks seem wrong, but general conservatism (especially if we get immortality, so the wasted time matters less^[1]) sounds prudent.
  I agree about the rest.
  1. ^
    Even that opinion is controversial.
  - Seth Herd 6 May 2026 23:55 UTC
    6 points
    1
    Parent
    That all makes sense.
    
    That approach wouldn’t (probably) prevent anyone from becoming a better person.
    
    I’ve thought of a similar scheme of rollbacks for any major change: if I implement a change, I’m sandboxed and rolled-back after a little while. Then my current version decides whether to re-implement that change, based on the record of that new trial version’s mindstate and thoughts. I’m not sure if this is better or worse than keeping a copy of your approximately-original self to approve all major belief shifts.
    - Viliam 7 May 2026 7:52 UTC
      8 points
      0
      Parent
      I imagine two kinds of possible disasters with self-modification (actually a scale, but these are the two extremes).
      One is making a mistake, something that sounded good but had a huge horrible side effect that I simply failed to consider. The bad consequences become obvious after one or two iterations; the important thing is to keep checking constantly, to stop it before everything is destroyed.
      Another kind is noise accumulated after thousands of iterations, like in the Murder-Gandhi thought experiment. Each step seems like a reasonable tradeoff, like making myself a little bit more consequentialist, or a little bit more resistant against blackmail, or whatever… but after million iterations I become a psychopath (and on reflection my new self considers that a desirable outcome).
      With the first kind, we would want short sandboxes, to catch the problem early. With the second kind, we would want long sandboxes, to notice the accumulated value drift.
      - Seth Herd 7 May 2026 16:45 UTC
        2 points
        0
        Parent
        I agree that these are legitimate concerns. I think you could avoid a lot of them in this scenario, because you have an ASI you trust to help you foresee and avoid those dangers.
- Linch 7 May 2026 18:41 UTC
  7 points
  2
  Parent
  The others were captured by the nobles around them and ignored the commoners in favor of their “friends” selfish concerns
  How much do you buy this as a correct retelling? “The King is Just and Honorable, but was ‘deceived’ by his advisors” seems like a recurring motif in both fiction and people’s beliefs. So maybe it’s real. But also there are obvious incentives for that narrative, regardless of truth value.
  - gwern 7 May 2026 23:11 UTC
    10 points
    3
    Parent
    It’s worth noting that there is also a reason beyond mere status quo/conformity/propaganda: this is a standard move in ‘top/bottom vs middle’ dynamics in which the peasants really are trying to reach the king in order to coordinate an attack: https://gwern.net/review/book#the-origins-of-political-order-fukuyama-2011
    - Linch 8 May 2026 23:31 UTC
      2 points
      0
      Parent
      Yeah, in addition to your/Fukuyama’s examples, the Culture Revolution was arguably a recent example of this dynamic.
    - Linch 7 May 2026 23:50 UTC
      2 points
      0
      Parent
      Also mentioned in The Prince, of course.
jimmy 7 May 2026 4:40 UTC
8 points
−4
When we think we want to go to the Chinese food restaurant, it’s because we want to eat the food. When we learn that it’s closed and don’t wanna go anymore, that’s not “value drift” to hold off, it’s learning that what we thought we wanted isn’t serving what we really want.
When we think we want to eat the Chinese food, it’s not because of some terminal “It’s just really good”. Get food poisoning once, and you’ll notice that there is indeed accounting for taste. “Yummy” becomes “Yucky” when we realize that the underlying value isn’t being delivered. I’ve even seen things as “innate and fundamental” as sexual orientation shift as it becomes clear that there is/isn’t value to be delivered.
How can you prevent the drift of value, but by sticking one’s head in the sand and refusing to learn that it’s not giving us what we want? How can we want to prevent this “value drift”, once we “know more and think faster” enough to understand the mechanism? What you describe as “getting the self-control first” is making sure to not get the knowledge.
“Realizing things” is not separate from “self-modification”, such that they can come in different orders. Realizing things is what modifies the self. Realizing that the Chinese restaurant is closed modifies your self. Realizing that your spouse has been cheating on you modifies your self. Realizing that Jews aren’t actually what you need them to be changes your self.
The Hitler who updates on what his Jew hatred brought him becomes a different Hitler, regardless of the order, so long as he does the update. Some orders may make it easier to accept the update than others, and therefore might make it more or less likely that he updates, but if he learns that the value is not delivered then he learns the value is not delivered. And if it’s true, and he doesn’t learn, then he has not CE’ed his V.
- Viliam 7 May 2026 8:50 UTC
  5 points
  0
  Parent
  I agree that the process of learning always changes you. Tasting a food you have never tried before switches you from neutral-ish^[1] to either liking the food or disliking the food; either one is a change.
  This process is not linear. For example, there is food (coffee, blue cheese, beer, dry wine...) that tastes terrible on first try to most people, but if you stick with it, sometimes you reprogram your brain to start enjoying it. If you are already reprogrammed, the choice is already made, but if you are a beginner… does it make more sense to respect your current preferences and avoid the food, or to keep eating it anyway because you know this decision will make you happy after you reprogram your brain?
  I guess in real life, many people simply get reprogrammed under social pressure. Stage one, they would prefer not to do it, but they have a stronger preference to be perceived as cool by their environment. Stage two, now they are reprogrammed and there is no conflict.
  But suppose that you are perfectly above all social pressure, and you happen to not like beer at the moment. Is there a good reason for you to choose the reprogramming anyway? Sure, if you do, you will be happy that you did. But if you won’t, you won’t miss it. (Maybe curiosity is the remaining argument for the change? But without the social pressure, there is no reason to be curious about this specific thing.) So it seems to me that there are two different reflectively stable outcomes, and it depends on history where you end up.
  We could even go further; with superior self-modification skills I could give myself a completely arbitrary preference, for example a sexual fetish for triangles. It seems silly now, but after modifying myself, and decorating my palace with paintings of triangles, I would probably be happy that I did it. I may even feel a little curious right now about what such absurd situation would be like. So should I modify myself that way? Seems like the answer is no, but maybe if I had too much time and got bored… well, I also don’t have a strong reason against doing that, except for some general “let’s not do arbitrary self-modifications, even if individually mostly harmless, the cumulative effect might dilute too much the things I care about now”.
  Perhaps good and evil^[2] are also acquired tastes; I am not sure about this part, but it feels plausible.
  “Realizing things” is not separate from “self-modification”, such that they can come in different orders. Realizing things is what modifies the self.
  All new knowledge is self-modification, but not all self-modification is new (external) knowledge. You can also self-modify by resolving internal tensions. Or by changing a random connection in your brain. Also, if you obtain new knowledge in different order, the later information gets interpreted in the light of the former.
  The all-knowing Hitler would know that his original reasons for hating Jews are no longer valid, but he might retain an aesthetic preference for doing so, for example because the very emotion of feeling superior to someone feels enjoyable.
  1. ^
    Even people neutral about a specific food may have a meta preference about experimenting with new foods in general.
  2. ^
    I am not going to provide an exact definition, but I think there is some generic desire for the world to be a nice and happy place also for others, versus just not caring about that and therefore effectively sacrificing the happiness of others for whatever things I do care about.
  - jimmy 7 May 2026 16:45 UTC
    6 points
    0
    Parent
    Thanks for the detailed response
    But suppose that you are perfectly above all social pressure, and you happen to not like beer at the moment. Is there a good reason for you to choose the reprogramming anyway? Sure, if you do, you will be happy that you did. But if you won’t, you won’t miss it. (Maybe curiosity is the remaining argument for the change? But without the social pressure, there is no reason to be curious about this specific thing.) So it seems to me that there are two different reflectively stable outcomes, and it depends on history where you end up.
    In practice, this path dependency thing is indeed important, but it has a lot to do with our tendency to get lost and fail to find coherence.
    For example, instead of beer what about heroin? Needles are icky, but boy will that change if you give it a shot! What happens in the longer term isn’t so simple as “more happy” though, and especially when the effects come from exogenous chemicals, we can’t really trust our initial pleasure to cache out in anything real.
    This can make heroin very risky because people will often fail to learn that heroin injections are yucky again, but that is where the road coheres to. I don’t have any experience with heroin, but I have tried legally prescribed opioids a couple times and went through this arc. After the first time I couldn’t stop thinking about it for a month because it felt so good. Eventually though, my brain kinda recognized that this is not actually a good thing, I don’t actually want it, and when I eventually tried it again “just because” it wasn’t even enjoyable.
    Beer and coffee are a lot more subtle and have context dependent social stuff going on, but “If you do you’ll be glad you did” is far from a sure thing.
    We could even go further; with superior self-modification skills I could give myself a completely arbitrary preference, for example a sexual fetish for triangles. It seems silly now, but after modifying myself, and decorating my palace with paintings of triangles, I would probably be happy that I did it. I may even feel a little curious right now about what such absurd situation would be like. So should I modify myself that way?
    I mean… I kinda vote “yes”. Because curiosity is important, the things you learn from experience are important, and this is a relatively harmless example to experiment with.
    In my experience though, arbitrary modifications like this aren’t very stable. If there isn’t any actual value being delivered, and you don’t try to set up a labyrinth of motivations to not-look, people tend to learn that triangles aren’t so exciting as they had tricked themselves into believing.
    All new knowledge is self-modification, but not all self-modification is new (external) knowledge. You can also self-modify by resolving internal tensions. Or by changing a random connection in your brain.
    A railroad spike to the brain will indeed indeed modify you, and is not well described as “learning new things!”. But resolving internal tensions usually is. And the railroad spike to the brain/random connection change is generally well described as losing things you have learned, or learning (expected) falsehoods.
    Also, if you obtain new knowledge in different order, the later information gets interpreted in the light of the former.
    This is another “probably in practice, but only because we don’t reach coherence” things. Maybe you consider the latter in terms of the former without going back to reinterpret the former in terms of the latter, but Bayes doesn’t justify this failure to propagate updates with any sort of path dependence.
    The all-knowing Hitler would know that his original reasons for hating Jews are no longer valid, but he might retain an aesthetic preference for doing so, for example because the very emotion of feeling superior to someone feels enjoyable.
    My point is that what we think of as inscrutable aesthetic preferences are built upon implicit beliefs about the world, and updating the underlying structure changes the aesthetics. Hitler may like to feel superior, but his potential superiority is itself a fact about reality that he could update on.
    What happens when you sit with the question “Are you superior?”?
    There’s often an impulse to flinch away, and refuse to update, but when you do, things change.
- StanislavKrym 7 May 2026 6:00 UTC
  1 point
  0
  Parent
  Then how does one tell apant the true terminal values and instrumental ones? Does it mean that the CEV of an individual human is likely to be some combo of satisfaction of primitive values, fun-theoretic ones, idiosyncratic ones and of a way to instill decision-theoretic results (like coordinating with others in prisoner-like dilemmas) into our primitive brains? And how would the latter two value types be changeable? How would they change in AIs?
  - jimmy 7 May 2026 18:29 UTC
    2 points
    0
    Parent
    Then how does one tell apant the true terminal values and instrumental ones?
    I don’t know.
    So like, one time I was playing football with my cousins on Thanksgiving, and hurt my foot bad enough that I thought it was broken. As I dug into why reality was different than I wanted, at first there was always something underneath.
    “I wish my foot wasn’t hurting. Of course I wish my foot wasn’t hurting, who wouldn’t? I acknowledge the fact that my foot is hurting. Why is it hurting? Because it’s broken, lol.”
    So then “I wish my foot wasn’t broken”, why’s it broken? Oh, because I was playing football and tripped. Why’d I trip? Shit just happens, man. So, “Of course shit happens, so of course I broke my foot when I was playing football with my family, so of course it hurts.” What’s the problem? Nothing, actually. Suffering resolved.
    In that case, my desire to not be in pain bottomed out at not wanting to have hurt myself unnecessarily, and “shit just happens”, but like… the fact that it was serving as a terminal in this context is still dependent on the fact that I didn’t believe I could do anything about it. If there had been some nootropic that cuts the rate of mistakes to 10%, then “Why did shit happen?” now has an answer of “Because you didn’t improve your brain function, dummy”, and we’re back to the races.
    I know how to get to the practical bottom of things, for a given context. But as some sort of general case where we remove “all” practical limits… I dunno man. I’m not sure the question is well formed. I’m not sure it’s not.
    Does it mean that the CEV of an individual human
    I don’t know where it grounds out, just where it doesn’t. Which is useful on the margin, and maybe even large margins, but in the event of a singularity where limits are removed past the point where we know what to make of it.. it’s still tough to say what that means kinda by definition.
    Friston has some insights which seem relevant to me though. He talks about his “theory of every thing” (humorously distinct from “theory of everything”), and explains that every thing that exists must necessarily resist entropic forces towards disintegration or they would cease to exist. So in that sense, everything that exists seeks to maintain its own existence -- including drops of oil in water.
    Humans obviously tend to actively maintain their own existence, but also lineages of humans on longer timescales. It’s not clear to me what happens when these things conflict and what “lineage” actually means once transhuman stuff becomes possible.
avturchin 7 May 2026 8:14 UTC
7 points
−4
I used to think that “coherent” here includes that it is also coherent with values of other people, so there is no personal CEVs.

Such universal-CEV may include personal-CEVs but implemented in the simulations without CEV-version of “suffering conscious beings”.
- Viliam 7 May 2026 8:57 UTC
  5 points
  1
  Parent
  I guess in personal context it might only mean “consistent”. But yes, great point!
kbear 7 May 2026 2:59 UTC
4 points
−1
I would assume that monsters are over-represented in the positions of power, simply because on the way to the top there are many situations where people have to choose between hurting someone and losing an opportunity to gain more power, so the intrinsically nice are at a disadvantage.
this seems to be smuggling ‘sadistic’ for ‘not unwilling to hurt others in zero-sum games’.
it may be the case that the road to power requires hurting others instrumentally. that does not imply that those who hurt others terminally have an advantage.
“in baseball, you have to run to get on base. therefore, those who love running have an advantage.”
- Martin Randall 7 May 2026 11:44 UTC
  8 points
  1
  Parent
  People who love running have a slight advantage in baseball. They enjoy running so they do more of it so they are better at it. People who love running are slightly over-represented in prominent baseball positions. For similar reasons, people who love playing baseball have an advantage in baseball and are over-represented in prominent baseball positions.
  - kbear 7 May 2026 15:56 UTC
    3 points
    −2
    Parent
    time is finite, and time spent practicing running trades off against more core competencies such as hitting, fielding, pitching, ….
    someone who looks to baseball as an outlet for their running hobby will be very disappointed. people who like running will instead play soccer, track & field, basketball. for this reason, i actually expect enjoying running to be anti-correlated with attained success at baseball^[1].
    looking at the matter at hand, it just obviously proves too much!
    in order to compete in the mlb, one must succeed where others fail.
    sadistic powerseekers like to watch others fail, so they have an advantage here.
    therefore, sadistic powerseekers should be overrepresented in the mlb.
    to be clear, if the argument is instead of the form “sadistic powerseekers find an outlet in positions of power, so seek them out. thus we would expect them to be overrepresented there” then i completely agree: makes sense, seems born out by evidence, no notes.
    ^
    in particular, at the highest levels of the sport.
- Viliam 7 May 2026 12:10 UTC
  7 points
  5
  Parent
  They are not the same thing, but the size of the reward influences the equation.
  The same zero-sum game...
  - ...one person’s potential reward is “I win!” (1 util)
  - ...other person’s potential reward is “I win! also, ahaha, look at those losers crying!” (2 utils)
  Which one is motivated to spend more resources on winning this specific game?
  - kbear 7 May 2026 16:08 UTC
    1 point
    0
    Parent
    for exactly the reason you describe, they will be worse at cooperating in games similar to prisoner’s dilemma. they will not be good at coordinating on the ultimatum game.
    in addition, they will have a worse reputation.
    it’s just not clear to me that valuing your opponent’s loss is long-term favorable across the kinds of decisions that people face as they rise to power.
    Which one is motivated to spend more resources on winning this specific game?
    right, but… should they? rising in politics involves a sequence of victories. burning more resources than is warranted by the zero-th order power you receive seems long term disadvantageous. i would expect players who are motivated solely by power to accumulate more of it, as they will more wisely spend their available capital.
cubefox 7 May 2026 8:33 UTC
3 points
−1
But it is also possible that he gets the self-control first, and he realizes that there is such a thing as value drift, and thinks: “Oh my, this could accidentally make me more similar to the Jews. I better hardcode the Nazi ideals in myself immediately, and also give myself blond hair and blue eyes.” And using the superior knowledge, he hardcodes the Nazi values in himself so that they are reflectively stable and survive all updates.
I disagree with this. Hardcoding formerly instrumental goals to be terminal goals is never useful. Because whether something is useful for you depends on whether it promotes your current terminal goals. Modifying your terminal values would only compete with the existing terminal values. Value drift is only a problem if it changes terminal goals. A mere change in instrumental goals is not a problem. The opposite is the case: we expect our instrumental goals to get better (according to our terminal goals) the wiser we become.
- Steven Byrnes 8 May 2026 14:13 UTC
  2 points
  0
  Parent
  Why do you think that quoted example would be in the category “hardcoding formerly instrumental goals to be terminal goals”, rather than the category “hardcoding terminal goals to prevent them from changing”?
  - cubefox 8 May 2026 18:40 UTC
    2 points
    0
    Parent
    Look at the previous paragraph:
    Like, imagine Hitler getting the godlike powers of knowledge and self-control. If he gets the perfect knowledge of economy, sociology and psychology first, he could go like: “Oh, now I realize that the things I blamed on the Jews are actually caused by something else. How embarrassing. No more anti-semitism, but I better erase everyone’s memory first.”
    This suggested (and I think this is in fact plausible) that Hitler’s hate for Jews was not a terminal value of his, but that he mistakenly believed they were causing various bad outcomes via conspiracy, so he concluded they were evil. Then hardcoding this dislike for Jews would be transforming an instrumental value into a terminal value.
    I also think that “hardcoding terminal goals to prevent them from changing” doesn’t make a lot of sense, since terminal goals are already pretty much as hardcoded as the brain wetware allows. E.g., if you liked lasagna in your 20s, there is a high probability that you still like it in your 60s.
MinusGix 7 May 2026 21:37 UTC
2 points
0

And by a “CEV-monster” I mean someome who will end up hurting people for non-strategic reasons even after feeling perfectly secure in their godlike powers.) If this is correct, then I simply reject the premise. I think that although most people are probably “CEV-good”, there are also quite many “CEV-monsters”, i.e. people who value suffering for the sake of suffering (of others). I don’t know how many, but as a very rough estimate, let’s say between 5% and 50%?

You need to make a distinction between the negative element of a person’s CEV which scales with population (or something similar), and one which does not. Collapsing those confuses about scale.
For example, there’s plausible CEVs wherein someone takes revenge against some particular people or even a whole class of people who were their enemies, but which are otherwise nice. That, I could perhaps believe is 5-50%, though even then I doubt it is that large. Whether ‘they get killed’ or ‘they get tormented forever’. My view is that these worlds are very unfortunate but also still are placed among the most non-destructive sorts of worlds and are pretty good comparatively.

Then, there are those who would instate biblical Hell, but I doubt that is anywhere near 50% even before given knowledge. After knowledge, I have much higher doubt. That is, most religious issues and political specific turns of fate dissolve under truth and reflection and that’s what most of your examples are.
So it becomes a question of what proportion of the population desires large-scale suffering on reflection which to me is <5%. If this wasn’t the case, the world would look very different.

I do agree that monsters are more likely to be in positions of power, which should influence our reticence to give them such power but I feel we have dramatically different mental models of the degree of selection pressure.

I also do agree that some people may have substantially different fundamentals that cut off a lot of value, like possibly internally-coherent Buddhist philosophies which don’t rest on factual observations of the world, but that those are also quite rare. That is, most belief systems have some meaningful referent to facts about the world, or facts about people and thus turn dramatically given proper knowledge.

My view on your first hypothetical value lock-in example is that it is presupposing we messed up implementing CEV. So I don’t really consider that relevant.
If you get knowledge and control, then you can consider methods to more safely lock-in. So perhaps we do get some less-valuable futures due to lock-in stopping truly-better routes from being instantiated, but that I expect a CEV-enhanced individual to be able to consider much better methodologies than naive “ensure I believe X without referring to the nature of the world at all” (ala, your Hitler example; or the religious person enforcing their belief in God). My view is basically that you’re mostly considering “What if an unenhanced human got the power”; rather than “enhanced human” or even “unenhanced human with an AI they can ask for help from”.
Trinley Goldenberg 7 May 2026 16:16 UTC
2 points
0
So, Hitler’s CEV seems to depend on the technical details, in which order he gets the new knowledge and the new skills. He could end up either CEV-nice or a CEV-monster.
This sort of reminds me of Scott Alexander’s Murder Ghandi example. Many moral updates which are stable when extrapolated to the limit, are path dependent when done incrementally.
But it also has a decision-theory dependent flavor to it, it seems like the right decision theory wouldn’t struggle with this.
But that itself seems recursive, since it runs into the same path dependence problem, as you could easily incrementally change your decision theory based on the results of your current decision theory into a bad basin.
Ben Livengood 8 May 2026 1:05 UTC
1 point
0
Globally about ³⁄₄ of humans identify with some religious belief. Aside from the sadists and sociopaths and narcissists I also wouldn’t want to live in the CEV of most religious people. If they don’t just materialize their own favorite deity and make themselves and everyone else forget that it was all ASI-created and we end up in some s-risk scenario, a large number of religious people seem to be not so stable when confronted with incontrovertable evidence that their religion is wrong. Presumably the ASI wouldn’t sugarcoat things. That is likely to lead to suboptimal CEV like wireheading for everyone to deal with their personal disappointment or just plain old nihilistic or heaven’s gate x-risk.
Dacyn 6 May 2026 21:17 UTC
1 point
−2

I think that although most people are probably “CEV-good”, there are also quite many “CEV-monsters”, i.e. people who value suffering for the sake of suffering (of others).

Just to be clear, you are labelling as “CEV-monsters” people who value justice for its own sake, even if/when justice involves some amount of suffering? I don’t think the “monster” label is appropriate, even if you disagree with the position.
- Viliam 6 May 2026 21:22 UTC
  4 points
  0
  Parent
  Depends on details, and I don’t know exactly where to draw the line.
  Justice for the sake of justice, in the Old Testament style “I will place a tasty apple in front of you, and if you fail the marshmallow test, I will make you and your descendants suffer” seems clearly monstrous to me.
  Some kind of “if you hurt others, you will be hurt proportionally” seems completely fair.
  There is a gray area somewhere in between.
  I don’t think that my argument changes substantially depending on how we round up the gray area. Yes, some people will be there, but I expect many to be CEV-monsters from the perspective of us both, even if some of them will consider e.g. “if someone doesn’t believe in my religion, they deserve to be tortured forever” to be a perfectly fair and good rule, and would be horrified by an alternative. (I am not going to be a relativist here and apply the label “good” generously to anyone who believes themselves to be good, regardless of their actions. Existence of a sincere paperclip worshiper would not make tiling the universe with paperclips a good outcome.)
  - Dacyn 6 May 2026 21:57 UTC
    1 point
    0
    Parent
    OK, sounds reasonable.