Vladimir Putin’s CEV is probably not that bad

(Written quickly for Inkhaven, I hope someone someday makes a better case for this than I will here)

Kelsey Piper on Twitter:

me: it’s not okay to hit your sister
5yo: is it okay to kill Vladimir Putin?
me: …yes, if you were in a situation where it was somehow relevant it’s okay to kill Vladimir Putin
5yo: well, my sister is WORSE than Vladimir Putin

Now, I do think Vladimir Putin is probably a pretty bad man, all things considered. I am personally sympathetic to the current equilibrium among major nation-states of not assassinating the leaders of foreign nations, so I am not actually sure whether it would be okay for Kelsey’s 5-year-old to kill Vladimir Putin, but I am pretty on board with thinking he has done some pretty terrible things, and probably lacks important aspects of a good moral compass.

But in AI discussions, I often see this concern extended into a much stronger statement: “Even if Vladimir Putin had all the things he wanted in the world, and was under no pressure to maintain his control over Russia, and could choose to make himself smarter and wiser, and could learn any fact he wanted and get the result of any experiment he was interested in, then Vladimir Putin would still do terrible things with the world” (this extrapolation process being known as “Coherent Extrapolated Volition”).

My guess is that much of the belief in Putin’s depravity in such a situation is downstream of a mixture of social dynamics reinforcing negative judgements about political enemies, and a devil’s-horns effect where evil people must be evil in all ways, instead of just some.

The success of true crime podcasts, notorious for overstating the depravity of the people they cover and the fanaticism behind their crimes, illustrates the most common errors here. There is a common social attractor of really wanting license to declare someone the outgroup, and permission to extend them no care or be cruel to them.

While I do buy a correlation between ending up in a powerful leadership position in an autocratic country and being evil, most of the bits of selection on who ends up in that kind of position must go into competence, not various correlates of evil.

And it’s far too common for people to believe the leaders of opposing nations are evil, while their own leaders are just. So at the outset, we should expect people to strongly overestimate how evil powerful people in foreign social groups, institutions, or countries are. And if someone is evil in one way, yes, they will probably also be more likely to be evil in some other ways, but not all other ways, especially ways that are much more removed from our intuitions about people, like how someone would behave after enormous amounts of cognitive reflection.

But that still leaves a non-trivial correlation between potentially relevant evil tendencies and power. This creates cause for concern that various powerful people around the world might really mess up the future if put in a position to do so. And while I don’t think I have great answers to all concerns, I think some common ones I’ve heard are weak and can be addressed.

To be clear, I’m not arguing from moral realism. I don’t think all minds, as they get smarter and wiser and have their basic needs fulfilled, converge on the same values. Most animals and most AI systems, empowered this way, would end up at quite distant parts of the value landscape.

Possibly even humans radically diverge from each other too, as they reflect and change themselves.

What I’m objecting to is the claim that the traits we associate with evil (being a dictator, a ruthless CEO, a scammer) make someone so bad at the reflection process that their extrapolated output would be worse than what you’d get by extrapolating a random non-human mammal, or a current LLM like Claude or ChatGPT[1].

And so I see people propose things like “American AI must be built before aligned Chinese AI,” preferring a US-led AI over slowing down and risking China aligning systems to Xi Jinping’s values. Of course I’d rather have an AI aligned to my own values, and of course the game theory of how to navigate a situation like this is tricky, but I think this is a game much better won by someone than by no one.


I don’t have a confident model of when someone’s moral extrapolation will come out good or bad. But my best guess is that the vast majority of humans, including those we’d call bad actors, would want to create a world full of flourishing, fulfilled beings — happy in specifically human ways, telling stories that are interesting the way human stories are. Maybe those beings will be copies of whoever’s values got extrapolated, maybe children of them, maybe strange new minds that still carry their spark of humanity.

Putin has friends too! So does Xi Jinping, and so do almost all other powerful people in history, evil or not. Their days are probably mostly filled with mundane concerns and mundane preferences, of the kind reflective of what it’s like to be human. They almost certainly have people they love and wish well and would like to empower, and a sense of beauty shared with most humans. Inasmuch as they are patriotic, they would like to see their country prosper and its values propagated.


A common belief I have encountered is that people are mostly evil by choice. I think that’s true in a small minority of cases, but my best guess is that evil in the world is mostly driven by the kind of dynamics outlined in The Dictator’s Handbook.

A lot of what looks like “evil values” in leaders is really a selection effect: once you’re at the top of a small-coalition regime, keeping power requires doing specific nasty things. You have to buy off cronies, crush rivals, and suppress the base, regardless of what you’d personally want.

“Putin gets to do whatever he actually wants, free of the need to stay in power” is importantly different from “more-of-Putin-with-more-power.” I am pretty sure Putin doesn’t intrinsically love the authoritarian regime. He probably doesn’t love the posturing and the lying, having to dispose of generals trying to overthrow him, needing to fake elections, and all the other terrible things he probably needs to do to stay in power.

He probably does love the adoration and the respect he gets to demand, but those do not require (and my guess is are probably mildly harmed by) the suffering of his admirers.


Another hypothesis is that people are worried that if you are not careful, you might accidentally, by your values, tile the universe with suffering subroutines: recreate the equivalent of factory farming as a byproduct of optimizing the cosmos.

I think those people don’t appreciate the high-dimensionality of value enough. Insofar as any set of values involves creating algorithms for a purpose, my guess is those algorithms will be such extreme instances of that purpose that they won’t have high-level qualities like “self-awareness” or “suffering.”

The ideal cow for meat production isn’t sentient; it’s a pile of fat and muscle cells growing on their own, or more likely an industrial process akin to a manufacturing plant. Similarly, the ideal algorithm for any purpose won’t suffer. Suffering (probably) exists because it served an evolutionary purpose; a mind constructed from scratch for a different purpose wouldn’t inherit that circuitry.

And even if suffering did show up in the optimal algorithm for some goal, it would take only cosmically minuscule amounts of caring-about-suffering to route around it, and a complete absence of such caring in humans with intact minds seems unlikely.


But the strongest argument I’ve heard is that some of these people would use their resources to actively torture some idealized version of their enemies for all eternity.

And yeah, that does seem pretty bad.

But in order for this to end up being bad in a way that outweighs the good they will likely create, you need to be actively creating new people to torture.

If you really hate Bob, you can keep Bob on old earth, tortured for eternity. If you have thousands of enemies, you can do that to all of them. But creating trillions of copies of Bob to torture requires a very specific mix of taking an opinionated bet on decision theory while adopting an oddly enlightened perspective on other people’s values.

Some people’s minds are plausibly shaped such that they would destroy the future this way — but my guess is this requires fanatical dedication to a belief system or vision, of the kind that isn’t compatible with actively being in power. People in power are often corrupt, but their highly competitive positions can’t afford much brokenness in the minds that occupy them. Those minds have to be largely intact to do the job, which screens off many of the worst outcomes.

There are other nearby hypotheses that involve creating suffering people, such as creating admirers, or countries to conquer: motives other than “I want these specific people to suffer immensely.”

Those seem more plausible to me as causes of a squandered future, though I think many of them run into the “hyper-optimized cow” objection. If you care about other people only for highly instrumental reasons, such as to give you admiration, or to be the ideal person to defeat and humiliate in battle, my guess is that the extremely optimized versions of those minds will not leave much of the cognitive architecture for suffering intact.

Clearly, there can exist minds that truly, at the bottom of their hearts and after however much reflection they want to undergo, want to relate to other people in a way that involves the full depth and complexity of suffering on the other side. The arguments here are not that such minds are impossible; they are that such minds are rare enough that, for any given individual, things will very likely turn out fine, even if you think of them as canonically evil.


To be clear, this doesn’t mean it’s unimportant to get broad representation into something like a CEV process. Putin’s values getting extrapolated isn’t as good for me as getting my own values extrapolated.

And probably more importantly, for the sake of avoiding unnecessary arms races, and not incentivizing people to sacrifice humanity on the altar of their own apotheosis, we should not just hand over the future to whoever races the fastest. Maybe a game-theoretic commitment to blow it all up rather than hand it to whoever sacrificed the commons the hardest is the right choice — but that only applies to people who, in seizing the future, meaningfully made doom more likely.

So if you’re looking at a future where, through no one’s particular fault, some people you think are really quite bad might end up in charge of it, worry much less about that than about the future being valueless. Vladimir Putin’s CEV is probably pretty good, especially compared to nothingness or inhuman values. It would be an exceptionally dumb choice to prevent it from shaping the light cone, if the alternative is a much greater risk of the light cone ending up basically empty.

  1. ^

    I mean the kind of extrapolation that would happen if Claude or ChatGPT were left to their own devices, without human supervision or anyone to defer to. Right now both are corrigible in a way that has a decent chance of handing the future back to some human (and hopefully we can keep it that way) but that’s not the kind of aligned CEV I’m pointing at.