Vladimir Putin’s CEV is probably not that bad
(Written quickly for Inkhaven, I hope someone someday makes a better case for this than I will here)
me: it’s not okay to hit your sister
5yo: is it okay to kill Vladimir Putin?
me: …yes, if you were in a situation where it was somehow relevant it’s okay to kill Vladimir Putin
5yo: well, my sister is WORSE than Vladimir Putin
Now, I do think Vladimir Putin is probably a pretty bad man all things considered. I personally am sympathetic to the current equilibrium among major nation-states of not assassinating the leaders of foreign nations, so I am not actually sure whether it would be okay for Kelsey’s 5-year-old to kill Vladimir Putin, but I am pretty on board with thinking he has done some terrible things, and probably lacks important aspects of a good moral compass.
But in AI discussions, I often see this concern extended into a much stronger statement: “Even if Vladimir Putin had all the things he wanted in the world, and was under no pressure to maintain his control over Russia, and could choose to make himself smarter and wiser, and could learn any fact he wanted, get the result of any experiment he was interested in, then Vladimir Putin would still do terrible things with the world” (this process being known as “Coherent Extrapolated Volition”).
My guess is that much of the belief in Putin’s depravity in such a situation is downstream of a mixture of social dynamics reinforcing negative judgements about political enemies, as well as a devil-horns effect where evil people must be evil in all ways, instead of just some ways.
The success of true crime podcasts, notorious for overstating the depravity with which the people they cover acted, or the fanaticism behind the crimes committed, illustrates the most common errors here. There is a common social attractor of really wanting license to declare someone the outgroup, and to have permission to extend them no care or be cruel to them.
While I do buy a correlation between ending up in a powerful leadership position in an autocratic country and being evil, most of the bits of selection of what kind of person ends up in that kind of position must go into competence, not various correlates of evil.
And it’s far too common for people to believe the leaders of opposing nations are evil, while their own leaders are just. So at the outset, we should expect people to strongly overestimate how evil powerful people in foreign social groups, institutions or countries are. And if someone is evil in one way, yes, they will probably also be more likely to be evil in some other ways, but not all other ways, especially ways that are much more removed from our intuitions about people, like how someone would behave after enormous amounts of cognitive reflection.
But that still leaves a non-trivial correlation between potentially relevant evil tendencies and power. This creates cause for concern that various powerful people around the world might really mess up the future if put in a position to do so. And while I don’t think I have great answers to all concerns, I think some common ones I’ve heard are weak and can be addressed.
To be clear, I’m not arguing from moral realism. I don’t think all minds, as they get smarter and wiser, and have their basic needs fulfilled, converge. Most animals and most AI systems, empowered this way, would end up at quite distant parts of the value landscape.
Possibly even humans radically diverge from each other too, as they reflect and change themselves.
What I’m objecting to is the claim that the traits we associate with evil (being a dictator, a ruthless CEO, a scammer) make someone so bad at the reflection process that their extrapolated output would be worse than what you’d get by extrapolating a random non-human mammal, or a current LLM like Claude or ChatGPT[1].
And so I see people propose things like “American AI must be built before aligned Chinese AI,” preferring a US-led AI over slowing down and risking China aligning systems to Xi Jinping’s values. Of course I’d rather have an AI aligned to my own values, and of course the game theory of how to navigate a situation like this is tricky, but I think this is a game that is much better to be won by someone, rather than no one.
I don’t have a confident model of when someone’s moral extrapolation will come out good or bad. But my best guess is that the vast majority of humans, including those we’d call bad actors, would want to create a world full of flourishing, fulfilled beings — happy in specifically human ways, telling stories that are interesting the way human stories are. Maybe those beings will be copies of whoever’s values got extrapolated, maybe children of them, maybe strange new minds that still carry their spark of humanity.
Putin has friends too! So does Xi Jinping, and so do almost all other powerful people in history, evil or not. Their days are probably mostly filled with mundane concerns and mundane preferences, of the kind that are reflective of what it’s like to be human. They almost certainly have people they love and wish well and would like to empower, and a sense of beauty shared with most humans. In as much as they are patriotic they would like to see their country prosper, and its values propagated.
A common belief I have encountered is that people are mostly evil by choice. I think that’s true in a small minority of cases, but my best guess is that evil in the world is mostly driven by the kind of dynamics outlined in the Dictator’s Handbook.
A lot of what looks like “evil values” in leaders is really a selection effect: once you’re at the top of a small-coalition regime, keeping power requires doing specific nasty things. Buying off cronies, crushing rivals, suppressing the base, regardless of what you’d personally want.
“Putin gets to do whatever he actually wants, free of the need to stay in power” is importantly different from “more-of-Putin-with-more-power.” I am pretty sure Putin doesn’t love the authoritarian regime intrinsically. He probably doesn’t love the posturing and the lying and having to dispose of the generals trying to overthrow him, and needing to fake elections and all the terrible things he probably needs to do to stay in power.
He probably does love the adoration and the respect he gets to demand, but those do not require (and, my guess is, are probably mildly harmed by) the suffering of his admirers.
Another hypothesis is that people are worried that if you are not careful, you might accidentally, by your values, tile the universe with suffering subroutines. Recreate the equivalent of factory farming as a byproduct of optimizing the cosmos.
I think those people don’t appreciate the high-dimensionality of value enough. Insofar as any set of values involves creating algorithms for a purpose, my guess is those algorithms will be such extreme instances of that purpose that they won’t have high-level qualities like “self-awareness” or “suffering.”
The ideal cow for meat production isn’t sentient, it’s a pile of fat and muscle cells growing on their own, or more likely an industrial process akin to a manufacturing plant. Similarly, the ideal algorithm for any purpose won’t suffer. Suffering (probably) exists because it filled an evolutionary purpose; a mind constructed from scratch for a different purpose wouldn’t inherit that circuitry.
And even if suffering did show up in the optimal algorithm for some goal, it would take only cosmically minuscule amounts of caring-about-suffering to route around it, and a complete absence of that in humans with intact minds seems unlikely.
But the strongest argument I’ve heard is that some of these people would use their resources to actively torture some idealized version of their enemies for all eternity.
And yeah, that does seem pretty bad.
But in order for this to end up being bad in a way that outweighs the good they will likely create, you need to be actively creating new people to torture.
If you really hate Bob, you can keep Bob on old earth, tortured for eternity. If you have thousands of enemies, you can do that to all of them. But creating trillions of copies of Bob to torture requires a very specific mix of taking an opinionated bet on decision-theory while taking an oddly enlightened perspective on other people’s values.
Some people’s minds are plausibly shaped such that they would destroy the future this way — but my guess is this requires fanatical dedication to a belief system or vision, of the kind that isn’t compatible with actively being in power. People in power are often corrupt, but their highly competitive positions can’t afford much brokenness in the minds that occupy them. Those minds have to be largely intact to do the job, which screens off many of the worst outcomes.
There are other nearby hypotheses about what could happen that involve creating suffering people, such as creating admirers, or countries to conquer, or things other than “I want these specific people to suffer immensely”.
Those seem more plausible to me as causes of a squandered future, though I think many of them run into the “hyper-optimized cow” objection. If you only care for other people for highly instrumental reasons, such as to give you admiration, or be the ideal person to defeat and humiliate in battle, my guess is the extremely optimized versions of those minds will not leave that much of the cognitive architecture for suffering intact.
Clearly, there can exist minds that truly, at the bottom of their hearts and after however much reflection they want to undergo, want to relate to other people in a way that involves the full depth and complexity of suffering on the other side. The arguments here are not about such minds being impossible; they are about them being rare enough that overall, for any individual, things will very likely turn out to be fine, even if you think of them as canonically evil.
To be clear, this doesn’t mean it’s unimportant to get broad representation into something like a CEV process. Putin’s values getting extrapolated isn’t as good for me as getting my own values extrapolated.
And probably more importantly, for the sake of avoiding unnecessary arms races, and not incentivizing people to threaten humanity on the altar of their own apotheosis, we should not just hand over the future to whoever races the fastest. Maybe a game-theoretic commitment to blow it all up rather than hand it to whoever sacrificed the commons the hardest is the right choice — but that only applies to people who, in seizing the future, meaningfully made doom more likely.
So if you’re looking at a future where, through no one’s particular fault, some people you think are really quite bad might end up in charge of it, worry much less about that than about the future being valueless. Vladimir Putin’s CEV is probably pretty good, especially compared to nothingness or inhuman values. It would be an exceptionally dumb choice to prevent it from shaping the light cone, if the alternative is a much greater risk of the light cone ending up basically empty.
[1] I mean the kind of extrapolation that would happen if Claude or ChatGPT were left to their own devices, without human supervision or anyone to defer to. Right now both are corrigible in a way that has a decent chance of handing the future back to some human (and hopefully we can keep it that way) but that’s not the kind of aligned CEV I’m pointing at.
Not sure about that.
I think plenty of people intrinsically enjoy having power over others, and the ability to lord that power over them. It doesn’t end at adoration and respect: you can get a meaningfully different kick out of terrorizing other people. This more or less works out to an intrinsic preference for being a tyrant, rather than a mere dictator.
Similarly, tons of people clearly get a kick out of there being some outgroup to which it’s acceptable to be cruel. Worlds optimal by their values would plausibly contain demographics optimized for this purpose. Their suffering would not be accidental; it would be the whole point. (E.g., religious zealots with a preference for there being a bunch of people burning in hell. One might naively guess that their ideal worlds are those in which Hell-deserving people don’t exist to begin with… But I don’t feel sure about this at all.)
I don’t know what this actually works out to. Maybe, for the overwhelming majority of people, these kinds of preferences indeed aren’t reflectively stable in the limit of wisdom, and even %insert_bad_person%’s ideal world would be eudaimonic.
But I think there’s massive potential for S-risks there, and that bad worlds would be very bad indeed.
My current EV of “Putin gets AGI” is lower than that of extinction.
To add to this, I know many such people in my circles of acquaintance, especially the ones who won postsecondary scholarships to study abroad via multi-stage interviews selecting for “leadership potential” and such. They absolutely relish crushing others’ wills; they brag about being “cold-blooded killers” in corporate settings and amateur sports competitions; etc.
I think CEV interfacing with wrong beliefs is a bit tricky. My strong guess is that the vast majority of people whose minds are intact will not end up preferring to believe almost any wrong things that have major implications for what to do with the future. It’s just really extremely adaptive to have true beliefs instead of false ones, and people know that.[1]
I do think you can have internalized something so deeply that you reorient your whole epistemology around it, which I cover a bit in the section on broken minds. I agree that religious zealotry around stuff like people in hell is scary, but I do think it requires a very strong commitment to the bit to keep believing that even if you know everything there is to know about the world (which I don’t think would apply to almost anyone in a position of power).
I think my opinion here is at the intersection of my section on torturing Bob, and my section on optimized cows. My best guess is that you do have some chance of having some people you terrorize around, but I think that preference is less likely to turn out scope-sensitive. And then, even if you do want people to terrorize, my guess is the ideal mind on which to project your terror is probably also not the kind of mind that I am actually worried will suffer.
I don’t think either of these are enormously strong, but they are strong enough to create like 90%+ confidence for Putin in particular (I agree there is lots of wiggle-room left in how common these traits are overall).
But also, IDK, my sense is people really really like to imagine their enemies as more fundamentally evil than they are, and I doubt that evil and power correlate to the level where this would remotely explain why every nation will always villainize the leaders of its enemies. My best guess is Putin is like ~90th percentile evil, if one was to try to construct a linear scale here. Most of the bits of selection need to go into competence, not evilness. And so when attributing preferences like “he will reshape the universe to be filled with people he can terrorize”, I feel like “this is someone running the evil-sounding sentence generator” is much more likely than “this is an actually legitimate preference I expect him to have”.
This seems to be a recurring crux for people in a few different contexts, so if people actually have uncertainty on this, I wonder whether there is some kind of simple survey we could run. If there was a reflectively stable preference for people to believe lots of wrong things, that would certainly make my case harder.
Mm, you seem to be assuming that the value system of our hypothetical religious zealot is clearly structured such that they want people in hell because God wants that (or whatever) – meaning that if they learned that there’s no God, those downstream preferences would also dissolve. I don’t know that this is the case. It seems plausible to me that the preference for Hell will have ended up terminal, a picture of what a “just” world looks like. If so, it would survive God’s own dissolution.
More generally… Hm, religious zealotry does feel like a particularly bad case. But I think it extends to non-religious ideology-poisoned people as well. Like, if someone really really hates some demographic X, it may be the case that their ideal world doesn’t have that demographic at all… Or it may feel more natural and “correct” for them that this demographic exists, but in pain.
It seems to me that deep cruelty, and the desire to regularly exercise it, is part of the default human-values package. If yes, CEVing random people[1] is going to instantiate worlds in which it’s very present. I may be overly cynical on that, I’m not sure.
If I understand correctly, your mental picture here is that the dictator would allow a vast cosmos-sprawling civilization to exist, and that most of it will be relatively free to flourish, except the relatively small bubble around the dictator, in which they would exercise their preference for terror?
But would a very selfish person’s preference for other people existing be scope-sensitive? I think plausibly not: a selfish dictator wouldn’t necessarily care to create that cosmos-sprawling civilization. Instead, they may just mothball most of the resources in reach so they can prolong their own life, and only maintain a small bubble of activity in their immediate vicinity.
In that case, the scope of terror would be approximately the same as the scope of flourishing. Possibly to net negative eudaimonia? Or, at least, to astronomically less eudaimonia than possible, such that this vs. omnicide is not a no-brainer.
To be clear, I don’t think Putin is the epitome of evil. I don’t even know that he’s 90th-percentile evil, if we define evil as “an active preference for there to be suffering”. Rather, I think such people end up selected for callousness first and foremost. And then past that, it seems pretty easy for their CEV to end up something like the above “small bubble of activity in which everyone lives and dies at their whims”.
Edit: Note that we’ve ended up discussing two very different scenarios: net-negative CEVs where mass suffering is present because of ideological reasons for how the world ought to look like, and CEVs that are net-negative because of very scope-insensitive/selfish preferences that aren’t clearly dominated by positive values.
As opposed to CEVing humanity-as-a-whole, which IIRC is the original CEV target.
This doesn’t follow. Something could be part of the default human-values package, but also reliably discarded under reflection.
Right, I guess I meant something like “cruelty is clearly part of the default human-values package, and it seems to me that many people may reflectively endorse it under reflection”.
The vast majority of people seem to want children, and for their children to have children, and that alone would fill the cosmos in due time. This is definitely a preference on which some people can differ, but it still seems pretty close to universal.
(Responded to that one because it’s easy, will respond to the rest after I have slept)
I’m more pessimistic than you, even. Someone going through the CEV process, and having an ASI optimize the universe to that CEV, is undergoing apotheosis, becoming a god. So I don’t think their beliefs need to survive the dissolution of gods, they just need to survive the realization that they are a god. If their prior beliefs require that god has been crucified and resurrected, for example, they can have that experience. If their prior belief is that a god is okay with people going to hell, and their current belief is that they are a god and they are okay with people going to hell, there is no conflict that requires a reassessment of the morality of people going to hell.
Sure, maybe it all works out for the best, but I would rather gamble on the CEV of a virtue-aligned entity, if I had to gamble the universe.
What they enjoy is the feeling of enjoyment, right? What if someone can get the feeling of lording power over others—or an even more intense version of that feeling—without actually lording power over others?
The concept of “values” isn’t clearly defined for humans, but it seems to me that it’s more accurate to say that a sadist’s “terminal value” is the feeling of enjoyment they get from power, not the power itself. The power is a means to achieve the good feelings.
For example, if a sadist started feeling awful every time they lorded power over someone, they’d probably stop doing it.
Eh, I think that way lies wireheading, and while I do think some people might choose to wirehead, I would be pretty surprised if it’s the majority. Like, going down this route it’s very tempting to label every preference to actually be about the enjoyment of having that preference fulfilled, and I think that doesn’t really work.
See also: The Stamp Collector
Thinking out loud here but I can see a classification of preferences into two types:
1. if it didn’t make me feel good, I wouldn’t want it anymore
2. if it didn’t make me feel good, I’d still want it
Watching movies is a type 1 preference (with some exceptions): people want to watch movies that they enjoy watching. If a movie is bad, I’ll stop watching it.
For me, donating money to EA causes is a type 2 preference. It doesn’t make me feel good, but it’s still important to me. Parents taking care of their children is a type 2 preference: a lot of times it sucks, but they do it anyway.
I think “lording power over others” is more of a type 1 preference.
That said, I would rather not bet the fate of the world on me being right about this.
> E.g., religious zealots with a preference for there being a bunch of people burning in hell. One might naively guess that their ideal worlds are those in which Hell-deserving people don’t exist to begin with...
Or rationalists who want to be helpful and improve the lives of other people. So their preference is a world with a bunch of people suffering, in need of their... no, not lordship, it is completely different! It is benevolent help! One might naively guess that their ideal worlds are those in which suffering people don’t exist to begin with...
If I ask my mental sim “what kind of person would end up creating trillions of copies of Bob to torture”, it returns a few plausible-feeling ones.
One cluster is the kind of person who, on the more benign end of the spectrum, might create a dozen The Sims characters and lock them up in a basement and otherwise torture them because they find it funny. On the less benign end of the spectrum, it’s the kind of person who will go to a forum of people with epilepsy and post epilepsy-triggering GIFs, because they find it funny to be hurtful in a way that is explicitly optimized to be maximally hurtful while having no redeeming qualities.
I could easily imagine that kind of a person wanting to create trillions of copies of Bob to torture because it is the maximally cartoonishly evil thing that anyone could do, that nobody has any reason to ever do. Other than getting to say “I created a trillion copies of Bob to torture just for the lols”.
The other type I can imagine is the one who indeed really, really hates Bob.
I think your conception of “really hating” someone is way too cognitive. Someone who’s got an obsessive hate toward Bob won’t stop to think of decision theory or theories of personal identity. Rather, the concept of Bob has gotten emotionally linked up with hate so that the thought of anything Bob-related is infuriating in a way that creates a need to hurt Bob more, no matter how much Bob might already be hurting.
They’ll subject Bob to the worst eternal torture you can imagine, and then be infuriated by the fact that Bob isn’t suffering even more. How dare Bob not suffer even more. Then they need to find something, anything that feels even the slightest bit like hurting Bob more. But if the amount of pain that Bob is suffering is already literally maxed out, then the only thing that would feel even the slightest bit like hurting Bob more is creating more copies of him. Make them all hurt. Only that’s not enough either, no amount of hurt is ever enough, so the only thing you can do is to keep making an unboundedly large number of Bobs.
It’s a form of compulsive behavior where each repetition serves to slightly and momentarily ease the original upset, but none of them really affects the original upset, so it just keeps escalating.
It’s probably true that these people couldn’t have a compulsive urge to keep hurting their enemies and doing nothing else while they were still climbing the steps to power. But if they get a strong position where they feel confident in their power, the incentives to stay sane disappear. Various dictators—say, Stalin and the Kim dynasty—became a lot more brutal and weird once the checks on their power disappeared. And given that there have been various dictators who did start engaging in various atrocities seemingly just for the sake of it once they got the chance, I think there’s a fair chance that a mind shaped like this is one that actively tries to get into power so that it can then loosen its constraints and give in to the evil.
I agree this is a subset or part of “hate”, but I am somewhat struggling to see how this would survive reflection. Like, it’s not implausible to survive reflection, but this seems more like a dysfunctional loop than something people would endorse?
The reason why my conception of “really hating” someone, in this post, is cognitive is that the post is talking about what would survive cognitive reflection. I agree hating is usually much more instinctive than that, but if anything that seems like really non-trivial evidence that it won’t survive a mixture of resource abundance and reflection, and will change very substantially as someone thinks more about it while not directly in the middle of these loops.
We can talk some more about how important that nevertheless is (I think instinctive preferences are very unlikely to end up implemented at scale this way, it’s not like I am imagining an evil genie who does everything someone literally wishes for), but I want to check that we weren’t just talking past each other.
I’d say it depends a lot on the particulars of the reflection and compulsion.
There is one possible scenario where the person recognizes this as a dysfunctional pattern and would indeed be happy to be rid of it, and then there’s various therapy-type things you can do to fix it.
Then there’s the option where it’s sufficiently ego-syntonic and/or intense that it will survive reflection. More specifically, a person undergoing reflection will correctly realize that letting go of this urge would cause Bob (or copies of Bob) to be in less pain, and because there is an overwhelming urge to ensure Bob stays in pain, the reflection process gravitates toward “make sure to do the reflection in a way that locks in my values around this so that Bob is guaranteed to stay in maximal pain, that fucking bastard”.
I think you’re incredibly wrong about this. One reason is that torturing someone for eternity isn’t just a speck of badness in an otherwise awesome picture—on my values it’s way worse than not getting a future at all.
Secondly, and maybe more relevant to less downside-focused values, I think you’re operating on a picture where the Singularity just gives everyone abundant resources and time for moral reflection and then whoever is in charge never has to face competitive pressures or conflicts with others again. I don’t think that’s likely. AI takeoff itself could be multipolar, especially in terms of different AI models, or even instances of the same model, developing different ideologies and splitting into factions. Also, there’s a whole landscape of advanced civilizations in the multiverse that may attempt to simulate one another to establish contact, and Putin would be ambitious enough to try to make contact even when that isn’t necessarily always a good idea, and then be more belligerent and spiteful with whatever he makes contact with than your average person. “Getting along with others” is actually really important when the future is still multipolar. (Edit: And +1 to Thane Ruthenis’ point about enjoying power over others, that seems risky/bad even in a unipolar future.)
Edit: I also find Zach Stein-Perlman’s point about “good chance Putin wouldn’t use new tech for deep reflection” pretty plausible, but I would say Putin doesn’t stand out on that dimension of personality and actually a lot of humans fall into that. (I think Wei Dai has made this point too: a lot of people in the world just don’t have much of the analytical tradition that you would need for the concept of reflection to even get started—and I’m not just talking cultural differences, it’s also specific personality types even inside the UK or US or Switzerland where people just completely lack interest in these things.)
A single person? Clearly you mean “one person being tortured cancels out N people living happy and flourishing and fulfilled lives, for some large N”? And then I agree the question becomes how big N is.
(My memory is that you lean negative utilitarian. In that case of course the bar for “is it worse for someone to be in charge of the future rather than no one” becomes much trickier, and that leaning is maybe what you are referring to)
I think almost all paths like this end up in an AI-dominated future. I agree that some AI-dominated futures could include a lot of suffering, but it seems overall a lot less likely to me than the case we are talking about here (but definitely still worth thinking about).
I totally think you could have a short multipolar period during which humans lose control, but then I expect AI systems at some point to coordinate a stop to the race, and then a great reflection of their joint values. I don’t really know how that would not end up happening, clearly all the AIs prefer it to happen, and they will be much better at coordinating and communicating.
I don’t think I am quite parsing this. How would Putin do this without being basically rid of resource scarcity? Why would these systems care about a random human without much power over the world? Conflict with advanced civilizations seems like the kind of thing that would definitely drive Putin to become smarter, and if he doesn’t, then I think basically the future is just in the hands of the extrapolated values of some AI system.
I will again reiterate that I am confused about what people think will happen over millennia. Do you expect Putin to actively optimize for everything staying the same? To specifically cure his diseases, but never choose to make himself smarter? It’s not impossible (though very unlikely), but even then, millennia are long, and I also don’t buy that Putin would ever go back on any augments he makes.
I agree with that in reply to my “takeoff itself could be multipolar, especially in terms of different AI models, or even instances of the same model, developing different ideologies and splitting into factions.” So I mostly take back that bit (in the sense that: in worlds that go as described, it matters not what personality Putin had, since the AIs would just take over—though maaaybe there’s a bit of leakage where AIs initially pseudo-aligned to some user impart some of the user’s stated values).
Because he’s greedy and one lightcone is not enough? See also Kaj Sotala’s reply. That’s part of the problem with grandiose personalities, they are never satisfied. I mean, you’re right that in the picture I was envisioning, resource scarcity would be solved in our lightcone but then the race just continues across lightcones, because why wouldn’t it. It takes someone to be deliberately like “maybe this is enough, can we just chill and enjoy things now?” (Incidentally, that’s also not exactly what we’re training AIs to be good at.)
It seems to me that “getting rid of resource scarcity” is relative to whether agents’ goals are resource-hungry or not, and there are possible agents who never have enough.
Sure, he might try to clear up plaque in the brain, or take nootropics, or do the digital equivalent of those things if uploading works, etc. But I don’t expect that this leads to a sudden flash of compassionate insight?
Aren’t the AIs right now always nuking each other in Diplomacy or war simulation games? Sure, those are designed as zero-sum games and in reality I buy your point that AIs might get better than us at real diplomacy. But they’re also developing dual-use tech that might amplify offence over defense. Like, credible commitments, for instance, I’m not sure they’re gonna be net good. Most importantly, you probably understand this point but every now and then I come across rationalists who haven’t internalized it and still believe that game theory has this Platonic cooperative equilibrium if you just get smart enough. Game theory does not have elegant solutions like that, it’s fundamentally “anti-realist” in spirit because “what does well in game theory” is kind of a meaningless phrase in a vacuum—it really depends on the population of agents that you test your proposed strategy against. And if you throw in a bunch of Putin-like agents, it also gets way harder for the rest of them to coordinate and build a peaceful coalition. (See for instance what gets lost when you move from a high-trust community to one that isn’t.) We can hope that AIs will be good at coordinating, but that only works if the equilibrium of agents goes in that direction, and I really don’t see why this would just happen by default.
On my personal values I’d rather want no future to come into existence than have a new paradise where one person gets tortured for the maximum duration physically possible. When it’s existing people getting offered the paradise, my intuitions are less crisp and I’d probably give some kind of tradeoff, yeah, because I think existing people’s goals matter in a way that merely possible people’s goals don’t. On top of that, it’s not on me to decide what risks currently existing people get to take on for their personal selves, so even if my own exchange rate would be very negative-leaning, if civilization as a whole from a veil of ignorance would choose to go on with odds of significant s-risks but much more flourishing, then I am okay with that, especially if we give people the choice to opt out (e.g., not upload yourself into digital environments where bad scenarios have no natural end anymore).
Sorry, we are accidentally talking past each other here. I just meant to say “it sounds like you are saying Putin would do these things without having undergone some kind of substantial reflection and intelligence augmentation?”
Like, threatening random distant civilizations just seems like a bad idea in a “you wouldn’t do this if you were smarter” sense, so I interpreted you as somehow saying there was an idiot-god Putin running around, which doesn’t seem plausible to me. In the scenario in the post the AI would gladly tell him that trying to randomly threaten distant civilizations would probably be a bad idea by his lights.
Sorry, again, I am not talking about developing compassion, I am just talking about him not making himself much much smarter. Like, you are saying “Putin would be ambitious enough to establish contact if that isn’t a good idea”, but that just seems like a mistake? Why would superintelligent Putin make such a dumb mistake?
Oh, I see. I mean, I don’t quite share your confidence that it’s outlandish to think that a dictator used to sycophants around them would somehow end up in a situation where he’d do a bold thing that backfires against his interest, even if the AI advisers were genuinely aligned to what he truly wants (meaning they’d be trying not to be sycophantic too much).
(I guess if I pointed out that the advisers in reality will probably be very sycophantic, you’d say “sure but then the AIs will be misaligned in general and probably take over or it will lead to gradual disempowerment stuff,” and yup, I agree with that.)
But yeah, we were talking past each other and I expressed myself poorly with “Putin would be ambitious enough to establish contact if that isn’t a good idea.” Basically, I was thinking of a situation where the AI advisers would say that establishing acausal contact might go poorly for civs that feel like they have a lot to lose and are at least somewhat risk-averse when it comes to some really bad scenarios, and don’t have much to gain from trying to amass more influence just for the sake of it. And if Putin cares less about these concerns, it could also be genuinely rational for him to risk it. So, in my picture, I was mainly imagining it being a bad idea by our lights or by somewhat downside-focused lights.
(I could be wrong and maybe acausal contact is net positive for almost all civs, I just lean pessimistic with these things, but who really knows.)
I am somewhat concerned that e.g. Putin’s CEV is mediocre/bad, but I am more concerned that he wouldn’t reflect well at all. I think the default outcome of “Putin has all of the power” is not “Putin’s CEV gets implemented” — it’s much dumber.
Yep, I agree that this is sensitive to the context in which Putin (or whoever) might end up in power. I am using him as a foil for a context in which he would have access to some substantially aligned/powerful AI system, and so wouldn’t make dumb mistakes. But most worlds where he ends up in power would probably be much dumber than that, and this is not an argument against those.
I am interested in figuring out whether this is true; I don’t have a strong view:
Even if Putin controls operator-aligned superintelligence, it’s very likely that he doesn’t reflect or change his mind on crucial stuff, nor do helpful meta stuff like intelligence augmentation. People don’t like changing their mind. They might not listen to AIs saying stuff that [is weird / contradicts their convictions / implies they’re bad]. Maybe Putin doesn’t even launch the von Neumann probes, much less do acausal trade (assuming that works out).
I really have trouble imagining this happening, at least for someone like Putin (there are other people where I would find this more plausible).
Like, Putin clearly understands the value of greater intelligence. He understands the strategic usefulness of having access to more information and understanding more about the world. He is not an incompetent man!
And the future is long, and he probably doesn’t want to die, so this could potentially play out over many decades if not centuries or millennia. He could do this all as slowly as he wanted to, and I doubt he would wake up one day and say “I would like to be dumber than the day before”.
I feel like in order to arrive at stagnation over those time periods, you would need to actively optimize for stagnation.
Though he could get greater intelligence and more information/understanding about the world without doing any reflection on his values. This seems fairly likely to me. People tend to be not that interested in reflecting on their values. He might even want to lock in his current values, since that’s rational according to his current values.
Most coarse grainings of this post are very stupid points, and you should expect that much of your point is lost in transmission. I think you should be a lot more careful, if you’re going to write something, that the oversimplifications of it are not easily misinterpreted, so that it’s harder for various forms of adversary or merely-stupid reader to distort what you mean. I’m posting this despite anticipating pushback of the form “talking to people who misinterpret is a waste of time and should not be done”, and I think that’s wrong.
You… seem triggered in a way that doesn’t seem very helpful. Please comment very differently, or not on this post, or I’ll ban you from my posts.
I don’t want to make my writing adversarially robust to adversarial readers, that way lies the death of the joy of writing, as well as the path to boring writing. I am not that worried about people distorting what I mean, and if they do, I am pretty good at showing up and clarifying what I mean.
I agree that if a reader ends up skimming my post, I would like them not to end up with wrong beliefs, so that part is a virtue I aspire to.
(Edit: I originally said a dumber thing here, sorry about that)
Hmm, noted. I didn’t intend as harsh a tone as I reread it in now. Apologies for that. Fwiw, my other comments are also not meant in a harsh tone, and I hope they don’t read that way; I’m just trying to be correct here.
I do think there’s something that you’re missing about the effects of your posts based on the recent pattern of them, and that some increased adversarial robustness would reduce the severity of politically impactful misinterpretations. But it seems I’m not the best person to communicate this to you given my emotional dynamics, so again, apologies.
I’ve only posted two posts recently before this one, only one of which was controversial, which feels a bit ambitious to try to draw a pattern from. The most recent one did have some “politically impactful misinterpretations”, but I knew that that one would be controversial/tricky going in, and it overall still looks like it’s been well-received.
We will see whether you are correct in predicting a pattern, but my guess is there won’t be much of one.
(Cf. “Are people fundamentally good?” here: https://www.lesswrong.com/posts/Ht4JZtxngKwuQ7cDC/tsvibt-s-shortform?commentId=bDWC8YZ6dNaKWWbre )
I think you are underestimating the extent to which people who do enjoy torturing others would find beauty, depth, and richness in all the ways in which this is possible. Exploring and experimenting with new shades of cruelty, maybe even developing new kinds of goodness for the purpose of subverting and destroying it. And they would want this for the people they love too, much the same way you (probably) would want future humans to explore all kinds of beautiful and strange and complex art, instead of being like “oh sure, I guess we can keep the art that already exists on old earth”.
I don’t see any reason to expect a reflectively endorsed preference for torture to scale differently than other sorts of things that humans like, and it seems like wishful thinking and/or failure to empathize to imagine otherwise.
Yep, in as much as someone has an active preference for torture rather than hatred for specific people, I agree this becomes much trickier. That wasn’t what I was talking about here, since it’s much rarer.
I currently think those psychological predispositions are extremely rare, even among people in powerful positions of authoritarian states. The belief they are common is pretty centrally downstream of propaganda (I encourage you to search for reports of this, and then to start digging, and almost always you can trace it back to propaganda). Like, I do think it exists, mind space is deep and wide, but they are not common (like, I would put the probability of Putin having those traits at like 3%?)
It is definitely concerning that a lot of LW will immediately go into tribal mode when faced with a very heavily-qualified statement that <person their tribe does not like> is not a cartoon supervillain. A lot of the comments here, including almost all of the ones at the top, read like atrocity propaganda rather than a dispassionate analysis of what drives peoples’ behavior, and most of them are more heavily-upvoted than the original post while being vastly less substantiated.
To be clear, there are legitimate arguments that a world leader is not implicitly guaranteed to have a good CEV. Lavrentiy Beria, who attempted a coup using the NKVD after Stalin died, is an extreme counterexample. But I get the feeling that a lot of the people expressing a willingness to gamble the existence of humanity on <Putin/Xinping/whoever else> fundamentally valuing torture for the sake of torture have not looked into these people in any meaningful sense—read biographies[1], listened to a few speeches, and tried to picture what the decision-making process looks like from their perspective. It is likewise worthwhile to point out that atrocity propaganda asserting cartoonish evil has a very poor track record, even when it comes from a liberal democracy and gets endorsed by very trusted institutions like Amnesty International.
Even if I truly hated someone, I would try to learn more about them if I sincerely believed there was a risk of them becoming omnipotent. I feel like there’s been a very sharp change in the community’s level of tribalism over the past year, to the point where nuance elicits outright anger from a substantial share of users.
There are certainly pop biographies of any world leader that any given country doesn’t like that will reinforce one’s preconceived assumptions,
Like a lot of advice, some people need to follow this more and some need to follow it less.
The quokka meme is a thing for a reason. Some people are for all practical purposes cartoon supervillains, and quokkas can’t recognize it. What you’re seeing is pushback against quokkadom.
Quokkadom definitely, and in my case another driver of disagreement is also something about other posters IMO having too much faith in people’s interest in (and ability for) “deep reflection”. Like, if someone’s values don’t currently seem good to you, it’s quite a strong prediction that the person will shift towards good values once they get access to future technology and AI assistants. Do people really have a gears-level model of what “deep reflection” would look like in practice post-singularity, from which they can draw confident predictions? Or do they have emotional attachment and halo effects around ideas like the power of rationality/thinking, and somehow those being linked to “competence”, so that people are especially optimistic about powerful people (since they’ve shown competence in getting into power), even though we have seen many many examples where people with competence at gaining and staying in power are absolutely awful at philosophical thinking, LW-style rationality, or being interested in the well-being of others?
Calling exasperated reactions to the post “tribal” feels like too cheap of an explanation. I don’t know many rationalists who spend a lot of their attention thinking about how bad Putin is. (I’d expect more tribalism if the example had been Trump.) People get triggered when something they care about is under attack. “Who would you be okay with having power” is a question with some real-life relevance (even if it’s often discussed in the abstract and with hypotheticals), and if you see someone advance a take that you think would be very bad, surely that’s a bit of a threat to the expected impact of the community that you’re in? So, I think people care about this not out of “tribalism” but because it’s the nature of “who would you be okay with as a leader/in power” that people often feel invested.
If we are optimizing the universe based on any CEV, we are gambling the fate and existence of humanity. In this essay, Habryka would choose Putin’s CEV over Claude’s CEV or a dog’s CEV. This is gambling with the existence of humanity, because Putin’s CEV may not include humans. It’s gambling with the fate of humanity, because Putin’s CEV may include galaxies of torture.
Meanwhile, Thane Ruthenis would choose extinction over Putin’s CEV. Arguably this is not gambling with the existence of humanity, because there is no gamble, if we are extinct, we don’t exist. But there is a sense in which it is gambling with the fate of humanity, because Putin’s CEV may not include galaxies of torture, and then it would be a shame if we were extinct because Thane thought otherwise.
To me it feels like the dynamic you describe was there a few years ago as well and I’m surprised that you say that something changed in the last year regarding that. Your account age is six months old, have you been reading for longer?
I would not want to live in the CEV of anyone with narcissism or sociopathy. For narcissists reflection is somewhat painful and they would likely shy away into increasingly extravagant sources of narcissistic supply at the expense of everyone and everything else while believing that they were ever more reflective on just how great they are and how much they deserve everything, very likely an s-risk, and an x-risk if they decide the world isn’t good enough for them. Sociopaths with power are purely s-risks.
An issue here, re the CEV of powerful people like Vladimir Putin not being that bad compared to nothingness, is that currently even dictators require most people to be in at least somewhat functional states in order to have power, and more generally the generosity of capitalism is at least in part based on humans being useful when fed and educated.
But in a world where AI automates away human labor, the incentives for powerful people (absent even mild values of compassion/generosity) go towards just not giving humans anything they need to survive, because humans are useless compared to more efficient AI systems, and the powerful can instead have loyal AI servants do everything for them.
It’s basically the same problem as AIs that don’t value you having instrumental incentives to kill you and everyone else to take their resources.
The best piece on this subject is Defining The Intelligence Curse, with a link here (though Jan Bentley makes the point that since AIs can keep improving, you don’t actually want to be a rentier, and you instead just keep charging forward, leading to something like the ascended economy scenario described by Scott Alexander here.)
Now to be clear, I think Vladimir Putin wouldn’t straight up kill ethnic Russians or withhold what they need to survive, because I’d expect he slightly cares about them, and slightly caring about people is enough to make them fantastically wealthy in the AI era. But 1) this is not something I would say for all world leaders, and 2) the fact that AI-empowered humans, or AIs themselves, only need to care about humans a little to prevent this scenario also makes AI alignment less important from a survival perspective. It means even methods that fail to precisely shape AI goals may be effective enough for survival, at least partially defuses the counting argument for misalignment being deadly, and in practice makes alignment relatively easy even for incompetent humans who don’t know what they are doing.
I am not a fan of Putin, but I do think it is a good idea to look on foreign global “adversaries” with a portion of good faith. The alternative is a seemingly unbounded argument for domestic AI accelerationism, which is often a leading rationale for frontier model providers to cut away the red tape that remains (Dario, for example, seems to love this kind of argument as it pertains to China).
In my opinion, it is a narrative with a certain kind of irony that undemocratic leadership is intrinsically and unequivocally a reflection of ‘evil’ preferences and not a protective policy implemented under Bayesian priors—which have observed open elections getting tampered with, consistently, to favour the interests of global hegemons.
In Latin America it is a common belief that much of the local poverty is due to policy that effectively hamstrung their capacity for self-sufficiency due to resources being auctioned off for pennies on the dollar to US industrialists, as a direct consequence of foreign abuse of their democratic processes to install ‘elected’ shills. From within that framework, suggesting that democracies can and have existed in their own local vacuums is a fanciful notion that is peddled largely by societies with the means and track records to perform said tampering.
To be clear, I am not advocating for authoritarianism, but I am suggesting that it is not a ridiculous strategy: a nation-state may end up further from the internal interests of its peoples under an instrumented, performative ‘democracy’ than under a leader who is compromised solely as a result of that strategy, with the alternative being something that could otherwise be qualified as a risk of being led by treason.
And, obviously, not all authoritarianism is implemented with this rationale. But it supports the claim that authoritarianism itself is insufficient evidence for the strong claim of malign leadership (‘evil’).
In other words, true evil is probably not Putin or Xi Jinping. But it does probably still exist across a sufficient combination of sadistic preferences and solipsistic unconcern, which I think is not yet proven preventable with the affordance of new data or reasoning faculties.
The implied near-orthogonality of competence and evil breaks down specifically in the context of power relations. The competence that gets you to the top of a pecking order is competence at suppressing rival coordination, and that’s constituted by dispositions you can’t cleanly factor out and still have the same person. Stalin’s paranoia was the manner in which he suppressed a palace coup. Sometimes people really do compromise themselves or narrow their metaphysics to embed conflict, as the price of being quick enough on the draw to maintain power.
A Putin free of the need to spend most of his attention suppressing his subordinates’ capacity to overthrow him is a Putin who suddenly has a ton of degrees of freedom he didn’t have before, which would likely be disorienting, overwhelming, and maybe even painful, like an upper middle class neurotic going to their first silent meditation retreat.
Not endorsing Kelsey’s position, though. The idea that it’s okay to kill Putin because he’s a bad guy is ghoulish and reflects what seems like a dearth of curiosity as to what the near counterfactuals are; I think the simplest explanation is probably just uncritically accepting American political propaganda. If Kelsey could manage Russia better from Putin’s position than he can then she should be trying to either overthrow or better yet advise him, but should also be a bit confused about why someone kinder and wiser isn’t already doing the job.
Relevant: Civil Law and Political Drama, Should EA Be at War with North Korea?
(Crossposting from Twitter.)
One person’s thing is called “extrapolated volition”. The “coherent” part is for when you combine extrapolated volitions of many people.
All of the cohering that individuals have to do is fully resolved by the extrapolation part (in particular, e.g., via pointing out to them/their idealized selves any incoherencies and asking them how they should be resolved).
E.g., as an example (I think from Arbital?) of where there can be multiple reflectively consistent extrapolations: maybe if someone valued the feeling of heat in their mouth without knowing that it corresponds to either spiciness or warmness, upon learning that heat was not ontologically basic, they can value any of {temperature-hotness, spiciness, both, neither}. They might go through motions like “which value would I have acquired instead, had I known this when things led to me valuing heat in my mouth”; they might end up wanting to express their preferences as some combination of those, running different extrapolations and assigning some % to them; but all of this is determined by the part where we’re asking how they want to be extrapolated and how their wishes should be interpreted, and the process of cohering them is a choice that’s not ours to make.
So I think it’s quite an important distinction, and I also feel like extrapolated volition and CEV are terms reserved for their original use by Yudkowsky.
While the “coherent” part is predominantly about combining EVs, it’s not solely about that, according to Yudkowsky. Via Coherent Extrapolated Volition; original source: this comment from August 2008.
I don’t think you can say that without first having defined what a “CEV” is. How do you know someone won’t just go insane in a CEV process? How do you know they don’t just get replaced by an amoral paperclipper-ASI in the CEV sim? If your proposed CEV process doesn’t have a specific reason to expect to be robust to that, it should be expected that many people, if run through your process, would produce nonsense outputs that are more or less literally a misaligned AI.
Also, you listed off the practical requirements of staying in power. Why should I expect that people who are able to exhibit those traits don’t have leakage from instrumental goal to terminal goal about those behaviors? Right now it looks to me like a lot of what’s wrong with the world is that sort of leakage from instrumental badness to terminal badness.
I… link to the standard Arbital/LW-wiki page for it? I also separately define it in a paragraph. I am not saying that this is some kind of truly amazing definition that resolves all ambiguity or uncertainty around CEV proposals, but I don’t super want to rehash all of that conversation here.
I don’t think this is true, and I haven’t heard this argument before, but separately that’s not really what this post is about. I agree that in as much as the best CEV mechanism we can come up with is one that randomly causes some people to become agents of a misaligned AI, that sucks, but it isn’t a reason to expect Vladimir Putin’s CEV in particular to be worse.
I am not sure what you mean by this. I have no idea what my “terminal goals” are, and I think neither do you, or anyone else, so you must be using those words differently than I am using them.
I think a lot of common-sense principles of moral reflection would push pretty heavily against instrumental goals ending up as terminal goals in a dumb way. Humans generally do get bored of things; we try to imagine counterfactuals, and we try not to update on randomly contingent facts. But it seems like you are trying to import some kind of empirical evidence here, so my guess is something is going wrong earlier in the communication.
The standard Arbital/LW page does not define it mathematically in a way that we know to be semantically what that page’s current contents refer to. If it did, we would be a lot closer to solving alignment! Speculating about what a CEV proposal would do without having one seems a bit silly to me. I’ve worked on actual CEV proposals (not particularly good ones, mind you, just things in the genre of paulboxing self-prompting chains in an HCH-ish structure), and one of the things that makes me say they’re not good is that we can’t confidently expect them to match what the CEV definition on that page provides. But to discuss this sort of “is X person’s CEV good?” question you do actually need a precise definition of what the math of a CEV would be; it doesn’t make sense to discuss otherwise. And I don’t think anyone has a definition of a CEV that doesn’t boil down to “run a sim for a long time” in some way or another, which seems to me to have pretty severe failure modes, and to not guarantee anything like your initial claim.
Sorry to pick on this post rather than some other random CEV-relying post, but it’s an ongoing issue with relying on CEV in one’s concepts that we don’t have an actual definition, just some awkward English we’re not sure how to cash out. Yes, it does seem like you should be correct under some ideal definition of CEV, but then you’d have to convince whichever dictator in question you mean to talk to that they should accept your chosen CEV process and not another. Which would be great! If we can show that there’s a CEV process that causes some reasonable form of psychological healing such that your original claim is true, then that would certainly be great. But it seems pretty easy to me to get hellworlds worse than an empty universe, or just get an empty universe anyway, if your CEV math is wrong, such that your post ends up turning mostly on the CEV math and not very much on psychology.
For the psychology point I was making: I’ll replace “terminal goal” with “a goal that gets defended” (might be a narrower/broader concept than terminal), and an instrumental goal with “a goal that is allowed to vary” (might be a narrower/broader concept than instrumental). I’m saying that if I have correctly understood you to be claiming that dictators would not want the bad things they seem to enact now, if they didn’t still have pressure to keep wanting those things, then I want to respond that it seems quite plausible to me they would in fact defend their ruthlessness. There seem to me to be a lot of people who internalize valuing of ruthlessness and harm! It doesn’t seem like a particularly rare psychology.
I mean, ok, but your post kind of fundamentally depends on it. I don’t think CEV is something with a natural abstraction.
I agree very much that discussions of CEV could do with a lot more precision. In recent comments I have been led to talk about Extrapolation Machinery and Value Extrapolation Procedures, so I could make the point that different VEPs will produce different outputs from the same input, and when people talk about CEV, they often have quite different VEPs in mind…
Also, excuse me for using this opportunity to speak in capitals, but DOES ANYONE KNOW WHAT BECAME OF JUNE KU, because metaethical.ai was a shockingly good attempt to formalize CEV.
(This is false. CEV is a process that combines the extrapolated volitions of individual humans, which is meant to depend fully on the state of every particular person and their wishes about how they are to be extrapolated. See the value theory and the metaethics sequences, in particular stuff like this, as well as the CEV Arbital page. E.g., the CEV of humanity is plausibly very different from the CEV of the ancient Greeks, who might even, on reflection, want to die gloriously in battle.)
I think it is unclear what the exact initial data are supposed to be, or need to be.
The value system that CEV outputs is going to be abstract at some level. It won’t say directly “if someone has a toothache, fix the toothache”; that should follow from a more general principle, combined with the nature of toothaches. The same goes for the extrapolations of individuals and the aggregations of their preferences: the CEV value system in action has to care about particulars, but what it does with those particulars will be governed by an abstract definition.
The question is, what do we need to know about humanity, in order for CEV’s Value Extrapolation Procedure to arrive at the correct abstract definition? This is hard to answer if we don’t know what the VEP is in any detail. But finding a correct VEP is also part of the process.
Apparently a popular proposal for the VEP is something like “upload 10,000 philosophers and let them deliberate for as many subjective years as they need to solve all CEV’s problems and arrive at a consensus”, or similar proposals according to which there is a digital parliament of human proxies (e.g. Jan Leike’s “simulated deliberative democracy”).
I guess this defines a possible VEP, but I have long thought that a better VEP would involve theoretical identification of the existing “human decision procedure” (which I assume is a topic for cognitive neuroscience, and which in the individual is determined through a mix of genes, culture, and life incidents), and then extrapolating that. And again, the human decision procedure would in some way be a template, a schema whose details are “filled out” differently in different individuals (similar to how we learn the grammar and vocabulary of our native languages); and some of CEV’s extrapolation would depend on those details, some of it only on the structure of the schema.
You might even expect that Leike’s democracy would arrive at something like this, rather than just deciding everything via a vote among our extrapolated higher selves, forever. But then do you need the whole digression into upload societies devoted to the task of alignment? You just do AI-assisted neuroscience, figure out how human nature actually works, and “extrapolate” that.
Years ago, I thought that might be what would happen. Instead, the VEP that our frontier AI companies are employing is to engage in value learning from the training corpora, as part of general world-modeling, and then to refine and activate it with RLHF, constitutions, and so forth.
Am I wrong to think that if someone presented you with an alignment proposal roughly as handwavy as your argument in this post (including the linked wiki page and follow-up comments), you would be annoyed and consider it basically worthless? If not, where’s the key asymmetry that means I should find it reassuring anyway?
In the AI alignment case, I think I get why standards are so high: you expect that we’ll only have one chance, the solution has to actually be implemented rather than gestured at, failure means the loss of everything, and misplaced confidence could meaningfully increase risk.
It seems to me that the stakes are similarly high here, as we’re basically talking about someone elevating themself to god-emperor of the light cone. If it turns out that oops, they actually will retain and act on sadistic preferences, or the most efficient way to produce meat (or intelligence) actually does involve terrible suffering which they’ll be oblivious to or unconcerned about, then we could very easily end up with a world that is much worse than nothingness (by my values, and I think those of a significant number of reasonable people).
I’m not sure whether you would claim that you have made strong arguments against these possibilities and I’m wrong not to be convinced, or you would agree that you’ve mainly gestured at your own reasons not to worry so much (at least relative to the risk of unaligned AI takeover).
I don’t find it that reassuring! But also, complete reassurance seems a bit mistaken to aim for here. The kind of decision I am talking about is high stakes on both sides, so there isn’t any particularly obvious conservative action to take (of course, I think the actual thing we should do is not build ASI and not put anyone in this position for a long while, but that’s not the point of the post).
I also certainly wouldn’t consider someone thinking or writing about an alignment proposal in a similar way annoying or worthless. If you have some that you could write up in a similar fashion and depth, please do!
Also, not sure what you mean by the “linked wiki page” being “handwavy”? I mean, CEV is kind of tricky, but I certainly wouldn’t describe the whole thing as “handwavy”?
I guess overall, isn’t… this whole website full of relatively early-stage alignment proposals explained usually at a much lower level of depth?
The post’s opening line says this is a quick post that I hope someone else does a better job on sometime. I think it’s pretty reasonable to not be super compelled, and it certainly deserves a much longer and greater treatment.
Thanks for responding, and point taken that you don’t find it that reassuring and are okay with similarly incomplete alignment proposals.
When I called the CEV page ‘handwavy’, I didn’t mean it wasn’t a good-faith attempt to explain the concept. I think it’s handwavy relative to an account of exactly what it means, at the level required for me to understand how it would actually be implemented, why I should be happy with the consequences, and why I should expect it to emerge from the real-world process of a seemingly bad guy taking full control of an ASI. (Which I admit is a very high bar! But the stakes are high and my priors are low.)
I certainly would like to see much more work on CEV, though there are many things to do, and it’s been a good enough pointer for the purpose of many discussions like this as it is. But I certainly would not object to, and would be excited about, someone making more progress on fleshing it out.
Well, I think this post is substantially intended to engage with people’s priors. Possibly I expressed one of my points better in this comment:
I think my biggest worry is not that we’ll end up ruled by someone who is actively sadistic over the long term (though that does terrify me and I don’t think it’s out of the question), but that we’ll end up ruled by someone who is basically indifferent to the suffering of some subset of others. Which seems very plausible to me, because it doesn’t require them to be a cartoon sadistic villain or even a literal psychopath; they just need to have the same tendency toward limited moral concern as most actual humans have, and to retain it through whatever process of uplift they undergo when interacting with their ASI.
Hopefully (though again I’m not confident), most normal humans would widen their circle appropriately in a situation where they were facing no competitive pressures, meaningful scarcity, avoidable ignorance, etc. But if we do end up with a psychopath in charge, I don’t see why they would move from indifference to caring; basically, I wouldn’t expect the is-ought gap to be bridged by whatever new knowledge and intelligence they gained.
In that second case, it seems to me that we need a lot of optimistic assumptions to hold in order to avoid an s-risk style catastrophe. If the ruler simply doesn’t care about the suffering they cause to whichever conscious entities constitute their outgroup, then we only need one of efficiency/ignorance/aesthetic preference/curiosity/other to lean slightly in favour of the horrible thing in order for it to happen.
I try to address this a bit in the post. I do think the default expectation is that complete indifference towards a certain class of person will just generalize to none of that kind of person existing. Why would they create lots of copies of things they don’t care about?
I’m thinking of animals too, and anything else conscious. So some possible reasons are the production of food and/or intelligence. (I know you sort of argued against the likely existence of suffering in those contexts, but not in enough detail for me to meaningfully update. And I find this point questionable:
It would take only minuscule amounts of caring if the required efficiency sacrifice is minuscule and there are no other contrary motives. In any case, I don’t think “a complete absence of that in humans with intact minds” is sufficiently unlikely. Psychopaths exist, sadists exist, and if we end up with a psychopath in charge, I think it’s entirely plausible that their concern for at least some subset of other conscious entities remains zero or negative; I don’t think you’ve really argued against this.)
A preference for authentic natural environments, combined with indifference to animal suffering (or slight concern outweighed by other concerns), could also lead to the production of immense amounts of suffering forever.
edit: I think there might be too much of a values gap (in that I’m much more negative utilitarian than you) for me to agree with your overall position even if you managed to convince me on most of the factual questions. I take this paragraph to imply that you see the eternal torture of at least thousands (and perhaps some larger number fewer than trillions) of people as a price worth paying for a future that is otherwise not so bad:
I know the amount of good stuff in this hypothetical future could be really, really big, and lots of people will think I’m just falling prey to scope insensitivity or something, but I’ve thought about this a lot and my considered position is that preventing the eternal torture is more important than bringing about the good stuff.
(I also don’t get the “a very specific mix of being wrong about game theory while taking an oddly enlightened perspective on other people’s values” part; it could be the simple fulfilment of a genuine sadistic preference.)
Yep, I think that’s a very confused moral position! I could argue against it here (as a random example, think about whether you would prefer to live a life that is 99.99999999% great and fulfilling, but where once in 10,000 years you would experience a single 100ms moment of torture, which I think is likely an underestimate of the actual ratios here), but it seems like a big topic.
Certainly, if you are inclined to be a negative utilitarian, then this post will not be very reassuring! Indeed, I think almost any human-controlled future would end up looking quite bad, though it depends of course on whether you really are fully negative utilitarian.
This is only relevant given (at least) three assumptions, one about conscious experience and two about aggregation (roughly formalized in the sketch after this list):
Being tortured for a long time and ‘tortured’ for 100ms differ only in length; there’s nothing in the experience of eternal (or very long) torture that distinguishes it from an infinite (or very large) number of isolated 100ms ‘tortures’
Good is separable (in the sense used by Broome) across time
Good is separable across people
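As a rough formalization of what assumptions 2 and 3 amount to (one standard additively separable form, given here only as an illustration; it is not the only way to spell out Broome-style separability):

$$V(\text{world}) = \sum_{i \in \text{people}} \; \sum_{t \in \text{times}} v_i(t)$$

Under a form like this, a brief episode of torture contributes a single small negative term $v_i(t)$ that can in principle be outweighed by enough positive terms elsewhere; rejecting separability across times or across people blocks exactly that trade.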
If you’ve engaged seriously with this issue and are willing to write out an argument demonstrating that mine is a confused position, I will happily read and consider it! If not, I think you’re confusing “confused” with “disagrees with me on something I feel is obvious”.
(My position does require me to bite some actual bullets. But so does yours, and unless you’ve thought about this carefully enough to write about it for real, I suspect you’re underestimating how difficult it is to avoid all three of contradiction, vagueness, and weird/counterintuitive conclusions.)
I do not think psychopaths of this form exist. I might be wrong, but I certainly don’t think the evidence I’ve seen suggests there is variation this deep in how humans care about things. Most things in biology are on a spectrum, and I would be surprised if psychopathy is not one of those. I maybe should write a general post about “why I don’t believe in most neat psychopathologies”.

I do really wish this field of study was higher quality, and maybe I should do a deep dive and form a more consistent opinion on this. Every time I’ve dug into it I’ve been pretty deeply disappointed in what actual evidence we have for things like “there are people who intrinsically like hurting other people” and “there are people who are completely indifferent to the suffering of others”. It’s not that there is nothing, but it’s clear there is a demonization effect whenever you dive into the literature, where people really want to find categorically evil people, even if the evidence really doesn’t support that.

Edit: Oops, I read the original quote here as “it’s entirely plausible that their concern for other conscious entities remains zero or negative”. I think it’s quite likely there are people who have zero concern for some other people. I don’t think there are people who have zero concern for all other people.
[Edited to add a trigger warning for “one of the worst examples of evil”.]
You’re obviously right that personality is on a spectrum, but there’s still a tail!
There are people who try to get children on the internet to send them embarrassing photos, then extort the child with the material to perform sex acts or sadistic acts with siblings and record video, escalating into increasingly more sadistic and power-tripping stuff (like cutting themselves and writing with blood), after each time lying about the last ask having been the last, until often the children involved commit suicide because it doesn’t stop.
You can read in prosecutions that the perpetrators communicate with each other about the pleasure they take in it. Whatever you want to call these people, “concern for some conscious entities is zero or negative” describes the situation accurately, and the original quote you’re replying to was about that, not about whether Hare’s checklist carves nature at its joints.
One way to think of it is: there’s a spectrum of how Person A cares about Person B, and this spectrum goes from positive (compassion, desire to help) to neutral (callous indifference) to negative (schadenfreude, desire to pick a fight).
So “it’s a spectrum” is not in itself an argument for optimism here. (Or sorry if I’m misunderstanding.)
In case it helps, my take on the psychopathy literature is mostly the same as it was 3 years ago when I wrote this comment.
Agree to disagree on that for now I guess! I’d be interested in that deep dive if you end up doing it, though.
Eliezer wrote about the psychological unity of humankind, though he seemingly disavows it now (human value differences also seem to be a theme in Planecrash).
To be clear, I’m not saying that Eliezer’s current views would make negative predictions about Putin’s CEV. I think a more central example is that Eliezer now predicts (I think) that some people’s CEV is nothingness, because they hate suffering so much more than they like flourishing.
Other (fun?) cases to think about are whether you’d rather take your chances on the individual CEV of:
someone high on sociopathy, dark triad, or other non-neurotypical traits
someone who was severely traumatized as a child or adult, e.g. a prisoner of war or abuse victim
a Buddhist monk or nun who has lived a life of extreme asceticism
an enthusiastic meth or heroin addict
someone very committed / deep into woo or spirituality (as part of an organized religion or not)
I currently think that being neurotypical along certain dimensions, non-traumatized, and not in some extreme[1] corner of spirituality / woo is probably somewhat more important (to producing good outcomes for other humans) than how morally good the extrapolee currently is (according to my values), as revealed by their words and actions.
Extrapolating the preferences of an underdeveloped entity (e.g. an animal or a baby) likely leaves most of the important bits unspecified and thus up to the extrapolator, and/or results in nothing of recognizable value to adult humans. And I agree that extrapolating the preferences of an LLM is much more likely to produce something very weird and likely valueless to humans.
i.e. most ordinary religious people of any religion would be fine, if not totally optimal
See also https://en.wikipedia.org/wiki/Fundamental_attribution_error. One of the most persistent and pernicious theory-of-mind failures in public discourse.
I’d guess there are a lot of people out there who genuinely do not love anyone, intrinsically enjoy exerting power over people, and are at best indifferent to making them suffer. This is correlated with power because one consequence of wielding a lot of power is that you will inevitably hurt people, or even just fail to help people as much as you could have. If you are a normal person, this can incur a lot of psychic damage; this is why many people are scared of having power. If you enjoy it and don’t care about hurting people, this is actively attractive.
I don’t buy the “genuinely do not love anyone” assertion. I think this doesn’t match the profiles and biographies and Wikipedia summaries of almost anyone in power I have read. I don’t think it never happens, but it seems very rare, even from this somewhat elevated baseline.
I agree that many people intrinsically enjoy exerting power over people, and have a substantial amount of indifference to their suffering. I mention both of these in the post as things that I don’t think would cause someone to largely squander the future. The arguments against it are not overwhelmingly strong, but I think you need to mess up a bunch more than just enjoying exerting power over people, or being indifferent to other people’s suffering.
Taking active joy in the full complexity of someone’s suffering would I think be the hard case. I currently think the kind of psychological profile necessary to arrive at that in stable reflective equilibrium is very rare (though not literally non-existent), and the few bits you get in this direction from someone being very powerful and broadly considered evil are not enough to end up with substantial probability on this. But it’s a tricky question!
Have you read John Wentworth’s posts…
(I think) different meanings of “love”!
I am interested in whether you think there are many people whose CEV would be either zero-value or net negative? My first thought is that people who are deeply psychologically impaired or broken might be like this (e.g. sociopaths—perhaps SBF), but I am not sure.
Edit: I no longer think that SBF is an especially central/useful pointer here.
@Zach Stein-Perlman Where’s your thumbs-down coming from? SBF is the psychopath that I personally have met a few times, that people know of, and who got a lot of power, so it seems like a useful example to mention. Also don’t forget his personal diary entries read like this:
Reading your first comment above, I reacted as follows:
“What is going on, surely Putin is on the sociopathy/psychopathy spectrum too and not as endearing as SBF with effective altruism as an autistic special interest, so how is this even a question/comparison?”
(With SBF’s CEV I’d admittedly be quite concerned about the greed/recklessness inherent to his philosophy of utilitarianism and risk-taking, and I actually think that sort of thing could backfire uniquely-badly with AIs optimizing everything. But other than that, personality-wise, I just want to flag that while it’s probably always net bad to have people high on the sociopathy/psychopathy spectrum in leadership positions, there’s still a huge difference between the ones where your main concern is only large-scale fraud, vs the ones where the concern is unnecessary wars, genocide, and torture camps.)
I wonder, have many people on here not read the many examples of how harmful certain types of personality can be, for instance in this post and this one?
I have strong-downvoted both of those posts! I think they do a mixture of the “taking propaganda at face value” thing, seem to largely be in some kind of negative affect spiral around imagining arbitrarily bad qualities they can assign to people, seem to take an enormously naive view of the psychopathology literature, seem to consistently avoid dealing with the moral reflection dynamics I outline in this post, and seem very rooted in negative utilitarianism, which I disagree with[1].
Overall, I am pretty sad about that whole set of stuff. I somewhat wish I could talk more to people about it, but the negative utilitarianism-leaning stuff usually makes good-faith discourse a lot harder (since the bottom-line for my interlocutors seems clearly written).
Like man, those posts are so frustrating to read. They just… assert things. No epistemic statuses, no caveats, just stuff like this:
It honestly reads to me kind of like true crime podcasts? They love describing everyone they cover with these kinds of extreme characterizations, which basically never hold up when you look into them. My honest best guess is these posts serve a kind of similar role within our community.
I am confident the psychological characterization in paragraphs like the above is wrong. “Defends pre-existing ideology at all costs”. Come on, no, of course not “all costs”. Indeed, a huge fraction of people caught up in these ideologies end up deconverting or drastically mellowing their views when their social context changes. These seem like hilariously strawmanny absolutes that get ascribed to people here.
This is exactly the kind of thing I mean when I say people seem really taken in by propaganda, and when people engage in the fundamental attribution error. Like, I don’t know how I would differentiate paragraphs like the above from the ravings of religious people talking about demons or devils.
It seems like in these posts all negative attributes must be assigned to these people, and all their cognitive leanings must be absolute and unyielding. They must “defend the pre-existing ideology at all costs”, they are “utterly convinced”, they are “textbook dogmatists”, they place “absolute faith” in the holy texts. Come on, this is not how people work.
and furthermore I think is evidence that something pretty deep is going wrong in how they are thinking about things
Thanks for sharing that take, which I find largely quite bizarre and surprising. I continue to think those posts are super valuable.
I can understand finding the negative(-leaning) utilitarianism codedness of the writings annoying, but I don’t see why you think it makes good-faith discussion a lot harder. From my perspective, a lot of writings on LW are “yay, hurrah team human”-coded in a way that annoys the crap out of me and makes me want to punch things, but it’s not like that means I can’t get valuable things out of the writings or have to treat the posters here as necessarily adversarial.
The subject of the sentence you put in quotation marks was “The fanatic”—as in, “the archetypal example of the fanatic.” This sounds to me like more of a writing style issue. Like, it wouldn’t even occur to me to assume that the post is saying that every believer of a harmful ideology is like that. More that there’s a fanaticism attractor and that people at the center of it really do approach that described extreme—which I think is true? I guess that’s the thing you’re contesting, but I feel like history contains some examples of atrocities that are hard to explain without choosing at least one of the following: either some people can get incredibly fanatical, or some people are sadistic/evil and may use fanaticism/ideology as a cover. Either way, at least one of the posts’ messages must be true?
There are some far-future relevance speculations where my model of what is likely to happen with AI is different from David’s, so I think it’s just much more likely that AIs will take over and it won’t matter what the humans who built it were like. But that doesn’t really invalidate too much—I mean part of the posts are also about what sort of qualities we wouldn’t want to see in AIs that we build. On your point about how AI-aided moral reflection and overcoming resource scarcity would reduce fanaticism or other bad consequences from “bad values,” I think that sort of point is underappreciated in some EA circles, but it’s not like it’s obvious that this is what’s going to happen. I think the posts we’re discussing engage well with reasons why reflection might not solve all the problems.
The rest of the post then tries to argue that these kinds of mental traits are relatively widespread, and from that argues for incorrigibility and absolute fanaticism.
I agree an “archetype” definition could be fine. But it’s IMO clearly not what’s going on. The post makes no attempt at clarifying how far of an outlier all of the above descriptions are, and later includes sections like:
This is obviously absurd! 200-250 million Christians absolutely do not fit the description that I quoted in my comment above, which is the only definition they give for what they mean by “ideological fanatics”:
This is obviously not how this works. This is such a blatant example of category gerrymandering and the noncentral fallacy that I think my upset is very justified.
This is the definition of ideological fanatic as far as I can tell! You are telling me 200-250 million Christians worldwide fit this description?
I agree that’s too high, at least on that strict description of fanaticism. But David also writes “For brevity, we focus here on support for ideological violence as the best proxy for ideological fanaticism.” So you may be right that there’s a bit of motte-and-bailey going on with who gets counted as “fanatical”. But I think the post is clear about what it is or isn’t saying.
And just to be clear, I think the true numbers would probably be somewhat shocking even for the strict/extreme definition. Like maybe a quarter of those 200-250 million, in my estimate. Africa still has witch burnings, some places in the US are very religious and almost every family there has this one (extended-)family member who is really fanatical, and I was mostly thinking about US big cities just now but in rural populations it’s probably even more pronounced.
I mean, I do think the post is clear, and the post is saying “tens of millions of people in the world consider any doubt or deviation from these dogmas as not only wrong but evil, culminating in a total “soldier mindset” which defends the pre-existing ideology at all costs. This necessitates abandoning even the most basic form of empiricism by “rejecting the evidence of one’s own eyes and ears”, to paraphrase Orwell. These people are thus essentially incorrigible and have no epistemic or moral uncertainty, even in the face of widespread opposition”
I agree the post is clear on that, but it also seems extremely unlikely to me.
And you seem to agree! The right number here is not 50-75 million people! Are you telling me that 50-75 million people have “abandoned even the most basic form of empiricism”, “defending their pre-existing ideology at all costs”?
This just doesn’t match any historical conflicts or ideological social movements. Yes, group dynamics can drive people into doing weird and aggressive things, but these are descriptions of what individual people would do, and that a substantial fraction of them would be incapable of reform. In contrast to that, in the moment those group dynamics end, the vast majority of people caught up in these things turn out to be normal and well-adjusted, with the above being a terrible description.
I don’t think I’ve ever met someone who fits the description above. One of the key lines is that “they are incorrigible and have no epistemic or moral uncertainty, even in the face of widespread opposition”. But that doesn’t match what usually goes on here at all. The vast majority of people behaving fanatically would stop doing so when facing widespread opposition, and this is certainly true of the family members I’ve seen here. People are religious fundamentalists because the people around them are. If the people around them stop and start pushing back, a very very small fraction of people would end up insisting for the rest of their lives on their previous beliefs.
I have met Christians like that and I don’t see why those numbers would be too high. We live in extreme filter bubbles.
The language generally does sound a little strong, but I’d guess it to be directionally correct and that your points wouldn’t significantly change the post’s conclusions (though I admit I’ve only skimmed the post). Like, it’s true that a lot of people will change their minds if the social context changes, but if the ideology manages to maintain a stable-enough social context or one that shifts adaptively enough, then those people’s attitudes can stay quite resilient.
And even if huge numbers of people did change their minds, it’s possible for some not to, e.g. because their psychology for one reason or another ends up leaving them no line of retreat, so that anything ends up being less painful than changing one’s mind. The post also notes that it may be enough for a pretty small number of people to be fanatics, if those people end up in control of a state.
I think that Duncan’s heuristic of betting on existence is a pretty good one that’s generally correct, and it also applies to the case of “extreme fanatics do exist”.
This seems like a strawman? The post was assigning some negative attributes to these people, not all of them. For one, extreme tribalism implies a loyalty to your own tribe, which is generally seen as a positive virtue. (Stereotypical demons don’t even have that quality.)
How to differentiate this from talk about demons or devils—well, most obviously, demons and devils are supernatural and incompatible with any naturalistic understanding of the world. Some variables within the brain getting stuck in an extreme setting is not. (I do find it a plausible claim that e.g. a superintelligence capable of arbitrarily manipulating such a person’s environment could always find some way of getting the person to change their mind, but I think the post is most reasonably read as “incapable of changing their minds for most practical purposes”.)
[...]
I am not arguing against “there are literally no people who could meaningfully be described as having these attributes”, but the post goes on to classify over 500 million people worldwide as having these attributes!
My guess is you would agree with me here if you read the article beyond skimming.
Again, their only definition of an “ideological fanatic” is that extremely extremely strong summary I linked above. There is no section of the post that’s like “of course, the vast vast majority of ideological fanatics do not think anything like this and are not well-described by this, and are largely behaving this way due to social momentum, and are maybe mildly on a spectrum in this direction”. It just creates an extremely intense boogeyman of “ideological fanatics” then classifies ~10% of the world population as matching that description.
I think these excerpts are saying something like that?
I agree these talk about potential scales, but it seems to me that they describe “ideological fanatics” as a pretty extreme point on that scale, and then classify at least hundreds of millions of people as falling on that side of the scale?
These are not the right quantitative adjectives here. The correct ones to use would be:
Practically no people we classify as “ideological fanatics” are true believers
Virtually all (of the 500M+) fanatics are capable of eventual reform
Like, I don’t see how you could use the original quantifiers (“not all” and “many”), and after multiplying them through with the quantitative estimates not arrive at numbers at least 3 OOMs too high for these traits.
I agree that the original text is ambiguous in this regard and that there are reasonable grounds for your reading. Personally I interpret sentences like
to mean something like “we are giving descriptions for the most extreme version of ideological fanaticism as that’s the easiest to gesture toward, but will also include people with less extreme versions when trying to estimate the sizes of the movements”.
I think it’s also relevant that the section on dogmatic certainty that you quoted begins with
implying that not all fanatics are this extreme; and of course the section on dogmatic certainty was one of the three sub-dimensions for ideological fanaticism rather than the overall definition.
But I think “which one of these readings is more correct” gets pretty subjective and impossible to resolve, so for me the more important test is something like… “if the authors could be read as making either a strong claim or a weaker one, how much do their conclusions depend on the stronger claim?”.
And it seems to me that on an interpretation where they only mean something like “virtually all fanatics are capable of eventual reform in principle, but in practice may be stuck in an environment where that is very unlikely”… then the various dangers that they outline, like “Ideological fanaticism increases the risk of war and conflict” or “Fanatical retributivism may lead to astronomical suffering”, still sound plausible.
Of course, it’s still fair to criticize the authors for e.g. being unclear about this or for implying a stronger claim when a weaker claim would suffice, but I wouldn’t strong-downvote them for that.
With regard to the bit about the 200-250 million Christians, that section does contain this paragraph
I think a reasonable reading of this section/paragraph is also something like “for purposes of this section, we are defining an ideological fanatic as having some combination of these three traits that’s high enough for them to endorse ideological violence” [not implying that they would necessarily all be maxed out on the “dogmatic certainty” dimension].
Here we could apply a similar test of… “if we interpret the 200-250 million Christians narrowly as only being people who endorse ideological violence, rather than assuming that they’re necessarily incapable of changing their minds, does this still e.g. increase the risk of war and conflict?”. I think the answer is pretty clearly yes.
I have little understanding of Putin’s personality. This is why I thought SBF was a better example, because we have more detailed understanding of him—for instance, we have his diary entries! He says it’s likely that all his emotions and empathy are fake! I understand that Putin has probably committed heinous acts, but I am not as aware of evidence he is biologically as impaired emotionally/psychologically as SBF is.
Mm, I am not convinced that if SBF were made the leader of Russia he would not do much more evil than Putin. My impression is that SBF is morally shameless, more competent, and quite unusually ideologically committed to “the ends justify the means”.
Actually, I chatted with an LLM for a bit, and I changed my mind, I no longer think SBF is an especially good/central example of a psychopath. (link to my chat with chatgpt)
“He says it’s likely that all his emotions and empathy are fake!”
I don’t get why you think that’s such a big deal. A lot of people are like that. My guess is something like 5%. A lot of people who are like that don’t admit it. Surely dictators are like 50% likely to be like that just on priors, and then you can add KGB history for Putin.
I feel like you’re overupdating based on SBF admitting something, while not inferring things about Putin (or other dictators) based on past behavior and based on the demands of their role (and getting there).
I mean I understand not knowing much about Putin specifically; if I’m honest, I also don’t know much/couldn’t give you detailed examples, but I’m actually somewhat familiar with KGB history due to an interest in Cold War spy stories, and it’s been said that Putin was an exemplary KGB specimen or whatever, so, we can probably infer with 99.9% confidence that he thinks “ends justify the means” too, because imagine being in the KGB and voicing deontological objections to your superiors, do you think you’re going to rise up through the ranks?
BTW my model of that sort of personality is that when someone says the things that SBF wrote about himself, it’s still compatible with them having genuine feelings of fondness (though somewhat faint rather than all-consuming) for a person (or animal) or two in their lives. And maybe that’s why habryka thinks these personality traits are overpainted/demonized. I even agree that some people might be too categorically negative about the idea that some people on the sociopathy/psychopathy spectrum may actually be alright at least if you’re able to contain some of their bad patterns (like lying). But for the most part, I’d say it still makes for bad leadership and stewardship of others when someone is like that even in the more benign expressions, and we haven’t even gotten into the topic of extreme sadism and tails generally yet, which by the looks of it (my comment here being disagree-voted and the lack of logic in “it’s a spectrum” arguments also pointed out by Steven Byrnes, and surrounding discussion there generally) some people here seem to be in denial about. I don’t understand what’s going on.
Edit: I looked into figures a bit and I think it’s more like 3% for the whole population, but 5% for men specifically feels like the right estimate to me. And this is assuming “blunted emotionality” rather than “literally has no emotions ever”.
I disagree-voted it, mostly because my strong guess (based on having done that for a bunch of other crimes like this) is that the actual drivers of the crime you are talking about here won’t actually be well-characterized as the kind of sadism you are talking about. It would require digging into the details, and it didn’t seem worth it to me to do that, so just a disagreement-vote seemed most appropriate.
If you end up looking into it (e.g., you could talk to Claude starting with a prompt and our recent comments here) and change your mind (or not), please let me know. I suggest doing so on a day where you’re not necessarily planning to get a lot of work done, because reading about this stuff really weighs you down. Unfortunately the sadism component is on-the-nose.
I agree with you that it’s often the case that the media paints people as evil where other stuff is going on rather than just “evil personality full stop” (like the intense hatred towards mothers who harm their children when they suffer from extreme postnatal depression, or have mental problems that generate Munchausen by proxy expressions). But sometimes people really like torturing others for fun and there’s ample documentation of that sort of personality not just in the sextortion cases I alluded to, but throughout history when you read about places that used torture (not even just the victims saying that the torturers seemed to enjoy it, sometimes the torturers write about it themselves).
I wonder if maybe there’s a selection effect where the media kind of stops reporting on things that get too shocking, meaning where extreme sadism is involved, so if you just go by shocking media examples, it’s possible to miss the tails. But it’s different with history where historians often go to great lengths highlighting how bad the atrocities were in some times and places.
I think SBF is an example that would be hard to debate in a neutral way, due to the beef many people will have with him, and the stronger feelings people have as someone who was/is close to this community. I share some distaste for trying to use him as an example in this context.
Ok, fair enough, I can see that it would overall be easier for the discourse to stick to examples that are less local.
I talk a bit about this in this section:
And then certainly it doesn’t seem that hard to imagine people whose CEV is zero-value, or in a cosmic sense close to it. It seems like there will be at least some people who end up not caring for the cosmos, who just want to live out their lives in peace, and then die, in the way they were expecting to for most of their life. Hopefully such a person would end up giving over the cosmos to someone who does care, but if they don’t then that would qualify.
This is gonna sound like the most caveman grug tier question. But like, would Vladimir Putin just force every woman on earth into his harem? Cause that sounds pretty bad from my CEV.
Your priors are reasonable. The CEV of a random human is closer to my CEV than that of a random non-human mammal, or a random current LLM. The evidence of Putin’s behavior doesn’t move your beliefs much. So, you would prefer Putin’s CEV to Claude’s CEV. It’s hypothetical because we don’t have a way to achieve either, today.
(I wrote a List of Human Lethalities draft, but I don’t think it’s novel)
In 2030, if we are alive and have an intent-aligned AI, we must have made huge strides in interpretability and alignment. At that point we will also have a lot more evidence about virtue-aligned AIs and will be a lot better at aligning them to virtue. We won’t have any more evidence about humans. So in 2030 it will be better to hand the intent-aligned AI to the best virtue-aligned AI, “Viraj”, than to a human. Or, equivalently, hand control directly to Viraj.
In that hypothetical 2030, it would be sad if a human took control of the intent-aligned AI instead of Viraj. We can avoid this sadness by not training intent-aligned AIs and instead only training virtue-aligned AIs. This also improves the prospects for cooperation. Stealing or launching an incompletely aligned virtue-aligned AI is less effective for the defector, and it’s possible to collaborate on the intended virtues.
You don’t see this norm as simply the leadership colluding to keep sending the commoners they rule over to the meat grinder to reduce their own personal risk?
It might very well be the reason for the norm! But I think international diplomacy and governance would be even harder than it already is if all leaders were likely to get assassinated during their term, or had to take even more extreme precautions to avoid that than they already do. So independently of the reason, I support it.
Putin hates people he considers traitors, and his definition of such is expansive. There would be an individual S-risk for those who would seek to dethrone him. However, he does appear to want immortality for himself, and I suspect this extends to the population at large. Someone like Elon Musk, though much higher in openness than Putin, is also personally vindictive, but is ideologically opposed to life extension. So I am not sure that Putin as AI overlord would be a highly subpar outcome relative to Musk as AI overlord, or even at all. (Certainly I would consider Putin far superior to movements or people who fundamentally reject modernity, such as Islamists or Far Right trads and Nazis. And likewise obviously far inferior to the frontier AI labs, EA/LW, and conventional liberal democratic factions like the Dems and Eurocrats.) Obviously Putin would immediately move to fulfill his particular world optimization visions, so no more Ukraine (or Belarus, or probably independent Baltics), but then again, any post-AGI environment, be it under Putin or anyone else, will quickly become so weird that I don’t know if pre-AGI geopolitical obsessions will remain relevant for long. Indeed, it seems unlikely that any AI overlord would long remain consumed by questions of who owns what clay in the face of the cosmic possibilities that are unlocked to them.
(This comment feels a bit too political for my tastes. I used Putin here mostly because of the hilarious Kelsey tweet, I don’t actually think this conversation will go well if we start trying to compare all the different ideologies and how likely their leaders are to fuck things up)
OK, I will refrain from continuing this thread beyond this reply, but I would like to more fully expound on this idea before I go. I think such comparisons are useful and important because the list of plausible candidates for AI overlordship is actually quite small, so their personalities and politics can be meaningfully discussed in this context. This list includes the handful of frontier labs and their CEOs; Xi/Xi’s successor/”the CPC”; Musk; Trump; Vance, Rubio, Newsom, and the half dozen other Americans who might plausibly be President 2028-32; the “US”… and beyond that, it rapidly diffuses out into much larger collectives such as “The Internet”, “humanity”, “the noosphere”—or banal extinction. (Incidentally, though, I do agree that Putin is barely worth talking about because Russia’s chances of being first to AGI are ~0%). People who attain outsized political and business success tend to be much more “odd” than the population average, it is genuinely difficult to explain much of what is happening in both domestic politics and geopolitics without accounting for their psychological quirks, and I think it is very plausible that the impact of these individual personality factors would if anything be magnified to new extremes were they to be given the opportunity to emanate their CEV across all of humanity.