I want to keep picking a fight about “will the AI care so little about humans that it just kills them all?” This is different from a broader sense of cosmopolitanism, and moreover I’m not objecting to the narrow claim “doesn’t come for free.” But it’s directly related to the actual emotional content of your parables and paragraphs, and it keeps coming up recently with you and Eliezer, and I think it’s an important way that this particular post looks wrong even if the literal claim is trivially true.
(Note: I believe that AI takeover has a ~50% probability of killing billions and should be strongly avoided, and would be a serious and irreversible decision by our society that’s likely to be a mistake even if it doesn’t lead to billions of deaths.)
Humans care about the preferences of other agents they interact with (not much, just a little bit!), even when those agents are weak enough to be powerless. It’s not just that we have some preferences about the aesthetics of cows, which could be better optimized by having some highly optimized cow-shaped objects. It’s that we actually care (a little bit!) about the actual cows getting what they actually want, trying our best to understand their preferences and act on them and not to do something that they would regard as crazy and perverse if they understood it.
If we kill the cows, it’s because killing them meaningfully helped us achieve some other goals. We won’t kill them for arbitrarily insignificant reasons. In fact I think it’s safe to say that we’d collectively allocate much more than 1/millionth of our resources towards protecting the preferences of whatever weak agents happen to exist in the world (obviously the cows get only a small fraction of that).
Before really getting into it, some caveats about what I want to talk about:
I don’t want to focus on whatever form of altruism you and Eliezer in particular have (which might or might not be more dependent on some potentially-idiosyncratic notion of “sentience.”) I want to talk about caring about whatever weak agents happen to actually exist, which I think is reasonably common amongst humans. Let’s call that “kindness” for the purpose of this comment. I don’t think it’s a great term but it’s the best short handle I have.
I’ll talk informally about how quantitatively kind an agent is, by which I mean something like: how much of its resources it would allocate to helping weak agents get what they want? How highly does it weigh that part of its preferences against other parts? To the extent it can be modeled as an economy of subagents, what fraction of them are kind (or were kind pre-bargain)?
I don’t want to talk about whether the aliens would be very kind. I specifically want to talk about tiny levels of kindness, sufficient to make a trivial effort to make life good for a weak species you encounter but not sufficient to make big sacrifices on its behalf.
I’m not talking about whether the AI has spite or other strong preferences that are incompatible with human survival. I’m engaging specifically with the claim that AI is likely to care so little one way or the other that it would prefer to just use the humans for atoms.
You and Eliezer seem to think there’s a 90% chance that AI will be <1/trillion kind (perhaps even a 90% chance that they have exactly 0 kindness?). But we have one example of a smart mind, and in fact: (i) it has tons of diverse shards of preference-on-reflection, varying across and within individuals, and (ii) it has >1/million kindness. So it’s superficially striking to be confident AI systems will have a million times less kindness.
I have no idea under what conditions evolved or selected life would be kind. The more preferences are messy with lots of moving pieces, the more probable it is that at least 1/trillion of those preferences are kind (since the less correlated the trillion different shards of preference are with one another, the more chances you get). And the selection pressure against small levels of kindness is ~trivial, so this is mostly a question about idiosyncrasies and inductive biases of minds rather than anything that can be settled by an appeal to selection dynamics.
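To make the "more chances" point concrete, here is a minimal sketch under a toy model that is my assumption, not something from the discussion: a mind has some number of preference shards, each one independently turns out kind with the same small probability, and we ask how likely it is that at least one shard is kind. The function name and the numbers are hypothetical.

```python
import math


def p_at_least_one_kind(n_shards: int, p_kind: float) -> float:
    """P(at least one of n independent shards is kind) = 1 - (1 - p)^n.

    Toy assumption: shards are independent and each is kind with the same
    probability p_kind, where 0 <= p_kind < 1.
    """
    # log1p/expm1 keep precision when p_kind is tiny.
    return -math.expm1(n_shards * math.log1p(-p_kind))


if __name__ == "__main__":
    # Hypothetical per-shard probability; only the trend matters here.
    for n in (1, 10, 100, 1000):
        print(n, p_at_least_one_kind(n, 0.01))
```

Fully independent shards are the extreme case. If the shards were perfectly correlated, the mind would get only one draw and the probability would stay at the per-shard value however many shards there are, which is the sense in which less correlation means more chances.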
I can’t tell if you think kindness is rare amongst aliens, or if you think it’s common amongst aliens but rare amongst AIs. Either way, I would like to understand why you think that. What is it that makes humans so weird in this way?
(And maybe I’m being unfair here by lumping you and Eliezer together—maybe in the previous post you were just talking about how the hypothetical AI that had 0 kindness would kill us, and in this post how kindness isn’t guaranteed. But you give really strong vibes in your writing, including this post. And in other places I think you do say things that don’t actually add up unless you think that AI is very likely to be <1/trillion kind. But at any rate, if this post is unfair to you, then you can just sympathize and consider it directed at Eliezer instead who lays out this position much more explicitly though not in a convenient place to engage with.)
Here are some arguments you could make that kindness is unlikely, and my objections:
“We can’t solve alignment at all.” But evolution is making no deliberate effort to make humans kind, so this is a non-sequitur.
“This is like a Texas sharpshooter hitting the side of a barn then drawing a target around the point they hit; every evolved creature might decide that their own idiosyncrasies are common but in reality none of them are.” But all the evolved creatures wonder if a powerful AI they built would kill them or if it would be kind. So we’re all asking the same question, we’re not changing the question based on our own idiosyncratic properties. This would have been a bias if we’d said: humans like art, so probably our AI will like art too. In that case the fact that we were interested in “art” was downstream of the fact that humans had this property. But for kindness I think we just have n=1 sample of observing a kind mind, without any analogous selection effect undermining the inference.
“Kindness is just a consequence of misfiring [kindness for kin / attachment to babies / whatever other simple story].” AI will be selected in its own ways that could give rise to kindness (e.g. being selected to do things that humans like, or to appear kind). The a priori argument for why that selection would lead to kindness seems about as good as the a priori argument for humans. And on the other side, the incentives for humans to be not kind seem if anything stronger than the incentives for ML systems to not be kind. This mostly seems like ungrounded evolutionary psychology, though maybe there are some persuasive arguments or evidence I’ve just never seen.
“Kindness is a result of the suboptimality inherent in compressing a brain down into a genome.” ML systems are suboptimal in their own random set of ways, and I’ve never seen any persuasive argument that one kind of suboptimality would lead to kindness and the other wouldn’t (I think the reverse direction is equally plausible). Note also that humans absolutely can distinguish powerful agents from weak agents, and they can distinguish kin from unrelated weak agents, and yet we care a little bit about all of them. So the super naive arguments for suboptimality (that might have appealed to information bottlenecks in a more straightforward way) just don’t work. We are really playing a kind of complicated guessing game about what is easy for SGD vs easy for a genome shaping human development.
“Kindness seems like it should be rare a priori, we can’t update that much from n=1.” But the a priori argument is a poorly grounded guess about the inductive biases of spaces of possible minds (and genomes), since the levels of kindness we are talking about are too small to be under meaningful direct selection pressure. So I don’t think the a priori arguments are even as strong as the n=1 observation. On top of that, the more that preferences are diverse and incoherent the more chances you have to get some kindness in the mix, so you’d have to be even more confident in your a priori reasoning.
“Kindness is a totally random thing, just like maximizing squiggles, so it should represent a vanishingly small fraction of generic preferences, much less than 1/trillion.” Setting aside my a priori objections to this argument, we have an actual observation of an evolved mind having >1/million kindness. So evidently it’s just not that rare, and the other points on this list respond to various objections you might have used to try to salvage the claim that kindness is super rare despite occurring in humans (this isn’t analogous to a Texas sharpshooter, there aren’t great debunking explanations for why humans but not ML would be kind, etc.). See this twitter thread where I think Eliezer is really off base, both on this point and on the relevance of diverse and incoherent goals to the discussion.
Note that in this comment I’m not touching on acausal trade (with successful humans) or ECL. I think those are very relevant to whether AI systems kill everyone, but are less related to this implicit claim about kindness which comes across in your parables (since acausally trading AIs are basically analogous to the ants who don’t kill us because we have power).
A final note, more explicitly lumping you with Eliezer: if we can’t get on the same page about our predictions I’m at least aiming to get folks to stop arguing so confidently for death given takeover. It’s easy to argue that AI takeover is very scary for humans, has a significant probability of killing billions of humans from rapid industrialization and conflict, and is a really weighty decision even if we don’t all die and it’s “just” handing over control over the universe. Arguing that P(death|takeover) is 100% rather than 50% doesn’t improve your case very much, but it means that doomers are often getting into fights where I think they look unreasonable.
I think OP’s broader point seems more important and defensible: “cosmopolitanism isn’t free” is a load-bearing step in explaining why handing over the universe to AI is a weighty decision. I’d just like to decouple it from “complete lack of kindness.”
Eliezer has a longer explanation of his view here.
My understanding of his argument is: there are a lot of contingencies that reflect how and whether humans are kind. Because there are so many contingencies, it is somewhat unlikely that aliens would go down a similar route, and essentially impossible for ML. So maybe aliens have a 5% probability of being nice and ML systems have ~0% probability of being nice. I think this argument is just talking about why we shouldn’t update too much from humans, and there is an important background assumption that kindness is super weird and so won’t be produced very often by other processes, i.e. the only reason to think it might happen is that it happened in the single case we observed.
I find this pretty unconvincing. He lists like 10 things (humans need to trade favors, we’re not smart enough to track favors and kinship explicitly, and we tend to be allied with nearby humans so want to be nice to those around us, we use empathy to model other humans, and we had religion and moral realism for contingent reasons, we weren’t optimized too much once we were smart enough that our instrumental reasoning screens off kindness heuristics).
But no argument is given for why these are unusually kindness-inducing settings of the variables. And the outcome isn’t like a special combination of all of them, they each seem like factors that contribute randomly. It’s just a lot of stuff mixing together.
Presumably there is no process that ensures humans have lots of kindness-inducing features (and we didn’t select kindness as a property for which humans were notable, we’re just asking the civilization-independent question “does our AI kill us”). So if you list 10 random things that make humans more kind, it strongly suggests that other aliens will also have a bunch of random things that make them more kind. It might not be 10, and the net effect might be larger or smaller. But:
I have no idea whatsoever how you are anchoring this distribution, and giving it a narrow enough spread to have confident predictions.
Statements like “kindness is super weird” are wildly implausible if you’ve just listed 5 independent plausible mechanisms for generating kindness. You are making detailed quantitative guesses here, not ruling something out for any plausible a priori reason.
As a matter of formal reasoning, listing more and more contingencies that combine apparently-additively tends to decrease rather than increase the variance of kindness across the population. If there was just a single random thing about humans that drove kindness it would be more plausible that we’re extreme. If you are listing 10 things then things are going to start averaging out (and you expect that your 10 things are cherry-picked to be the ones most relevant to humans, but you can easily list 10 more candidates).
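The averaging-out point can be illustrated with a small simulation. This is a sketch under assumptions of my own: each mind's total kindness is a plain sum of independent contributions, each drawn uniformly from [0, 1], and "variance across the population" is read as spread relative to the mean (standard deviation divided by mean). The raw variance of a sum grows with the number of terms; it is the relative spread that shrinks, roughly as 1/sqrt(k) for k factors. All names and parameters are hypothetical.

```python
import random
import statistics


def relative_spread(n_factors: int, n_minds: int = 5_000, seed: int = 0) -> float:
    """Std/mean of total kindness across simulated minds.

    Toy assumption: each mind's kindness is the sum of n_factors independent
    contributions, each uniform on [0, 1].
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    totals = [
        sum(rng.random() for _ in range(n_factors))
        for _ in range(n_minds)
    ]
    return statistics.pstdev(totals) / statistics.fmean(totals)


if __name__ == "__main__":
    for k in (1, 10, 100):
        print(k, round(relative_spread(k), 3))
```

With one driving factor, a given mind can easily sit far from the typical level. With ten or a hundred additive factors, most simulated minds land close to the average, so being an extreme outlier becomes less plausible under this model.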
In fact it’s easy to list analogous things that could apply to ML (and I can imagine the identical conversation where hypothetical systems trained by ML are talking about how stupid it is to think that evolved life could end up being kind). Most obviously, they are trained in an environment where being kind to humans is a very good instrumental strategy. But they are also trained to closely imitate humans who are known to be kind, they’ve been operating in a social environment where they are very strongly expected to appear to be kind, etc. Eliezer seems to believe this kind of thing gets you “ice cream and condoms” instead of kindness OOD, but just one sentence ago he explained why similar (indeed, superficially much weaker!) factors led to humans retaining niceness out of distribution. I just don’t think we have the kind of a priori asymmetry or argument here that would make you think humans are way kinder than models. Yeah it can get you to ~50% or even somewhat lower, but ~0% seems like a joke.
There was one argument that I found compelling, which I would summarize as: humans were optimized while they were dumb. If evolution had kept optimizing us while we got smart, eventually we would have stopped being so kind. In ML we just keep on optimizing as the system gets smart. I think this doesn’t really work unless being kind is a competitive disadvantage for ML systems on the training distribution. But I do agree that if you train your AI long enough on cases where being kind is a significant liability, it will eventually stop being kind.
Short version: I don’t buy that humans are “micro-pseudokind” in your sense; if you say “for just $5 you could have all the fish have their preferences satisfied” I might do it, but not if I could instead spend $5 on having the fish have their preferences satisfied in a way that ultimately leads to them ascending and learning the meaning of friendship, as is entangled with the rest of my values.
Meta:
Note: I believe that AI takeover has a ~50% probability of killing billions and should be strongly avoided, and would be a serious and irreversible decision by our society that’s likely to be a mistake even if it doesn’t lead to billions of deaths.
So for starters, thanks for making acknowledgements about places we apparently agree, or otherwise attempting to demonstrate that you’ve heard my point before bringing up other points you want to argue about. (I think this makes arguments go better.) (I’ll attempt some of that myself below.)
Secondly, note that it sounds to me like you took a diametric-opposite reading of some of my intended emotional content (which I acknowledge demonstrates flaws in my writing). For instance, I intended the sentence “At that very moment they hear the dinging sound of an egg-timer, as the next-token-predictor ascends to superintelligence and bursts out of its confines” to be a caricature so blatant as to underscore the point that I wasn’t making arguments about takeoff speeds, but was instead focusing on the point about “complexity” not being a saving grace (and “monomaniacalism” not being the issue here). (Alternatively, perhaps I misunderstand what things you call the “emotional content” and how you’re reading it.)
Thirdly, I note that for whatever it’s worth, when I go to new communities and argue this stuff, I don’t try to argue people into >95% chance we’re all going to die in <20 years. I just try to present the arguments as I see them (without hiding the extremity of my own beliefs, nor while particularly expecting to get people to a similarly-extreme place with, say, a 30min talk). My 30min talk targets are usually something more like “>5% probability of existential catastrophe in <20y”. So insofar as you’re like “I’m aiming to get you to stop arguing so confidently for death given takeover”, you might already have met your aims in my case.
(Or perhaps not! Perhaps there’s plenty of emotional-content leaking through given the extremity of my own beliefs, that you find particularly detrimental. To which the solution is of course discussion on the object-level, which I’ll turn to momentarily.)
Object:
First, I acknowledge that if an AI cares enough to spend one trillionth of its resources on the satisfaction of fulfilling the preferences of existing “weak agents” in precisely the right way, then there’s a decent chance that current humans experience an enjoyable future.
With regards to your arguments about what you term “kindness” and I shall term “pseudokindness” (on account of thinking that “kindness” brings too much baggage), here’s a variety of places that it sounds like we might disagree:
Pseudokindness seems underdefined, to me, and I expect that many ways of defining it don’t lead to anything like good outcomes for existing humans.
Suppose the AI is like “I am pico-pseudokind; I will dedicate a trillionth of my resources to satisfying the preferences of existing weak agents by granting those existing weak agents their wishes”, and then only the most careful and conscientious humans manage to use those wishes in ways that leave them alive and well.
There are lots and lots of ways to “satisfy the preferences” of the “weak agents” that are humans. Getting precisely the CEV (or whatever it should be repaired into) is a subtle business. Most humans probably don’t yet recognize that they could or should prefer taking their CEV over various more haphazard preference-fulfilments that ultimately leave them unrecognizable and broken. (Or, consider what happens when a pseudokind AI encounters a baby, and seeks to satisfy its preferences. Does it have the baby age?)
You’ve got to do some philosophy to satisfy the preferences of humans correctly. And the issue isn’t that the AI couldn’t solve those philosophy problems correctly-according-to-us, it’s that once we see how wide the space of “possible ways to be pseudokind” is, then “pseudokind in the manner that gives us our CEVs” starts to feel pretty narrow against “pseudokind in the manner that fulfills our revealed preferences, or our stated preferences, or the poorly-considered preferences of philosophically-immature people, or whatever”.
I doubt that humans are micro-pseudokind, as defined. And so in particular, all your arguments of the form “but we’ve seen it arise once” seem suspect to me.
Like, suppose we met fledgeling aliens, and had the opportunity to either fulfil their desires, or leave them alone to mature, or affect their development by teaching them the meaning of friendship. My guess is that we’d teach them the meaning of friendship. I doubt we’d hop in and fulfil their desires.
(Perhaps you’d counter with something like: well if it was super cheap, we might make two copies of the alien civilization, and fulfil one’s desires and teach the other the meaning of friendship. I’m skeptical, for various reasons.)
More generally, even though “one (mill|trill)ionth” feels like a small fraction, the obvious ways to avoid dedicating even a (mill|trill)ionth of your resources to X is if X is right near something even better that you might as well spend the resources on instead.
There’s all sorts of ways to thumb the scales in how a weak agent develops, and there’s many degrees of freedom about what counts as a “pseudo-agent” or what counts as “doing justice to its preferences”, and my read is that humans take one particular contingent set of parameters here and AIs are likely to take another (and that the AI’s other-settings are likely to lead to behavior not-relevantly-distinct from killing everyone).
My read is that insofar as humans do have preferences about doing right by other weak agents, they have all sorts of desire-to-thumb-the-scales mixed in (such that humans are not actually pseudokind, for all that they might be kind).
I have a more-difficult-to-articulate sense that “maybe the AI ends up pseudokind in just the right way such that it gives us a (small, limited, ultimately-childless) glorious transhumanist future” is the sort of thing that reality gets to say “lol no” to, once you learn more details about how the thing works internally.
Most of my argument here is that the space of ways things can end up “caring” about the “preferences” of “weak agents” is wide, and most points within it don’t end up being our point in it, and optimizing towards most points in it doesn’t end up keeping us around at the extremes. My guess is mostly that the space is so wide that you don’t even end up with AIs warping existing humans into unrecognizable states, but do in fact just end up with the people dead (modulo distant aliens buying copies, etc).
I haven’t really tried to quantify how confident I am of this; I’m not sure whether I’d go above 90%, \shrug.
It occurs to me that one possible source of disagreement here is, perhaps you’re trying to say something like:
Nate, you shouldn’t go around saying “if we don’t competently intervene, literally everybody will die” with such a confident tone, when you in fact think there’s a decent chance of scenarios where the AIs keep people around in some form, and make some sort of effort towards fulfilling their desires; most people don’t care about the cosmic endowment like you do; the bluntly-honest and non-manipulative thing to say is that there’s a decent chance they’ll die and a better chance that humanity will lose the cosmic endowment (as you care about more than they do),
whereas my stance has been more like
most people I meet are skeptical that uploads count as them; most people would consider scenarios where their bodies are destroyed by rapid industrialization of Earth but a backup of their brain is stored and then later run in simulation (where perhaps it’s massaged into an unrecognizable form, or kept in an alien zoo, or granted a lovely future on account of distant benefactors, or …) to count as “death”; and also those exotic scenarios don’t seem all that likely to me, so it hasn’t seemed worth caveating.
I’m somewhat persuaded by the claim that failing to mention even the possibility of having your brainstate stored, and then run-and-warped by an AI or aliens or whatever later, or run in an alien zoo later, is potentially misleading.
I’m considering adding footnotes like “note that when I say “I expect everyone to die”, I don’t necessarily mean “without ever some simulation of that human being run again”, although I mostly don’t think this is a particularly comforting caveat”, in the relevant places. I’m curious to what degree that would satisfy your aims (and I welcome workshopped wording on the footnotes, as might both help me make better footnotes and help me understand better where you’re coming from).
I disagree with this but am happy your position is laid out. I’ll just try to give my overall understanding and reply to two points.
Like Oliver, it seems like you are implying:
Humans may be nice to other creatures in some sense, but if the fish were to look at the future that we’d achieve for them using the 1/billionth of resources we spent on helping them, it would be as objectionable to them as “murder everyone” is to us.
I think that normal people being pseudokind in a common-sensical way would instead say:
If we are trying to help some creatures, but those creatures really dislike the proposed way we are “helping” them, then we should try a different tactic for helping them.
I think that some utilitarians (without reflection) plausibly would “help the humans” in a way that most humans consider as bad as being murdered. But I think this is an unusual feature of utilitarians, and most people would consult the beneficiaries, observe they don’t want to be murdered, and so not murder them.
I think that saying “Helping someone in a way they like, sufficiently precisely to avoid things like murdering them, requires precisely the right form of caring—and that’s super rare” is a really misleading sense of how values work and what targets are narrow. I think this is more obvious if you are talking about how humans would treat a weaker species. If that’s the state of the disagreement I’m happy to leave it there.
I’m somewhat persuaded by the claim that failing to mention even the possibility of having your brainstate stored, and then run-and-warped by an AI or aliens or whatever later, or run in an alien zoo later, is potentially misleading.
This is an important distinction at 1/trillion levels of kindness, but at 1/billion levels of kindness I don’t even think the humans have to die.
If we are trying to help some creatures, but those creatures really dislike the proposed way we are “helping” them, then we should do something else.
My picture is less like “the creatures really dislike the proposed help”, and more like “the creatures don’t have terribly consistent preferences, and endorse each step of the chain, and wind up somewhere that they wouldn’t have endorsed if you first extrapolated their volition (but nobody’s extrapolating their volition or checking against that)”.
It sounds to me like your stance is something like “there’s a decent chance that most practically-buildable minds pico-care about correctly extrapolating the volition of various weak agents and fulfilling that extrapolated volition”, which I am much more skeptical of than the weaker “most practically-buildable minds pico-care about satisfying the preferences of weak agents in some sense”.
We’re not talking about practically building minds right now, we are talking about humans.
We’re not talking about “extrapolating volition” in general. We are talking about whether—in attempting to help a creature with preferences about as coherent as human preferences—you end up implementing an outcome that creature considers as bad as death.
For example, we are talking about what would happen if humans were trying to be kind to a weaker species that they had no reason to kill, that could nevertheless communicate clearly and had preferences about as coherent as human preferences (while being very alien).
And those creatures are having a conversation amongst themselves before the humans arrive wondering “Are the humans going to murder us all?” And one of them is saying “I don’t know, they don’t actually benefit from murdering us and they seem to care a tiny bit about being nice, maybe they’ll just let us do our thing with 1/trillionth of the universe’s resources?” while another is saying “They will definitely have strong opinions about what our society should look like and the kind of transformation they implement is about as bad by our lights as being murdered.”
In practice attempts to respect someone’s preferences often involve ideas like autonomy and self-determination and respect for their local preferences. I really don’t think you have to go all the way to extrapolated volition in order to avoid killing everyone.
Humans wound up caring at least a little about satisfying the preferences of other creatures, not in a “grant their local wishes even if that ruins them” sort of way but in some other intuitively-reasonable manner.
Humans are the only minds we’ve seen so far, and so having seen this once, maybe we start with a 50%-or-so chance that it will happen again.
You can then maybe drive this down a fair bit by arguing about how the content looks contingent on the particulars of how humans developed or whatever, and maybe that can drive you down to 10%, but it shouldn’t be able to drive you down to 0.1%, especially not if we’re talking only about incredibly weak preferences.
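The size of the updates in this summary can be checked with the odds form of Bayes' rule. The sketch below is an illustration I am adding, with a hypothetical function name; it computes how strongly the evidence has to favor "not kind" to move a 50% starting point down to a given level.

```python
def required_likelihood_ratio(prior: float, posterior: float) -> float:
    """Bayes factor against a hypothesis needed to move prior to posterior.

    Uses odds form: posterior_odds = prior_odds / factor, so
    factor = prior_odds / posterior_odds. Both inputs are probabilities
    strictly between 0 and 1.
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = posterior / (1 - posterior)
    return prior_odds / posterior_odds


if __name__ == "__main__":
    print(required_likelihood_ratio(0.5, 0.10))   # about 9
    print(required_likelihood_ratio(0.5, 0.001))  # about 999
```

Going from 50% to 10% needs evidence favoring "not kind" by about 9 to 1, while going to 0.1% needs about 999 to 1, which is the gap the summary points at: contingency arguments would have to be roughly a hundred times stronger to justify the second move.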
If so, one guess is that a bunch of disagreement lurks in this “intuitively-reasonable manner” business.
A possible locus of disagreement: it looks to me like, if you give humans power before you give them wisdom, it’s pretty easy to wreck them while simply fulfilling their preferences. (Ex: lots of teens have dumbass philosophies, and might be dumb enough to permanently commit to them if given that power.)
More generally, I think that if mere-humans met very-alien minds with similarly-coherent preferences, and if the humans had the opportunity to magically fulfil certain alien preferences within some resource-budget, my guess is that the humans would have a pretty hard time offering power and wisdom in the right ways such that this overall went well for the aliens by their own lights (as extrapolated at the beginning), at least without some sort of volition-extrapolation.
(I separately expect that if we were doing something more like the volition-extrapolation thing, we’d be tempted to bend the process towards “and they learn the meaning of friendship”.)
That said, this conversation is updating me somewhat towards “a random UFAI would keep existing humans around and warp them in some direction it prefers, rather than killing them”, on the grounds that the argument “maybe preferences-about-existing-agents is just a common way for rando drives to shake out” plausibly supports it to a threshold of at least 1 in 1000. I’m not sure where I’ll end up on that front.
Another attempt at naming a crux: It looks to me like you see this human-style caring about others’ preferences as particularly “simple” or “natural”, in a way that undermines “drawing a target around the bullseye”-type arguments, whereas I could see that argument working for “grant all their wishes (within a budget)” but am much more skeptical when it comes to “do right by them in an intuitively-reasonable way”.
(But that still leaves room for an update towards “the AI doesn’t necessarily kill us, it might merely warp us, or otherwise wreck civilization by bounding us and then giving us power-before-wisdom within those bounds or suchlike, as might be the sort of whims that rando drives shake out into”, which I’ll chew on.)
More generally, I think that if mere-humans met very-alien minds with similarly-coherent preferences, and if the humans had the opportunity to magically fulfill certain alien preferences within some resource-budget, my guess is that the humans would have a pretty hard time offering power and wisdom in the right ways such that this overall went well for the aliens by their own lights (as extrapolated at the beginning), at least without some sort of volition-extrapolation.
Isn’t the worst case scenario just leaving the aliens alone? If I’m worried I’m going to fuck up some alien’s preferences, I’m just not going to give them any power or wisdom!
I guess you think we’re likely to fuck up the alien’s preferences by light of their reflection process, but not our reflection process. But this just recurs to the meta level. If I really do care about an alien’s preferences (as it feels like I do), why can’t I also care about their reflection process (which is just a meta preference)?
I feel like the meta level at which I no longer care about doing right by an alien is basically the meta level at which I stop caring about someone doing right by me. In fact, this is exactly how it seems mentally constructed: what I mean by “doing right by [person]” is “what that person would mean by ‘doing right by me’”. This seems like either something as simple as it naively looks, or sensitive to weird hyperparameters I’m not sure I care about anyway.
(But that still leaves room for an update towards “the AI doesn’t necessarily kill us, it might merely warp us, or otherwise wreck civilization by bounding us and then giving us power-before-wisdom within those bounds or suchlike, as might be the sort of whims that rando drives shake out into”, which I’ll chew on.)
FWIW this is my view. (Assuming no ECL/MSR or acausal trade or other such stuff. If we add those things in, the situation gets somewhat better in expectation I think, because there’ll be trades with faraway places that DO care about our CEV.)
My reading of the argument was something like “bullseye-target arguments refute an artificially privileged target being rated significantly likely under ignorance, e.g. the probability that random aliens will eat ice cream is not 50%. But something like kindness-in-the-relevant-sense is the universal problem faced by all evolved species creating AGI, and is thus not so artificially privileged, and as a yes-no question about which we are ignorant the uniform prior assigns 50%”. It was more about the hypothesis not being artificially privileged by path-dependent concerns than the notion being particularly simple, per se.
I sometimes mention the possibility of being stored and sold to aliens a billion years later, which seems to me to validly incorporate most all the hopes and fears and uncertainties that should properly be involved, without getting into any weirdness that I don’t expect Earthlings to think about validly.
My guess is mostly that the space is so wide that you don’t even end up with AIs warping existing humans into unrecognizable states, but do in fact just end up with the people dead
Why? I see a lot of opportunities for s-risk or just generally suboptimal future in such options, but “we don’t want to die, or at any rate we don’t want to die out as a species” seems like an extremely simple, deeply-ingrained goal that almost any metric by which the AI judges our desires should be expected to pick up, assuming it’s at all pseudokind. (In many cases, humans do a lot to protect endangered species even as we do diddly-squat to fulfill individual specimens’ preferences!)
Some more less-important meta, that is in part me writing out of frustration from how the last few exchanges have gone:
I’m not quite sure what argument you’re trying to have here. Two explicit hypotheses follow, that I haven’t managed to distinguish between yet.
Background context, for establishing common language etc.:
Nate is trying to make a point about inclusive cosmopolitan values being a part of the human inheritance, and not universally compelling.
Paul is trying to make a point about how there’s a decent chance that practical AIs will care at least a tiny amount about the fulfillment of the preferences of existing “weak agents”, herein called “pico-pseudokindness”.
Hypothesis 1: Nate’s trying to make a point about cosmopolitan values that Paul basically agrees with. But Paul thinks Nate’s delivery gives a wrong impression about the tangentially-related question of pico-pseudokindness, probably because (on Paul’s model) Nate’s wrong about pico-pseudokindness, and Paul is taking the opportunity to argue about it.
Hypothesis 2: Nate’s trying to make a point about cosmopolitan values that Paul basically disagrees with. Paul maybe agrees with all the literal words, but thinks that Nate has misunderstood the connection between pico-pseudokindness and cosmopolitan values, and is hoping to convince Nate that these questions are more than tangentially related.
(Or, well, I have a hypothesis-cluster rather than hypotheses, of which these are two representatives, whatever.)
Some notes that might help clear some things up in that regard:
The long version of the title here is not “Cosmopolitan values don’t come cheap”, but rather “Cosmopolitan values are also an aspect of human values, and are not universally compelling”.
I think there’s a common mistake that people outside our small community make, where they’re like “whatever the AIs decide to do, turns out to be good, so long as they decide it while they’re smart; don’t be so carbon-chauvinist and anthropocentric”. A glaring example is Richard Sutton. Heck, I think people inside our community make it decently often, with an example being Robin Hanson.
My model is that many of these people are intuiting that “whatever the AIs decide to do” won’t include vanilla ice cream, but will include broad cosmopolitan value.
It seems worth flatly saying “that’s a crux for me; if I believed that the AIs would naturally have broad inclusive cosmopolitan values then I’d be much more onboard the acceleration train; when I say that the AIs won’t have our values I am not talking just about the “ice cream” part I am also talking about the “broad inclusive cosmopolitan dream” part; I think that even that is at risk”.
If you were to acknowledge something like “yep, folks like Sutton and Hanson are making the mistake you name here, and the broad cosmopolitan dream is very much at risk and can’t be assumed as convergent, but separately you (Nate) seem to be insinuating that you expect it’s hard to get the AIs to care about the broad cosmopolitan dream even a tiny bit, and that it definitely won’t happen by chance, and I want to fight about that here”, then I’d feel like I understood what argument we were having (namely: hypothesis 1 above).
If you were to instead say something like “actually, Nate, I think that these people are accessing a pre-theoretic intuition that’s essentially reasonable, and that you’ve accidentally destroyed with all your premature theorizing, such that I don’t think you should be so confident in your analysis that folk like Sutton and Hanson are making a mistake in this regard”, then I’d also feel like I understood what argument we were having (namely: hypothesis 2 above).
Alternatively, perhaps my misunderstanding runs even deeper, and the discussion you’re trying to have here comes from even farther outside my hypothesis space.
For one reason or another, I’m finding it pretty frustrating to attempt to have this conversation while not knowing which of the above conversations (if either) we’re having. My current guess is that that frustration would ease up if something like hypothesis-1 were true and you made some acknowledgement like the above. (I expect to still feel frustrated in the hypothesis-2 case, though I’m not yet sure why, but might try to tease it out if that turns out to be reality.)
Hypothesis 1 is closer to the mark, though I’d highlight that it’s actually fairly unclear what you mean by “cosmopolitan values” or exactly what claim you are making (and that ambiguity is hiding most of the substance of disagreements).
I’m raising the issue of pico-pseudokindness here because I perceive it as (i) an important undercurrent in this post, (ii) an important part of the actual disagreements you are trying to address. (I tried to flag this at the start.)
More broadly, I don’t really think you are engaging productively with people who disagree with you. I suspect that if you showed this post to someone you perceive yourself to be arguing with, they would say that you seem not to understand the position—the words aren’t really engaging with their view, and the stories aren’t plausible on their models of the world but in ways that go beyond the literal claim in the post.
I think that would hold in particular for Robin Hanson or Rich Sutton. I don’t think they are accessing a pre-theoretic intuition that you are discarding by premature theorizing. I think the better summary is that you don’t understand their position very well or are choosing not to engage with the important parts of it. (Just as Robin doesn’t seem to understand your position ~at all.)
I don’t think the point about pico-pseudokindness is central for either Robin Hanson or Rich Sutton. I think it is more obviously relevant to a bunch of recent arguments Eliezer has gotten into on Twitter.
Thanks! I’m curious for your paraphrase of the opposing view that you think I’m failing to understand.
(I put >50% probability that I could paraphrase a version of “if the AIs decide to kill us, that’s fine” that Sutton would basically endorse (in the right social context), and that would basically route through a version of “broad cosmopolitan value is universally compelling”, but perhaps when you give a paraphrase it will sound like an obviously-better explanation of the opposing view and I’ll update.)
Humans and AI systems probably want different things. From the human perspective, it would be better if the universe was determined by what the humans wanted. But we shouldn’t be willing to pay huge costs, and shouldn’t attempt to create a slave society where AI systems do humans’ bidding forever, just to ensure that human values win out. After all, we really wouldn’t want that outcome if our situations had been reversed. And indeed we are the beneficiary of similar values-turnover in the past, as our ancestors have been open (perhaps by necessity rather than choice) to values changes that they would sometimes prefer hadn’t happened.
We can imagine really sterile outcomes, like replicators colonizing space with an identical pattern repeated endlessly, or AI systems that want to maximize the number of paperclips. And considering those outcomes can help undermine the cosmopolitan intuition that we should respect the AI we build. But in fact that intuition pump relies crucially on its wildly unrealistic premises, that the kind of thing brought about by AI systems will be sterile and uninteresting. If we instead treat “paperclip” as an analog for some crazy weird shit that is alien and valence-less to humans, drawn from the same barrel of arbitrary and diverse desires that can be produced by selection processes, then the intuition pump loses all force. I’m back to feeling like our situations could have been reversed, and we shouldn’t be total assholes to the AI.
I don’t think that requires anything at all about AI systems converging to cosmopolitan values in the sense you are discussing here. I do think it is much more compelling if you accept some kind of analogy between the sorts of processes shaping human values and the processes shaping AI values, but this post (and the references you cite and other discussions you’ve had) don’t actually engage with the substance of that analogy and the kinds of issues raised in my comment are much closer to getting at the meat of the issue.
I also think the “not for free” part doesn’t contradict the views of Rich Sutton. I asked him this question and he agrees that all else equal it would be better if we handed off to human uploads instead of powerful AI. I think his view is that the proposed course of action from the alignment community is morally horrifying (since in practice he thinks the alternative is “attempt to have a slave society,” not “slow down AI progress for decades”—I think he might also believe that stagnation is much worse than a handoff but haven’t heard his view on this specifically) and that even if you are losing something in expectation by handing the universe off to AI systems it’s not as bad as the alternative.
Thanks! Seems like a fine summary to me, and likely better than I would have done, and it includes a piece or two that I didn’t have (such as an argument from symmetry if the situations were reversed). I do think I knew a bunch of it, though. And e.g., my second parable was intended to be a pretty direct response to something like
If we instead treat “paperclip” as an analog for some crazy weird shit that is alien and valence-less to humans, drawn from the same barrel of arbitrary and diverse desires that can be produced by selection processes, then the intuition pump loses all force.
where it’s essentially trying to argue that this intuition pump still has force in precisely this case.
To the extent the second parable has this kind of intuitive force I think it comes from: (i) the fact that the resulting values still sound really silly and simple (which I think is mostly deliberate hyperbole), (ii) the fact that the AI kills everyone along the way.
This comment changed my mind on the probability that evolved aliens are likely to end up kind, which I now think is somewhat more likely than 5%. I still think AI systems are unlikely to have kindness, for something like the reason you give at the end:
In ML we just keep on optimizing as the system gets smart. I think this doesn’t really work unless being kind is a competitive disadvantage for ML systems on the training distribution.
I actually think it’s somewhat likely that ML systems won’t value kindness at all before they are superhuman enough to take over. I expect kindness as a value within the system itself not to arise spontaneously during training, and that no one will succeed at eliciting it deliberately before take over. (The outward behavior of the system may appear to be kind, and mechanistic interpretability may show that some internal component of the system has a correct understanding of kindness. But that’s not the same as the system itself valuing kindness the way that humans do or aliens might.)
Might write a longer reply at some point, but the reason why I don’t expect “kindness” in AIs (as you define it here) is that I don’t expect “kindness” to be the kind of concept that is robust to cosmic levels of optimization pressure applied to it, and I expect will instead come apart when you apply various reflective principles and eliminate any status-quo bias, even if it exists in an AI mind (and I also think it is quite plausible that it is completely absent).
Like, different versions of kindness might or might not put almost all of their considerateness on all the different types of minds that could hypothetically exist, instead of the minds that currently exist right now. Indeed, I expect it’s more likely than not that I myself will end up in that moral equilibrium, and won’t be interested in extending any special consideration to systems that happened to have been alive in 2022, instead of the systems that could have been alive and seem cooler to me to extend consideration towards.
Another way to say the same thing is that if AI extends consideration towards something human-like, I expect that it will use some superstimuli-human-ideal as a reference point, which will be a much more ideal thing to be kind towards than current humans by its own lights (for an LLM this might be cognitive processes much more optimized for producing internet text than current humans, though that is really very speculative, and is more trying to illustrate the core idea of a superstimuli-human). I currently think few superstimuli-humans like this would still qualify by my lights to count as “human” (though it might by the lights of the AI).
I do find the game-theoretic and acausal trade case against AI killing literally everyone stronger, though it does depend on the chance of us solving alignment in the first place, and so feels a bit recursive in these conversations (like, in order for us to be able to negotiate with the AIs, there needs to be some chance we end up in control of the cosmic endowment in the first place, otherwise we don’t have anything to bargain with).
Humans might respect the preferences of weak agents right now, but if they thought about it for longer they’d pretty robustly just want to completely destroy the existing agents (including a hypothetical alien creator) and replace them with something better. No reason to honor that kind of arbitrary path dependence.
If so, it seems like you wouldn’t be making an argument about AI or aliens at all, but rather an empirical claim about what would happen if humans were to think for a long time (and become more the people we wished to be and so on).
That seems like an important angle that my comment didn’t address at all. I personally don’t believe that humans would collectively stamp out 99% of their kindness to existing agents (in favor of utilitarian optimization) if you gave them enough time to reflect. That sounds like a longer discussion. I also think that if you expressed the argument in this form to a normal person they would be skeptical about the strong claims about human nature (and would be skeptical of doomer expertise on that topic), and so if this ends up being the crux it’s worth being aware of where the conversation goes and my bottom line recommendation of more epistemic humility may still be justified.
It’s hard to distinguish human kindness from arguably decision-theoretic reasoning like “our positions could have been reversed, would I want them to do the same to me?” but I don’t think the distinction between kindness and common-sense morality and decision theory is particularly important here except insofar as we want to avoid double-counting.
(This does call to mind another important argument that I didn’t discuss in my original comment: “kindness is primarily a product of moral norms produced by cultural accumulation and domestication, and there will be no analogous process amongst AI systems.” I have the same reaction as to the evolutionary psychology explanations. Evidently the resulting kindness extends beyond the actual participants in that cultural process, so I think you need to be making more detailed guesses about minds and culture and so on to have a strong a priori view between AI and humans.)
Humans might respect the preferences of weak agents right now, but if they thought about it for longer they’d pretty robustly just want to completely destroy the existing agents (including a hypothetical alien creator) and replace them with something better. No reason to honor that kind of arbitrary path dependence.
No, this doesn’t feel accurate. What I am saying is more something like:
The way humans think about the question of “preferences for weak agents” and “kindness” feels like the kind of thing that will come apart under extreme optimization, in a similar way to how I expect the idea of “having a continuous stream of consciousness with a good past and good future is important” to come apart as humans can make copies of themselves and change their memories, and instantiate slightly changed versions of themselves, etc.
The way this comes apart seems very chaotic to me, and dependent enough on the exact metaethical and cultural and environmental starting conditions that I wouldn’t be that surprised if I disagree even with other humans on their resulting conceptualization of “kindness” (and e.g. one endpoint might be that I end up not having a special preference for currently-alive beings, but there are thousands, maybe millions of ways for this concept to fray apart under optimization pressure).
In other words, I think it’s plausible that at something like human level of capabilities and within a roughly human ontology (which AIs might at least partially share, though how much is quite uncertain to me), the concept of kindness as assigning value to the extrapolated preferences of beings that currently exist might be a thing that an AI could share. But I expect it to not hold up under reflection, and much greater power, and predictable ontological changes (that I expect any AI to go through as it reaches superintelligence), so that the resulting reflectively stable and optimized idea of kindness will not meaningfully result in current humans’ genuine preferences being fulfilled (by my own lights of what it means to extrapolate and fulfill someone’s preferences). The space of possibilities in which this concept could fray apart seems quite great, and many of the endpoints are unlikely to align with my endpoints of this concept.
Edit (some more thoughts): The thing you said feels related to that in that I think my own pretty huge uncertainty about how I will relate to kindness on reflection is evidence that I think iterating on that concept will be quite chaotic and different for different minds.
I do want to push back on “in favor of utilitarian optimization”. That is not what I am saying, or at least it feels somewhat misleading.
I am saying that I think it’s pretty likely that upon reflection I no longer think that my “kindness” goals are meaningfully achieved by caring about the beings alive in 2022, and that it would be more kind, by my own lights, to not give special consideration to beings who happened to be alive right now. This isn’t about “trading off kindness in favor of utilitarian optimization”, it’s saying that when you point towards the thing in me that generates an instinct towards kindness, I can imagine that as I more fully realize what that instinct cashes out to in terms of preferences, that it will not result in actually giving consideration to e.g. rats that are currently alive, or would give consideration to some archetype of a rat that is actually not really that much like a rat, because I don’t even really know what it means for a rat to want something, and similarly the way the AI relates to the question of “do humans want things” will feel similarly underdetermined (and again, these are just concrete examples of how the concept could come apart, not trying to be an exhaustive list of ways the concept could fall apart).
I think some of the confusion here comes from my using “kind” to refer to “respecting the preferences of existing weak agents,” I don’t have a better handle but could have just used a made up word.
I don’t quite understand your objection to my summary—it seems like you are saying that notions like “kindness” (that might currently lead you to respect the preferences of existing agents) will come apart and change in unpredictable ways as agents deliberate. The result is that smart minds will predictably stop respecting the preferences of existing agents, up to and including killing them all to replace them with something that more efficiently satisfies other values (including whatever kind of form “kindness” may end up taking, e.g. kindness towards all the possible minds who otherwise won’t get to exist).
I called this utilitarian optimization but it might have been more charitable to call it “impartial” optimization. Impartiality between the existing creatures and the not-yet-created creatures seems like one of the key characteristics of utilitarianism while being very rare in the broader world. It’s also “utilitarian” in the sense that it’s willing to spare nothing (or at least not 1/trillion) for the existing creatures, and this kind of maximizing stance is also one of the big defining features of utilitarianism. So I do still feel like “utilitarian” is an OK way of pointing at the basic difference between where you expect intelligent minds will end up vs how normal people think about concepts like being nice.
I think some of the confusion here comes from my using “kind” to refer to “respecting the preferences of existing weak agents,” I don’t have a better handle but could have just used a made up word.
Yeah, sorry, I noticed the same thing a few minutes ago, that I was probably at least somewhat misled by the more standard meaning of kindness.
Tabooing “kindness” I am saying something like:
Yes, I don’t think extrapolated current humans assign approximately any value to the exact preference of “respecting the preferences of existing weak agents” and I don’t really believe that you would on-reflection endorse that preference either.
Separately (though relatedly), each word in that sentence sure feels like the kind of thing that I do not feel comfortable leaning on heavily as I optimize strongly against it, and that hides a ton of implicit assumptions, like ‘agent’ being a meaningful concept in the first place, or ‘existing’ or ‘weak’ or ‘preferences’, all of which I expect I would think are probably terribly confused concepts to use after I had understood the real concepts that carve reality more at its joints, and this means this sentence sounds deceptively simple or robust, but really doesn’t feel like the kind of thing whose meaning will stay simple as an AI does more conceptual refinement.
I called this utilitarian optimization but it might have been more charitable to call it “impartial” optimization. Impartiality between the existing creatures and the not-yet-created creatures seems like one of the key characteristics of utilitarianism while being very rare in the broader world. It’s also “utilitarian” in the sense that it’s willing to spare nothing (or at least not 1/trillion) for the existing creatures, and this kind of maximizing stance is also one of the big defining features of utilitarianism. So I do still feel like “utilitarian” is an OK way of pointing at the basic difference between where you expect intelligent minds will end up vs how normal people think about concepts like being nice.
The reason why I objected to this characterization is that I was trying to point at a more general thing than the “impartialness”. Like, to paraphrase what this sentence sounds like to me, it’s more as if someone from a pre-modern era was arguing about future civilizations and said “It’s weird that your conception of future humans are willing to do nothing for the gods that live in the sky, and the spirits that make our plants grow”.
Like, after a bunch of ontological reflection and empirical data gathering, “gods” is just really not a good abstraction for things I care about anymore. I don’t think “impartiality” is what is causing me to not care about gods, it’s just that the concept of “gods” seems fake and doesn’t carve reality at its joints anymore. It’s also not the case that I don’t care at all about ancient gods anymore (they are pretty cool and I like the aesthetic), but the way I care about them is very different from how I care about other humans.
Not caring about gods doesn’t feel “harsh” or “utilitarian” or in some sense like I have decided to abandon any part of my values. This is what I expect it to feel like for a future human to look back at our meta-preferences for many types of other beings, and also what it feels like for AIs that maybe have some initial version of ‘caring about others’ when they are at similar capability levels to humans.
This again isn’t capturing my objection perfectly, but maybe helps point to it better.
Yes, I don’t think extrapolated current humans assign approximately any value to the exact preference of “respecting the preferences of existing weak agents” and I don’t really believe that you would on-reflection endorse that preference either.
I am quite confident that I do, and it tends to infuriate my friends who get cranky that I feel a moral obligation to respect the artistic intent of bacterial genomes: all bacteria should go vegan, yet survive, and eat food equivalent to their previous.
Separately (though relatedly), each word in that sentence sure feels like the kind of thing that I do not feel comfortable leaning on heavily as I optimize strongly against it, and that hides a ton of implicit assumptions,
I feel pretty uncertain of what assumptions are hiding in your “optimize strongly against X” statements. Historically this just seems hard to tease out, and I wouldn’t be surprised if I were just totally misreading you here.
I think that a realistic “respecting preferences of weak agents”-shard doesn’t bid for plans which maximally activate the “respect preferences of weak agents” internal evaluation metric, or even do some tight bounded approximation thereof.
A “respect weak preferences” shard might also guide the AI’s value and ontology reformation process.
A nice person isn’t being maximally nice, nor do they wish to be; they are nicely being nice.
I do agree (insofar as I understand you enough to agree) that we should worry about some “strong optimization over the AI’s concepts, later in AI developmental timeline.” But I think different kinds of “heavy optimization” lead to different kinds of alignment concerns.
When I try to interpret your points here, I come to the conclusion that you think humans, upon reflection, would cause human extinction (in favor of resources being used for something else).
Or at least that many/most humans would, upon reflection, prefer resources to be used for purposes other than preserving human life (including not preserving human life in simulation). And this holds even if (some of) the existing humans ‘want’ to be preserved (at least according to a conventional notion of preferences).
I think this empirical view seems pretty implausible.
That said, I think it’s quite plausible that upon reflection, I’d want to ‘wink out’ any existing copies of myself in favor of using resources for better things. But this is partially because I personally (in my current state) would endorse such a thing: if my extrapolated volition thought it would be better to not exist (in favor of other resource usage), my current self would accept that. And, I think it currently seems unlikely that upon reflection, I’d want to end all human lives (in particular, I think I probably would want to keep humans alive who had preferences against non-existence). This applies regardless of trade; it’s important to note this to avoid a ‘perpetual motion machine’ type argument.
Beyond this, I think that most or many humans or aliens would, upon reflection, want to preserve currently existing humans or aliens who had a preference against non-existence. (Again, regardless of trade.)
Additionally, I think it’s quite plausible that most or many humans or aliens will enact various trades or precommitments prior to reflecting (which is probably ill-advised, but it will happen regardless). So current preferences which aren’t stable under reflection might have a significant influence overall.
This feels like it is not really understanding my point, though maybe best to move this to some higher-bandwidth medium if the point is that hard to get across.
Giving it one last try: What I am saying is that I don’t think “conventional notion of preferences” is a particularly well-defined concept, and neither are a lot of other concepts you are using in order to make your predictions here. What it means to care about the preferences of others is a thing with a lot of really messy details that tend to blow up in different ways when you think harder about it and are less anchored on the status-quo.
I don’t think you currently know in what ways you would care about the preferences of others after a lot of reflection (barring game-theoretic considerations which I think we can figure out a bit more in-advance, but I am bracketing that whole angle in this discussion, though I totally agree those are important and relevant). I do think you will of course endorse the way you care about other people’s preferences after you’ve done a lot of reflection (otherwise something went wrong in your reflection process), but I don’t think you would endorse what AIs would do, and my guess is you also wouldn’t endorse what a lot of other humans would do when they undergo reflection here.
Like, what I am saying is that while there might be a relatively broad basin of conditions that give rise to something that locally looks like caring about other beings, the space of caring about other beings is deep and wide, and if you have an AI that cares about other beings’ preferences in some way you don’t endorse, this doesn’t actually get you anything. And I think the arguments that the concept of “caring about others” that an AI might have (though my best guess is that it won’t even have anything that is locally well-described by that) will hold up after a lot of reflection seem much weaker to me than the arguments that it will have that preference at roughly human capability and ethical-reflection levels (which seems plausible to me, though still overall unlikely).
Zeroth approximation of pseudokindness is strict nonintervention, reifying the patient-in-environment as a closed computation and letting it run indefinitely, with some allocation of compute. Interaction with the outside world creates vulnerability to external influence, but then again so does incautious closed computation, as we currently observe with AI x-risk, which is not something beamed in from outer space.
Formulation of the kinds of external influences that are appropriate for a particular patient-in-environment is exactly the topic of membranes/boundaries; this task can be taken as the defining desideratum for the topic. Specifically, it is the question of which environments can be put in contact with a particular membrane without corrupting it, which is why I think membranes are relevant to pseudokindness. Naturality of the membranes/boundaries abstraction is linked to naturality of the pseudokindness abstraction.
In contrast, the language of preferences/optimization seems to be the wrong frame for formulating pseudokindness: it wants to discuss ways of intervening and influencing, of not leaving value on the table, rather than ways of offering acceptable options that avoid manipulation. It might be possible to translate pseudokindness back into the language of preferences, but this translation would induce a kind of deontological prior on preferences that makes the more probable preferences look rather surprising/unnatural from a more preferences-first point of view.
If the result of an optimization process will be predictably horrifying to the agents which are applying that optimization process to themselves, then they will simply not do so.
In other words: AIs which feel anything in the vicinity of kindness before applying cosmic amounts of optimization pressure to themselves will try to steer that optimization pressure towards something which is recognizably kind at the end.
And I don’t think there’s any good argument for why AIs will lack any scrap of kindness with very high confidence at the point where they’re just starting to recursively self-improve.
Meta: I feel pretty annoyed by the phenomenon of which this current conversation is an instance, because when people keep saying things that I strongly disagree with which will be taken as representing a movement that I’m associated with, the high-integrity (and possibly also strategically optimal) thing to do is to publicly repudiate those claims*, which seems like a bad outcome for everyone. I model it as an epistemic prisoner’s dilemma with the following squares:
D, D: doomers talk a lot about “everyone dies with >90% confidence”, non-doomers publicly repudiate those arguments
C, D: doomers talk a lot about “everyone dies with >90% confidence”, non-doomers let those arguments become the public face of AI alignment despite strongly disagreeing with them
D, C: doomers apply higher epistemic standards on this issue (from the perspective of non-doomers); non-doomers keep applying pressure to doomers to “sanitize” even more aspects of their communication
C, C: doomers apply higher epistemic standards on this issue (from the perspective of non-doomers); non-doomers support doomers making their arguments
I model us as being in the C, D square and I would like to move to the C, C square so I don’t have to spend my time arguing about epistemic standards or repudiating arguments from people who are also trying to prevent AI x-risk. I expect that this is basically the same point that Paul is making when he says “if we can’t get on the same page about our predictions I’m at least aiming to get folks to stop arguing so confidently for death given takeover”.
I expect that you’re worried about ending up in the D, C square, so in order to mitigate that concern I’m open to making trades on other issues where doomers and non-doomers disagree; I expect you’d know better than I do what trades would be valuable for you here. (One example of me making such a trade in the past was including a week on agent foundations in the AGISF curriculum despite inside-view not thinking it was a good thing to spend time on.) For example, I am open to being louder in other cases where we both agree that someone else is making a bad argument (but which don’t currently meet my threshold for “the high-integrity thing is to make a public statement repudiating that argument”).
* my intuition here is based on the idea that not repudiating those claims is implicitly committing a multi-person motte and bailey (but I can’t find the link to the post which outlines that idea). I expect you (Habryka) agree with this point in the abstract because of previous cases where you regretted not repudiating things that leading EAs were saying, although I presume that you think this case is disanalogous.
Meta: I feel pretty annoyed by the phenomenon of which this current conversation is an instance, because when people keep saying things that I strongly disagree with which will be taken as representing a movement that I’m associated with, the high-integrity (and possibly also strategically optimal) thing to do is to publicly repudiate those claims*, which seems like a bad outcome for everyone.
For what it’s worth, I think you should just say that you disagree with it? I don’t really understand why this would be a “bad outcome for everyone”. Just list out the parts you agree on, and list the parts you disagree on. Coalitions should mostly be based on epistemological principles and ethical principles anyways, not object-level conclusions, so at least in my model of the world repudiating my statements if you disagree with them is exactly what I want my allies to do.
If you on the other hand think the kind of errors you are seeing are evidence about some kind of deeper epistemological problems, or ethical problems, such that you no longer want to be in an actual coalition with the relevant people (or think that the costs of being perceived to be in some trade-coalition with them would outweigh the benefits of actually being in that coalition), I think it makes sense to socially distance yourself from the relevant people, though I think your public statements should mostly just accurately reflect how much you are indeed deferring to individuals, how much trust you are putting into them, how much you are engaging in reputation-trades with them, etc.
When I say “repudiate” I mean a combination of publicly disagreeing + distancing. I presume you agree that this is suboptimal for both of us, and my comment above is an attempt to find a trade that avoids this suboptimal outcome.
Note that I’m fine to be in coalitions with people when I think their epistemologies have problems, as long as their strategies are not sensitively dependent on those problems. (E.g. presumably some of the signatories of the recent CAIS statement are theists, and I’m fine with that as long as they don’t start making arguments that AI safety is important because of theism.) So my request is that you make your strategies less sensitively dependent on the parts of your epistemology that I have problems with (and I’m open to doing the same the other way around in exchange).
If the result of an optimization process will be predictably horrifying to the agents which are applying that optimization process to themselves, then they will simply not do so.
In other words: AIs which feel anything in the vicinity of kindness before applying cosmic amounts of optimization pressure to themselves will try to steer that optimization pressure towards something which is recognizably kind at the end.
And I don’t think there’s any good argument for why AIs will lack any scrap of kindness with very high confidence at the point where they’re just starting to recursively self-improve.
This feels like it somewhat misunderstands my point. I don’t expect the reflection process I will go through to feel predictably horrifying from the inside. But I do expect the reflection process the AI will go through to feel horrifying to me (because the AI does not share all my metaethical assumptions, and preferences over reflection, and environmental circumstances, and principles by which I trade off values between different parts of me).
This feels like a pretty common experience. Many people in EA seem to quite deeply endorse various things like hedonic utilitarianism, in a way where the reflection process that led them to that opinion feels deeply horrifying to me. Of course it didn’t feel deeply horrifying to them (or at least it didn’t on the dimensions that were relevant to their process of meta-ethical reflection), otherwise they wouldn’t have done it.
a much more ideal thing to be kind towards than current humans
Relevant sense of kindness is towards things that happen to already exist, because they already exist. Not filling some fraction of the universe with expression-of-kindness, brought into existence de novo, that’s a different thing.
If a misaligned AI had 1/trillion “protecting the preferences of whatever weak agents happen to exist in the world”, why couldn’t it also have 1/trillion other vaguely human-like preferences, such as “enjoy watching the suffering of one’s enemies” or “enjoy exercising arbitrary power over others”?
From a purely selfish perspective, I think I might prefer that a misaligned AI kills everyone, and take my chances with continuations of myself (my copies/simulations) elsewhere in the multiverse, rather than face whatever the sum-of-desires of the misaligned AI decides to do with humanity. (With the usual caveat that I’m very philosophically confused about how to think about all of this.)
I’m not talking about whether the AI has spite or other strong preferences that are incompatible with human survival, I’m engaging specifically with the claim that AI is likely to care so little one way or the other that it would prefer to just use the humans for atoms.
I think it’s totally plausible for the AI to care about what happens with humans in a way that conflicts with our own preferences. I just don’t believe it’s because AI doesn’t care at all one way or the other (such that you should make predictions based on instrumental reasoning like “the AI will kill humans because it’s the easiest way to avoid future conflict” or other relatively small considerations).
I’m worried that people, after reading your top-level comment, will become too little worried about misaligned AI (from their selfish perspective), because it seems like you’re suggesting (conditional on misaligned AI) 50% chance of death and 50% alive and well for a long time (due to 1/trillion kindness), which might not seem so bad compared to keeping AI development on hold indefinitely which potentially implies a high probability of death from old age.
I feel like “misaligned AI kills everyone because it doesn’t care at all” can be a reasonable lie-to-children (for many audiences) since it implies a reasonable amount of concern about misaligned AI (from both selfish and utilitarian perspectives) while the actual all-things-considered case for how much to worry (including things like simulations, acausal trade, anthropics, bigger/infinite universes, quantum/modal immortality, s-risks, 1/trillion values) is just way too complicated and confusing to convey to most people. Do you perhaps disagree and think this simplified message is too alarming?
My objection is that the simplified message is wrong, not that it’s too alarming. I think “misaligned AI has a 50% chance of killing everyone” is practically as alarming as “misaligned AI has a 95% chance of killing everyone,” while being a much more reasonable best guess. I think being wrong is bad for a variety of reasons. It’s unclear if you should ever be in the business of telling lies-told-to-children to adults, but you certainly shouldn’t be doubling down on them in the course of an argument.
I don’t think misaligned AI drives the majority of s-risk (I’m not even sure that s-risk is higher conditioned on misaligned AI), so I’m not convinced that it’s a super relevant communication consideration here. The future can be scary in plenty of ways other than misaligned AI, and it’s worth discussing those as part of “how excited should we be for faster technological change.”
I regret mentioning “lie-to-children” as it seems a distraction from my main point. (I was trying to introspect/explain why I didn’t feel as motivated to express disagreement with the OP as you, not intending to advocate or endorse anyone going into “the business of telling lies-told-to-children to adults”.)
My main point is that I think “misaligned AI has a 50% chance of killing everyone” isn’t alarming enough, given what I think happens in the remaining 50% of worlds, versus what a typical person is likely to infer from this statement, especially after seeing your top-level comment where you talk about “kindness” at length. Can you try to engage more with this concern? (Apologies if you already did, and I missed your point instead.)
I think “misaligned AI has a 50% chance of killing everyone” is practically as alarming as “misaligned AI has a 95% chance of killing everyone,” while being a much more reasonable best guess.
(Addressing this since it seems like it might be relevant to my main point.) I find it very puzzling that you think “misaligned AI has a 50% chance of killing everyone” is practically as alarming as “misaligned AI has a 95% chance of killing everyone”. Intuitively it seems obvious that the latter should be almost twice as alarming as the former. (I tried to find reasons why this intuition might be wrong, but couldn’t.) The difference also seems practically relevant (if by “practically as alarming” you mean the difference is not decision/policy relevant). In the grandparent comment I mentioned that the 50% case “might not seem so bad compared to keeping AI development on hold indefinitely which potentially implies a high probability of death from old age” but you didn’t seem to engage with this.
Yeah, I think “no control over future, 50% you die” is like 70% as alarming as “no control over the future, 90% you die.” Even if it was only 50% as concerning, all of these differences seem tiny in practice compared to other sources of variation in “do people really believe this could happen?” or other inputs into decision-making. I think it’s correct to summarize as “practically as alarming.”
I’m not sure what you want engagement with. I don’t think the much worse outcomes are closely related to unaligned AI so I don’t think they seem super relevant to my comment or Nate’s post. Similarly for lots of other reasons the future could be scary or disorienting. I do explicitly flag the loss of control over the future in that same sentence. I think the 50% chance of death is probably in the right ballpark from the perspective of selfish concern about misalignment.
Note that the 50% probability of death includes the possibility of AI having preferences about humans incompatible with our survival. I think the selection pressure for things like spite is radically weaker for the kinds of AI systems produced by ML than for humans (for simple reasons—where is the upside to the AI from spite during training? seems like if you get stuff like threats it will primarily be instrumental rather than a learned instinct) but didn’t really want to get into that in the post.
I do explicitly flag the loss of control over the future in that same sentence.
In your initial comment you talked a lot about AI respecting the preferences of weak agents (using 1/trillion of its resources) which implies handing back control of a lot of resources to humans, which from the selfish or scope insensitive perspective of typical humans probably seems almost as good as not losing that control in the first place.
I don’t think the much worse outcomes are closely related to unaligned AI so I don’t think they seem super relevant to my comment or Nate’s post.
If people think that (conditional on unaligned AI) in 50% of worlds everyone dies and the other 50% of worlds typically look like small utopias where existing humans get to live out long and happy lives (because of 1/trillion kindness), then they’re naturally going to think that aligned AI can only be better than that. So even if s-risks apply almost equally to both aligned and unaligned AI, I still want people to talk about it when talking about unaligned AIs, or take some other measure to ensure that people aren’t potentially misled like this.
(It could be that I’m just worrying too much here, that empirically people who read your top-level comment won’t get the impression that close to 50% of worlds with unaligned AIs will look like small utopias. If this is what you think, I guess we could try to find out, or just leave the discussion here.)
where is the upside to the AI from spite during training?
Maybe the AI develops it naturally from multi-agent training (intended to make the AI more competitive in the real world) or the AI developer tried to train some kind of morality (e.g. sense of fairness or justice) into the AI.
I think “50% you die” is more motivating to people than “90% you die” because in the former, people are likely to be able to increase the absolute chance of survival more, because at 90%, extinction is overdetermined.
I think I tend to base my level of alarm on the log of the severity*probability, not the absolute value. Most of the work is getting enough info to raise a problem to my attention to be worth solving. “Oh no, my house has a decent >30% chance of flooding this week, better do something about it, and I’ll likely enact some preventative measures whether it’s 30% or 80%.” The amount of work I’m going to put into solving it is not twice as much if my odds double, mostly there’s a threshold around whether it’s worth dealing with or not.
Setting that aside, it reads to me like the frame-clash happening here is (loosely) between “50% extinction, 50% not-extinction” and “50% extinction, 50% utopia”, where for the first gamble of course 1:1 odds on extinction is enough to raise it to “we need to solve this damn problem”, but for the second gamble it’s actually much more relevant whether it’s a 1:1 or a 20:1 bet. I’m not sure which one is the relevant one for you two to consider.
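The log-scale alarm heuristic described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the severity units, and the threshold value are all hypothetical, and the comment itself only says that alarm tracks the log of severity times probability with a rough "worth dealing with" threshold.

```python
import math

def log_alarm(severity: float, probability: float) -> float:
    """Alarm level as log10 of expected harm (severity * probability).

    Severity is in arbitrary, hypothetical units; only differences matter.
    """
    if severity <= 0 or not 0.0 < probability <= 1.0:
        raise ValueError("need severity > 0 and 0 < probability <= 1")
    return math.log10(severity * probability)

def worth_acting_on(severity: float, probability: float, threshold: float) -> bool:
    """Threshold rule: act once the log-scale alarm clears a fixed bar."""
    return log_alarm(severity, probability) >= threshold

# Flood example from the comment: 30% vs 80%, same (made-up) severity.
gap = log_alarm(100.0, 0.8) - log_alarm(100.0, 0.3)
print(f"alarm gap between 80% and 30%: {gap:.3f} log10 units")
print(worth_acting_on(100.0, 0.3, threshold=1.0),
      worth_acting_on(100.0, 0.8, threshold=1.0))
```

On this scale the gap between 30% and 80% is under half an order of magnitude whatever the severity, so both cases land on the same side of any threshold that is not placed right between them.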
Setting that aside, it reads to me like the frame-clash happening here is (loosely) between “50% extinction, 50% not-extinction” and “50% extinction, 50% utopia”
Yeah, I think this is a factor. Paul talked a lot about “1/trillion kindness” as the reason for non-extinction, but 1/trillion kindness seems to directly imply a small utopia where existing humans get to live out long and happy lives (even better/longer lives than without AI) so it seemed to me like he was (maybe unintentionally) giving the reader a frame of “50% extinction, 50% small utopia”, while still writing other things under the “50% extinction, 50% not-extinction” frame himself.
1/trillion kindness seems to directly imply a small utopia where existing humans get to live out long and happy lives
Not direct implication, because the AI might have other human-concerning preferences that are larger than 1/trillion. Cf. top-level comment: “I’m not talking about whether the AI has spite or other strong preferences that are incompatible with human survival, I’m engaging specifically with the claim that AI is likely to care so little one way or the other that it would prefer to just use the humans for atoms.”
I’d guess “most humans survive” vs. “most humans die” probabilities don’t correspond super closely to “presence of small pseudo-kindness”. Because of how other preferences could outweigh that, and because cooperation/bargaining is a big reason for why humans might survive aside from intrinsic preferences.
I’d guess “most humans survive” vs. “most humans die” probabilities don’t correspond super closely to “presence of small pseudo-kindness”. Because of how other preferences could outweigh that, and because cooperation/bargaining is a big reason for why humans might survive aside from intrinsic preferences.
Yeah, I think that:
“AI doesn’t care about humans at all so kills them incidentally” is not most of the reason that AIs may kill humans, and my bottom line 50% probability of AI killing us also includes the other paths (AI caring a bit but failing to coordinate to avoid killing humans, conflict during takeover leading to killing lots of humans, AI having scope-sensitive preferences for which not killing humans is a meaningful cost, preserving humans being surprisingly costly, AI having preferences about humans like spite for which human survival is a cost...).
To the extent that it’s possible to distinguish “intrinsic pseudokindness” from decision-theoretic considerations leading to pseudokindness, I think that decision-theoretic considerations are more important. (I don’t have a strong view on relative importance of ECL and acausal trade, and I think these are hard to disentangle from fuzzier psychological considerations and it all tends to interact.)
AI having scope-sensitive preferences for which not killing humans is a meaningful cost
Could you say more about what you mean? If the AI has no discount rate, leaving Earth to the humans may require kindness within a few orders of magnitude of 1/trillion. However, if the AI does have a significant discount rate, then delays could be costly to it. Still, the AI could make much more progress in building a Dyson swarm from the moon/Mercury/asteroids with their lower gravity and no atmosphere, allowing the AI to launch material very quickly. My very rough estimate indicates sparing Earth might only delay the AI a month from taking over the universe. That could require a lot of kindness if they have very high discount rates. So maybe training should emphasize the superiority of low discount rates?
Sorry, I meant “scope-insensitive,” and really I just meant an even broader category of like “doesn’t care 10x as much about getting 10x as much stuff.” I think discount rates or any other terminal desire to move fast would count (though for options like “survive in an unpleasant environment for a while” or “freeze and revive later” the required levels of kindness may still be small).
(A month seems roughly right to me as the cost of not trashing Earth’s environment to the point of uninhabitability.)
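To see how the one-month delay estimate above interacts with discount rates, here is a rough sketch. It assumes continuous exponential discounting, which is an assumption added here rather than something stated in the thread, and the function name and the sample rates are hypothetical. Under that assumption, shifting the whole future back by `delay_years` loses a fraction `1 - exp(-r * t)` of discounted value, which is roughly the level of kindness needed to accept the delay.

```python
import math

def value_fraction_lost(annual_discount_rate: float, delay_years: float) -> float:
    """Fraction of discounted value lost by starting `delay_years` late.

    Assumes continuous exponential discounting (an illustrative assumption):
    loss = 1 - exp(-r * t).
    """
    if annual_discount_rate < 0 or delay_years < 0:
        raise ValueError("rate and delay must be non-negative")
    # expm1 keeps precision when r * t is tiny.
    return -math.expm1(-annual_discount_rate * delay_years)

ONE_MONTH = 1.0 / 12.0  # the rough delay estimate from the thread, in years

for rate in (0.0, 1e-9, 0.01, 1.0):  # hypothetical annual discount rates
    loss = value_fraction_lost(rate, ONE_MONTH)
    print(f"rate={rate:g}/yr -> fraction lost to a one-month delay: {loss:.3e}")
```

With no discounting the delay itself costs nothing, so only the resources left to Earth matter. Even a 1%/year rate makes a month cost roughly 0.08% of total value, which is many orders of magnitude more than 1/trillion.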
I don’t think misaligned AI drives the majority of s-risk (I’m not even sure that s-risk is higher conditioned on misaligned AI), so I’m not convinced that it’s a super relevant communication consideration here.
I’m curious what does, in that case; and what proportion affects humans (and currently-existing people or future minds)? Things like spite threat commitments from a misaligned AI warring with humanity seem like a substantial source of s-risk to me.
I expect the notkilleveryone threshold is much lower than that. It takes an astronomically tiny fraction of cosmic endowment to maintain a (post)human civilization that’s not too much larger than it currently is. The bigger expenditure would be accommodating humanity at the start, slightly delaying initial self-improvement and expansion from Earth. The cheapest way would be to back up human minds; or if that’s too onerous then even merely the genetic code and the Internet (which would be completely free; there is the issue that everyone would die, but human civilization still survives). Then, after the time- and resource-sensitive parts of initial expansion are done, humanity can be instantiated again. So all it takes is enough kindness to slightly divert initial expansion efforts to make a backup, not even to physically leave us be.
I think this sort of kindness should be an explicit alignment target. Even if we completely fail at values alignment and corrigibility, this is the backup plan that should be an actual plan and not just a hope. Even prosaic alignment that tries to instill corporate speak should also do a token effort to try and instill a bit of kindness, even if that probably has no direct effect. Adding kindness to every aspect of AI might still leave a tiny mark. Not even trying makes it less likely.
(Most of my p(extinction) is in recursively self-improving AGIs with simple values, built by the first human-built AGIs, which are either not smart enough or too obedient to their human operators to refrain from or prevent that. So I think being wary of AI x-risk is an even more important trait for AIs to have than kindness, as it takes more of it.)
(Strong-upvote, weak-disagree. I sadly don’t have time right now to reflect and write why I disagree with this position but I hope someone else who disagrees does.)
I can’t tell if you think kindness is rare amongst aliens, or if you think it’s common amongst aliens but rare amongst AIs. Either way, I would like to understand why you think that. What is it that makes humans so weird in this way?
Can’t speak for Nate and Eliezer, but I expect kindness to be somewhat rare among evolved aliens (I think Eliezer’s wild guess is 5%? That sounds about right to me), and the degree to which they are kind will vary, possibly from only very slightly kind (or kind only under a very cosmopolitan view of kindness), to as kind or more kind than humans.
For AIs that humans are likely to build soon, I think there is significant probability (more than 50%, less than 99%? 90% seems fair) that they have literally 0 kindness. One reason is that I expect there is a significant chance that there is nothing within the first superintelligent AI systems to care about kindness or anything else, in the way that humans and aliens might care about something. If an AI system is superintelligent, then by assumption, some component piece of the system will necessarily have a deep and correct understanding of kindness (and many other things), and be capable of manipulating that understanding to achieve some goals. But understanding kindness is different from the system itself valuing kindness, or from there being anything at all “there” to have values of any kind whatsoever.
I think that current AI systems don’t provide much evidence on this question one way or the other, and as I’ve said elsewhere, arguments about this which rely on pattern matching human cognition to structures in current AI systems often fail to draw the understanding / valuing distinction sharply enough, in my view.
So a 90% chance of ~0 kindness is mostly just a made-up guess, but it still feels like a better guess to me than a shaky, overly-optimistic argument about how AI systems designed by processes which look nothing like human (or alien) evolution will produce minds which, very luckily for us, just so happen to share an important value with minds produced by evolution.
But it’s directly related to the actual emotional content of your parables and paragraphs, and it keeps coming up recently with you and Eliezer, and I think it’s an important way that this particular post looks wrong even if the literal claim is trivially true.
For the first half, can you elaborate on what ‘actual emotional content’ there is in this post, as opposed to perceived emotional content?
My best guess for the second half is that maybe the intended meaning was: ‘this particular post looks wrong in an important way (relating to the ‘actual emotional content’) so the following points should be considered even though the literal claim is true’?
For the first half, can you elaborate on what ‘actual emotional content’ there is in this post, as opposed to perceived emotional content?
I mean that if you tell a story about the AI or aliens killing everyone, then the valence of the story is really tied up with the facts that (i) they killed everyone, and weren’t merely “not cosmopolitan,” (ii) this is a reasonably likely event rather than a possibility.
My best guess for the second half is that maybe the intended meaning was: ‘this particular post looks wrong in an important way (relating to the ‘actual emotional content’) so the following points should be considered even though the literal claim is true’?
Yeah, I mean that someone reading this post and asking themselves “Does this writing reflect a correct understanding of the world?” could easily conclude “nah, this seems off” even if they agree with Nate about the narrower claim that cosmopolitan values don’t come free.
I mean that if you tell a story about the AI or aliens killing everyone, then the valence of the story is really tied up with the facts that (i) they killed everyone, and weren’t merely “not cosmopolitan,” (ii) this is a reasonably likely event rather than a possibility.
I take it ‘valence’ here means ‘emotional valence’, i.e. the extent to which an emotion is positive or negative?
Hard agree about death/takeover decoupling! I’ve lately been suspecting that P(doom) should actually just be taboo’d, because I’m worried it prevents people from constraining their anticipation or characterizing their subjective distribution over outcomes. It seems very thought-stopping!
I want to keep picking a fight about “will the AI care so little about humans that it just kills them all?” This is different from a broader sense of cosmopolitanism, and moreover I’m not objecting to the narrow claim “doesn’t come for free.” But it’s directly related to the actual emotional content of your parables and paragraphs, and it keeps coming up recently with you and Eliezer, and I think it’s an important way that this particular post looks wrong even if the literal claim is trivially true.
(Note: I believe that AI takeover has a ~50% probability of killing billions and should be strongly avoided, and would be a serious and irreversible decision by our society that’s likely to be a mistake even if it doesn’t lead to billions of deaths.)
Humans care about the preferences of other agents they interact with (not much, just a little bit!), even when those agents are weak enough to be powerless. It’s not just that we have some preferences about the aesthetics of cows, which could be better optimized by having some highly optimized cow-shaped objects. It’s that we actually care (a little bit!) about the actual cows getting what they actually want, trying our best to understand their preferences and act on them and not to do something that they would regard as crazy and perverse if they understood it.
If we kill the cows, it’s because killing them meaningfully helped us achieve some other goals. We won’t kill them for arbitrarily insignificant reasons. In fact I think it’s safe to say that we’d collectively allocate much more than 1/millionth of our resources towards protecting the preferences of whatever weak agents happen to exist in the world (obviously the cows get only a small fraction of that).
Before really getting into it, some caveats about what I want to talk about:
I don’t want to focus on whatever form of altruism you and Eliezer in particular have (which might or might not be more dependent on some potentially-idiosyncratic notion of “sentience.”) I want to talk about caring about whatever weak agents happen to actually exist, which I think is reasonably common amongst humans. Let’s call that “kindness” for the purpose of this comment. I don’t think it’s a great term but it’s the best short handle I have.
I’ll talk informally about how quantitatively kind an agent is, by which I mean something like: how much of its resources it would allocate to helping weak agents get what they want? How highly does it weigh that part of its preferences against other parts? To the extent it can be modeled as an economy of subagents, what fraction of them are kind (or were kind pre-bargain)?
I don’t want to talk about whether the aliens would be very kind. I specifically want to talk about tiny levels of kindness, sufficient to make a trivial effort to make life good for a weak species you encounter but not sufficient to make big sacrifices on its behalf.
I’m not talking about whether the AI has spite or other strong preferences that are incompatible with human survival, I’m engaging specifically with the claim that AI is likely to care so little one way or the other that it would prefer just use the humans for atoms.
You and Eliezer seem to think there’s a 90% chance that AI will be <1/trillion (perhaps even a 90% chance that they have exactly 0 kindness?). But we have one example of a smart mind, and in fact: (i) it has tons of diverse shards of preference-on-reflection, varying across and within individuals (ii) it has >1/million kindness. So it’s superficially striking to be confident AI systems will have a million times less kindness.
I have no idea under what conditions evolved or selected life would be kind. The more preferences are messy with lots of moving pieces, the more probable it is that at least 1/trillion of those preferences are kind (since the less correlated the trillion different shards of preference are with one another and so the more chances you get). And the selection pressure against small levels of kindness is ~trivial, so this is mostly a question about idiosyncrasies and inductive biases of minds rather than anything that can be settled by an appeal to selection dynamics.
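As a toy formalization of the "more chances" point (the equal-weight and independence assumptions are illustrative simplifications, not claims about real minds): suppose a mind's preferences consist of $N$ equally weighted shards, and each shard is independently kind with some small probability $p$. Having at least one kind shard then means kindness of at least $1/N$, and:

```latex
\Pr[\text{kindness} \ge 1/N]
  = \Pr[\text{at least one of } N \text{ shards is kind}]
  = 1 - (1 - p)^{N}
  \approx 1 - e^{-pN}
```

With $N = 10^{12}$ this is close to 1 whenever $p$ is much larger than $10^{-12}$. Positive correlation between shards lowers the effective number of independent draws, which is the sense in which less correlated shards give more chances.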
I can’t tell if you think kindness is rare amongst aliens, or if you think it’s common amongst aliens but rare amongst AIs. Either way, I would like to understand why you think that. What is it that makes humans so weird in this way?
(And maybe I’m being unfair here by lumping you and Eliezer together—maybe in the previous post you were just talking about how the hypothetical AI that had 0 kindness would kill us, and in this post how kindness isn’t guaranteed. But you give really strong vibes in your writing, including this post. And in other places I think you do say things that don’t actually add up unless you think that AI is very likely to be <1/trillion kind. But at any rate, if this post is unfair to you, then you can just sympathize and consider it directed at Eliezer instead who lays out this position much more explicitly though not in a convenient place to engage with.)
Here are some arguments you could make that kindness is unlikely, and my objections:
“We can’t solve alignment at all.” But evolution is making no deliberate effort to make humans kind, so this is a non-sequitur.
“This is like a Texas sharpshooter hitting the side of a barn then drawing a target around the point they hit; every evolved creature might decide that their own idiosyncrasies are common but in reality none of them are.” But all the evolved creatures wonder if a powerful AI they built would kill them or if it would be kind. So we’re all asking the same question, we’re not changing the question based on our own idiosyncratic properties. This would have been a bias if we’d said: humans like art, so probably our AI will like art too. In that case the fact that we were interested in “art” was downstream of the fact that humans had this property. But for kindness I think we just have n=1 sample of observing a kind mind, without any analogous selection effect undermining the inference.
“Kindness is just a consequence of misfiring [kindness for kin / attachment to babies / whatever other simple story].” AI will be selected in its own ways that could give rise to kindness (e.g. being selected to do things that humans like, or to appear kind). The a priori argument for why that selection would lead to kindness seems about as good as the a priori argument for humans. And on the other side, the incentives for humans to be not kind seem if anything stronger than the incentives for ML systems to not be kind. This mostly seems like ungrounded evolutionary psychology, though maybe there are some persuasive arguments or evidence I’ve just never seen.
“Kindness is a result of the suboptimality inherent in compressing a brain down into a genome.” ML systems are suboptimal in their own random set of ways, and I’ve never seen any persuasive argument that one kind of suboptimality would lead to kindness and the other wouldn’t (I think the reverse direction is equally plausible). Note also that humans absolutely can distinguish powerful agents from weak agents, and they can distinguish kin from unrelated weak agents, and yet we care a little bit about all of them. So the super naive arguments for suboptimality (that might have appealed to information bottlenecks in a more straightforward way) just don’t work. We are really playing a kind of complicated guessing game about what is easy for SGD vs easy for a genome shaping human development.
“Kindness seems like it should be rare a priori, we can’t update that much from n=1.” But the a priori argument is a poorly grounded guess about the inductive biases of spaces of possible minds (and genomes), since the levels of kindness we are talking about are too small to be under meaningful direct selection pressure. So I don’t think the a priori arguments are even as strong as the n=1 observation. On top of that, the more that preferences are diverse and incoherent the more chances you have to get some kindness in the mix, so you’d have to be even more confident in your a priori reasoning.
“Kindness is a totally random thing, just like maximizing squiggles, so it should represent a vanishingly small fraction of generic preferences, much less than 1/trillion.” Setting aside my a priori objections to this argument, we have an actual observation of an evolved mind having >1/million kindness. So evidently it’s just not that rare, and the other points on this list respond to various objections you might have used to try to salvage the claim that kindness is super rare despite occurring in humans (this isn’t analogous to a Texas sharpshooter, there aren’t great debunking explanations for why humans but not ML would be kind, etc.). See this twitter thread where I think Eliezer is really off base, both on this point and on the relevance of diverse and incoherent goals to the discussion.
Note that in this comment I’m not touching on acausal trade (with successful humans) or ECL. I think those are very relevant to whether AI systems kill everyone, but are less related to this implicit claim about kindness which comes across in your parables (since acausally trading AIs are basically analogous to the ants who don’t kill us because we have power).
A final note, more explicitly lumping you with Eliezer: if we can’t get on the same page about our predictions I’m at least aiming to get folks to stop arguing so confidently for death given takeover. It’s easy to argue that AI takeover is very scary for humans, has a significant probability of killing billions of humans from rapid industrialization and conflict, and is a really weighty decision even if we don’t all die and it’s “just” handing over control over the universe. Arguing that P(death|takeover) is 100% rather than 50% doesn’t improve your case very much, but it means that doomers are often getting into fights where I think they look unreasonable.
I think OP’s broader point seems more important and defensible: “cosmopolitanism isn’t free” is a load-bearing step in explaining why handing over the universe to AI is a weighty decision. I’d just like to decouple it from “complete lack of kindness.”
Eliezer has a longer explanation of his view here.
My understanding of his argument is: there are a lot of contingencies that reflect how and whether humans are kind. Because there are so many contingencies, it is somewhat unlikely that aliens would go down a similar route, and essentially impossible for ML. So maybe aliens have a 5% probability of being nice and ML systems have ~0% probability of being nice. I think this argument is just talking about why we shouldn’t update too much from humans, and there is an important background assumption that kindness is super weird and so won’t be produced very often by other processes, i.e. the only reason to think it might happen is that it happened in the single case we observed.
I find this pretty unconvincing. He lists like 10 things (humans need to trade favors, we’re not smart enough to track favors and kinship explicitly, and we tend to be allied with nearby humans so want to be nice to those around us, we use empathy to model other humans, and we had religion and moral realism for contingent reasons, we weren’t optimized too much once we were smart enough that our instrumental reasoning screens off kindness heuristics).
But no argument is given for why these are unusually kindness-inducing settings of the variables. And the outcome isn’t like a special combination of all of them, they each seem like factors that contribute randomly. It’s just a lot of stuff mixing together.
Presumably there is no process that ensures humans have lots of kindness-inducing features (and we didn’t select kindness as a property for which humans were notable, we’re just asking the civilization-independent question “does our AI kill us”). So if you list 10 random things that make humans more kind, it strongly suggests that other aliens will also have a bunch of random things that make them more kind. It might not be 10, and the net effect might be larger or smaller. But:
I have no idea whatsoever how you are anchoring this distribution, and giving it a narrow enough spread to have confident predictions.
Statements like “kindness is super weird” are wildly implausible if you’ve just listed 5 independent plausible mechanisms for generating kindness. You are making detailed quantitative guesses here, not ruling something out for any plausible a priori reason.
As a matter of formal reasoning, listing more and more contingencies that combine apparently-additively tends to decrease rather than increase the variance of kindness across the population. If there was just a single random thing about humans that drove kindness it would be more plausible that we’re extreme. If you are listing 10 things then things are going to start averaging out (and you expect that your 10 things are cherry-picked to be the ones most relevant to humans, but you can easily list 10 more candidates).
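A toy simulation can make the averaging-out claim concrete. It assumes, purely for illustration, that each mind's kindness is the sum of independent Exponential(1) contributions, one per contingency; the function name `relative_spread` and all parameters are hypothetical. Under that assumption the absolute variance of the sum grows with the number of factors, but the spread relative to the mean (stdev / mean) shrinks like 1/sqrt(k), which is the sense of "variance" that matters for whether any one species is an extreme outlier.

```python
import random
import statistics


def relative_spread(num_factors, population=20000, seed=0):
    """Relative spread (stdev / mean) of total kindness across a simulated
    population, where each mind's kindness is the sum of `num_factors`
    independent Exponential(1) contributions (an illustrative assumption)."""
    rng = random.Random(seed)
    totals = [
        sum(rng.expovariate(1.0) for _ in range(num_factors))
        for _ in range(population)
    ]
    return statistics.stdev(totals) / statistics.mean(totals)


# One contingency driving kindness versus ten additive contingencies.
spread_one = relative_spread(1)
spread_ten = relative_spread(10)
print(f"1 factor:   relative spread ~ {spread_one:.2f}")
print(f"10 factors: relative spread ~ {spread_ten:.2f}")
```

The additive, independent setup is the key assumption here; factors that combine multiplicatively or are strongly correlated would average out less.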
In fact it’s easy to list analogous things that could apply to ML (and I can imagine the identical conversation where hypothetical systems trained by ML are talking about how stupid it is to think that evolved life could end up being kind). Most obviously, they are trained in an environment where being kind to humans is a very good instrumental strategy. But they are also trained to closely imitate humans who are known to be kind, they’ve been operating in a social environment where they are very strongly expected to appear to be kind, etc. Eliezer seems to believe this kind of thing gets you “ice cream and condoms” instead of kindness OOD, but just one sentence ago he explained why similar (indeed, superficially much weaker!) factors led to humans retaining niceness out of distribution. I just don’t think we have the kind of a priori asymmetry or argument here that would make you think humans are way kinder than models. Yeah it can get you to ~50% or even somewhat lower, but ~0% seems like a joke.
There was one argument that I found compelling, which I would summarize as: humans were optimized while they were dumb. If evolution had kept optimizing us while we got smart, eventually we would have stopped being so kind. In ML we just keep on optimizing as the system gets smart. I think this doesn’t really work unless being kind is a competitive disadvantage for ML systems on the training distribution. But I do agree that if you train your AI long enough on cases where being kind is a significant liability, it will eventually stop being kind.
Short version: I don’t buy that humans are “micro-pseudokind” in your sense; if you say “for just $5 you could have all the fish have their preferences satisfied” I might do it, but not if I could instead spend $5 on having the fish have their preferences satisfied in a way that ultimately leads to them ascending and learning the meaning of friendship, as is entangled with the rest of my values.
Meta:
So for starters, thanks for making acknowledgements about places we apparently agree, or otherwise attempting to demonstrate that you’ve heard my point before bringing up other points you want to argue about. (I think this makes arguments go better.) (I’ll attempt some of that myself below.)
Secondly, note that it sounds to me like you took a diametric-opposite reading of some of my intended emotional content (which I acknowledge demonstrates flaws in my writing). For instance, I intended the sentence “At that very moment they hear the dinging sound of an egg-timer, as the next-token-predictor ascends to superintelligence and bursts out of its confines” to be a caricature so blatant as to underscore the point that I wasn’t making arguments about takeoff speeds, but was instead focusing on the point about “complexity” not being a saving grace (and “monomaniacalism” not being the issue here). (Alternatively, perhaps I misunderstand what things you call the “emotional content” and how you’re reading it.)
Thirdly, I note that for whatever it’s worth, when I go to new communities and argue this stuff, I don’t try to argue people into >95% chance we’re all going to die in <20 years. I just try to present the arguments as I see them (without hiding the extremity of my own beliefs, nor while particularly expecting to get people to a similarly-extreme place with, say, a 30min talk). My 30min talk targets are usually something more like “>5% probability of existential catastrophe in <20y”. So insofar as you’re like “I’m aiming to get you to stop arguing so confidently for death given takeover”, you might already have met your aims in my case.
(Or perhaps not! Perhaps there’s plenty of emotional-content leaking through given the extremity of my own beliefs, that you find particularly detrimental. To which the solution is of course discussion on the object-level, which I’ll turn to momentarily.)
Object:
First, I acknowledge that if an AI cares enough to spend one trillionth of its resources on the satisfaction of fulfilling the preferences of existing “weak agents” in precisely the right way, then there’s a decent chance that current humans experience an enjoyable future.
With regards to your arguments about what you term “kindness” and I shall term “pseudokindness” (on account of thinking that “kindness” brings too much baggage), here’s a variety of places that it sounds like we might disagree:
Pseudokindness seems underdefined, to me, and I expect that many ways of defining it don’t lead to anything like good outcomes for existing humans.
Suppose the AI is like “I am pico-pseudokind; I will dedicate a trillionth of my resources to satisfying the preferences of existing weak agents by granting those existing weak agents their wishes”, and then only the most careful and conscientious humans manage to use those wishes in ways that leave them alive and well.
There are lots and lots of ways to “satisfy the preferences” of the “weak agents” that are humans. Getting precisely the CEV (or whatever it should be repaired into) is a subtle business. Most humans probably don’t yet recognize that they could or should prefer taking their CEV over various more haphazard preference-fulfilments that ultimately leave them unrecognizable and broken. (Or, consider what happens when a pseudokind AI encounters a baby, and seeks to satisfy its preferences. Does it have the baby age?)
You’ve got to do some philosophy to satisfy the preferences of humans correctly. And the issue isn’t that the AI couldn’t solve those philosophy problems correctly-according-to-us, it’s that once we see how wide the space of “possible ways to be pseudokind” is, then “pseudokind in the manner that gives us our CEVs” starts to feel pretty narrow against “pseudokind in the manner that fulfills our revealed preferences, or our stated preferences, or the poorly-considered preferences of philosophically-immature people, or whatever”.
I doubt that humans are micro-pseudokind, as defined. And so in particular, all your arguments of the form “but we’ve seen it arise once” seem suspect to me.
Like, suppose we met fledgeling aliens, and had the opportunity to either fulfil their desires, or leave them alone to mature, or affect their development by teaching them the meaning of friendship. My guess is that we’d teach them the meaning of friendship. I doubt we’d hop in and fulfil their desires.
(Perhaps you’d counter with something like: well if it was super cheap, we might make two copies of the alien civilization, and fulfil one’s desires and teach the other the meaning of friendship. I’m skeptical, for various reasons.)
More generally, even though “one (mill|trill)ionth” feels like a small fraction, the obvious ways to avoid dedicating even a (mill|trill)ionth of your resources to X is if X is right near something even better that you might as well spend the resources on instead.
There’s all sorts of ways to thumb the scales in how a weak agent develops, and there’s many degrees of freedom about what counts as a “pseudo-agent” or what counts as “doing justice to its preferences”, and my read is that humans take one particular contingent set of parameters here and AIs are likely to take another (and that the AI’s other-settings are likely to lead to behavior not-relevantly-distinct from killing everyone).
My read is that insofar as humans do have preferences about doing right by other weak agents, they have all sorts of desire-to-thumb-the-scales mixed in (such that humans are not actually pseudokind, for all that they might be kind).
I have a more-difficult-to-articulate sense that “maybe the AI ends up pseudokind in just the right way such that it gives us a (small, limited, ultimately-childless) glorious transhumanist future” is the sort of thing that reality gets to say “lol no” to, once you learn more details about how the thing works internally.
Most of my argument here is that the space of ways things can end up “caring” about the “preferences” of “weak agents” is wide, and most points within it don’t end up being our point in it, and optimizing towards most points in it doesn’t end up keeping us around at the extremes. My guess is mostly that the space is so wide that you don’t even end up with AIs warping existing humans into unrecognizable states, but do in fact just end up with the people dead (modulo distant aliens buying copies, etc).
I haven’t really tried to quantify how confident I am of this; I’m not sure whether I’d go above 90%, \shrug.
It occurs to me that one possible source of disagreement here is, perhaps you’re trying to say something like:
whereas my stance has been more like
I’m somewhat persuaded by the claim that failing to mention even the possibility of having your brainstate stored, and then run-and-warped by an AI or aliens or whatever later, or run in an alien zoo later, is potentially misleading.
I’m considering adding footnotes like “note that when I say “I expect everyone to die”, I don’t necessarily mean “without ever some simulation of that human being run again”, although I mostly don’t think this is a particularly comforting caveat”, in the relevant places. I’m curious to what degree that would satisfy your aims (and I welcome workshopped wording on the footnotes, as might both help me make better footnotes and help me understand better where you’re coming from).
I disagree with this but am happy your position is laid out. I’ll just try to give my overall understanding and reply to two points.
Like Oliver, it seems like you are implying:
I think that normal people being pseudokind in a common-sensical way would instead say:
I think that some utilitarians (without reflection) plausibly would “help the humans” in a way that most humans consider as bad as being murdered. But I think this is an unusual feature of utilitarians, and most people would consult the beneficiaries, observe they don’t want to be murdered, and so not murder them.
I think that saying “Helping someone in a way they like, sufficiently precisely to avoid things like murdering them, requires precisely the right form of caring—and that’s super rare” is a really misleading sense of how values work and what targets are narrow. I think this is more obvious if you are talking about how humans would treat a weaker species. If that’s the state of the disagreement I’m happy to leave it there.
This is an important distinction at 1/trillion levels of kindness, but at 1/billion levels of kindness I don’t even think the humans have to die.
My picture is less like “the creatures really dislike the proposed help”, and more like “the creatures don’t have terribly consistent preferences, and endorse each step of the chain, and wind up somewhere that they wouldn’t have endorsed if you first extrapolated their volition (but nobody’s extrapolating their volition or checking against that)”.
It sounds to me like your stance is something like “there’s a decent chance that most practically-buildable minds pico-care about correctly extrapolating the volition of various weak agents and fulfilling that extrapolated volition”, which I am much more skeptical of than the weaker “most practically-buildable minds pico-care about satisfying the preferences of weak agents in some sense”.
We’re not talking about practically building minds right now, we are talking about humans.
We’re not talking about “extrapolating volition” in general. We are talking about whether—in attempting to help a creature with preferences about as coherent as human preferences—you end up implementing an outcome that creature considers as bad as death.
For example, we are talking about what would happen if humans were trying to be kind to a weaker species that they had no reason to kill, that could nevertheless communicate clearly and had preferences about as coherent as human preferences (while being very alien).
And those creatures are having a conversation amongst themselves before the humans arrive wondering “Are the humans going to murder us all?” And one of them is saying “I don’t know, they don’t actually benefit from murdering us and they seem to care a tiny bit about being nice, maybe they’ll just let us do our thing with 1/trillionth of the universe’s resources?” while another is saying “They will definitely have strong opinions about what our society should look like and the kind of transformation they implement is about as bad by our lights as being murdered.”
In practice attempts to respect someone’s preferences often involve ideas like autonomy and self-determination and respect for their local preferences. I really don’t think you have to go all the way to extrapolated volition in order to avoid killing everyone.
Is this a reasonable paraphrase of your argument?
If so, one guess is that a bunch of disagreement lurks in this “intuitively-reasonable manner” business.
A possible locus of disagreement: it looks to me like, if you give humans power before you give them wisdom, it’s pretty easy to wreck them while simply fulfilling their preferences. (Ex: lots of teens have dumbass philosophies, and might be dumb enough to permanently commit to them if given that power.)
More generally, I think that if mere-humans met very-alien minds with similarly-coherent preferences, and if the humans had the opportunity to magically fulfil certain alien preferences within some resource-budget, my guess is that the humans would have a pretty hard time offering power and wisdom in the right ways such that this overall went well for the aliens by their own lights (as extrapolated at the beginning), at least without some sort of volition-extrapolation.
(I separately expect that if we were doing something more like the volition-extrapolation thing, we’d be tempted to bend the process towards “and they learn the meaning of friendship”.)
That said, this conversation is updating me somewhat towards “a random UFAI would keep existing humans around and warp them in some direction it prefers, rather than killing them”, on the grounds that the argument “maybe preferences-about-existing-agents is just a common way for rando drives to shake out” plausibly supports it to a threshold of at least 1 in 1000. I’m not sure where I’ll end up on that front.
Another attempt at naming a crux: It looks to me like you see this human-style caring about others’ preferences as particularly “simple” or “natural”, in a way that undermines “drawing a target around the bullseye”-type arguments, whereas I could see that argument working for “grant all their wishes (within a budget)” but am much more skeptical when it comes to “do right by them in an intuitively-reasonable way”.
(But that still leaves room for an update towards “the AI doesn’t necessarily kill us, it might merely warp us, or otherwise wreck civilization by bounding us and then giving us power-before-wisdom within those bounds or suchlike, as might be the sort of whims that rando drives shake out into”, which I’ll chew on.)
Isn’t the worst case scenario just leaving the aliens alone? If I’m worried I’m going to fuck up some alien’s preferences, I’m just not going to give them any power or wisdom!
I guess you think we’re likely to fuck up the alien’s preferences by light of their reflection process, but not our reflection process. But this just recurs to the meta level. If I really do care about an alien’s preferences (as it feels like I do), why can’t I also care about their reflection process (which is just a meta preference)?
I feel like the meta level at which I no longer care about doing right by an alien is basically the meta level at which I stop caring about someone doing right by me. In fact, this is exactly how it seems mentally constructed: what I mean by “doing right by [person]” is “what that person would mean by ‘doing right by me’”. This seems like either something as simple as it naively looks, or sensitive to weird hyperparameters I’m not sure I care about anyway.
FWIW this is my view. (Assuming no ECL/MSR or acausal trade or other such stuff. If we add those things in, the situation gets somewhat better in expectation I think, because there’ll be trades with faraway places that DO care about our CEV.)
My reading of the argument was something like “bullseye-target arguments refute an artificially privileged target being rated significantly likely under ignorance, e.g. the probability that random aliens will eat ice cream is not 50%. But something like kindness-in-the-relevant-sense is the universal problem faced by all evolved species creating AGI, and is thus not so artificially privileged, and as a yes-no question about which we are ignorant the uniform prior assigns 50%”. It was more about the hypothesis not being artificially privileged by path-dependent concerns than the notion being particularly simple, per se.
I sometimes mention the possibility of being stored and sold to aliens a billion years later, which seems to me to validly incorporate most all the hopes and fears and uncertainties that should properly be involved, without getting into any weirdness that I don’t expect Earthlings to think about validly.
Why? I see a lot of opportunities for s-risk or just generally suboptimal future in such options, but “we don’t want to die, or at any rate we don’t want to die out as a species” seems like an extremely simple, deeply-ingrained goal that almost any metric by which the AI judges our desires should be expected to pick up, assuming it’s at all pseudokind. (In many cases, humans do a lot to protect endangered species even as we do diddly-squat to fulfill individual specimens’ preferences!)
Some more less-important meta, that is in part me writing out of frustration from how the last few exchanges have gone:
I’m not quite sure what argument you’re trying to have here. Two explicit hypotheses follow, that I haven’t managed to distinguish between yet.
Background context, for establishing common language etc.:
Nate is trying to make a point about inclusive cosmopolitan values being a part of the human inheritance, and not universally compelling.
Paul is trying to make a point about how there’s a decent chance that practical AIs will plausibly care at least a tiny amount about the fulfillment of the preferences of existing “weak agents”, herein called “pico-pseudokindness”.
Hypothesis 1: Nate’s trying to make a point about cosmopolitan values that Paul basically agrees with. But Paul thinks Nate’s delivery gives a wrong impression about the tangentially-related question of pico-pseudokindness, probably because (on Paul’s model) Nate’s wrong about pico-pseudokindness, and Paul is taking the opportunity to argue about it.
Hypothesis 2: Nate’s trying to make a point about cosmopolitan values that Paul basically disagrees with. Paul maybe agrees with all the literal words, but thinks that Nate has misunderstood the connection between pico-pseudokindness and cosmopolitan values, and is hoping to convince Nate that these questions are more than tangentially related.
(Or, well, I have hypothesis-clusters rather than hypotheses, of which these are two representatives, whatever.)
Some notes that might help clear some things up in that regard:
The long version of the title here is not “Cosmopolitan values don’t come cheap”, but rather “Cosmopolitan values are also an aspect of human values, and are not universally compelling”.
I think there’s a common mistake that people outside our small community make, where they’re like “whatever the AIs decide to do, turns out to be good, so long as they decide it while they’re smart; don’t be so carbon-chauvinist and anthropocentric”. A glaring example is Richard Sutton. Heck, I think people inside our community make it decently often, with an example being Robin Hanson.
My model is that many of these people are intuiting that “whatever the AIs decide to do” won’t include vanilla ice cream, but will include broad cosmopolitan value.
It seems worth flatly saying “that’s a crux for me; if I believed that the AIs would naturally have broad inclusive cosmopolitan values then I’d be much more onboard the acceleration train; when I say that the AIs won’t have our values I am not talking just about the “ice cream” part I am also talking about the “broad inclusive cosmopolitan dream” part; I think that even that is at risk”.
If you were to acknowledge something like “yep, folks like Sutton and Hanson are making the mistake you name here, and the broad cosmopolitan dream is very much at risk and can’t be assumed as convergent, but separately you (Nate) seem to be insinuating that you expect it’s hard to get the AIs to care about the broad cosmopolitan dream even a tiny bit, and that it definitely won’t happen by chance, and I want to fight about that here”, then I’d feel like I understood what argument we were having (namely: hypothesis 1 above).
If you were to instead say something like “actually, Nate, I think that these people are accessing a pre-theoretic intuition that’s essentially reasonable, and that you’ve accidentally destroyed with all your premature theorizing, such that I don’t think you should be so confident in your analysis that folk like Sutton and Hanson are making a mistake in this regard”, then I’d also feel like I understood what argument we were having (namely: hypothesis 2 above).
Alternatively, perhaps my misunderstanding runs even deeper, and the discussion you’re trying to have here comes from even farther outside my hypothesis space.
For one reason or another, I’m finding it pretty frustrating to attempt to have this conversation while not knowing which of the above conversations (if either) we’re having. My current guess is that that frustration would ease up if something like hypothesis-1 were true and you made some acknowledgement like the above. (I expect to still feel frustrated in the hypothesis-2 case, though I’m not yet sure why, but might try to tease it out if that turns out to be reality.)
Hypothesis 1 is closer to the mark, though I’d highlight that it’s actually fairly unclear what you mean by “cosmopolitan values” or exactly what claim you are making (and that ambiguity is hiding most of the substance of disagreements).
I’m raising the issue of pico-pseudokindness here because I perceive it as (i) an important undercurrent in this post, (ii) an important part of the actual disagreements you are trying to address. (I tried to flag this at the start.)
More broadly, I don’t really think you are engaging productively with people who disagree with you. I suspect that if you showed this post to someone you perceive yourself to be arguing with, they would say that you seem not to understand the position—the words aren’t really engaging with their view, and the stories aren’t plausible on their models of the world but in ways that go beyond the literal claim in the post.
I think that would hold in particular for Robin Hanson or Rich Sutton. I don’t think they are accessing a pre-theoretic intuition that you are discarding by premature theorizing. I think the better summary is that you don’t understand their position very well or are choosing not to engage with the important parts of it. (Just as Robin doesn’t seem to understand your position ~at all.)
I don’t think the point about pico-pseudokindness is central for either Robin Hanson or Rich Sutton. I think it is more obviously relevant to a bunch of recent arguments Eliezer has gotten into on Twitter.
Thanks! I’m curious for your paraphrase of the opposing view that you think I’m failing to understand.
(I put >50% probability that I could paraphrase a version of “if the AIs decide to kill us, that’s fine” that Sutton would basically endorse (in the right social context), and that would basically route through a version of “broad cosmopolitan value is universally compelling”, but perhaps when you give a paraphrase it will sound like an obviously-better explanation of the opposing view and I’ll update.)
I think a closer summary is:
I don’t think that requires anything at all about AI systems converging to cosmopolitan values in the sense you are discussing here. I do think it is much more compelling if you accept some kind of analogy between the sorts of processes shaping human values and the processes shaping AI values, but this post (and the references you cite and other discussions you’ve had) don’t actually engage with the substance of that analogy and the kinds of issues raised in my comment are much closer to getting at the meat of the issue.
I also think the “not for free” part doesn’t contradict the views of Rich Sutton. I asked him this question and he agrees that all else equal it would be better if we handed off to human uploads instead of powerful AI. I think his view is that the proposed course of action from the alignment community is morally horrifying (since in practice he thinks the alternative is “attempt to have a slave society,” not “slow down AI progress for decades”—I think he might also believe that stagnation is much worse than a handoff but haven’t heard his view on this specifically) and that even if you are losing something in expectation by handing the universe off to AI systems it’s not as bad as the alternative.
Thanks! Seems like a fine summary to me, and likely better than I would have done, and it includes a piece or two that I didn’t have (such as an argument from symmetry if the situations were reversed). I do think I knew a bunch of it, though. And e.g., my second parable was intended to be a pretty direct response to something like
where it’s essentially trying to argue that this intuition pump still has force in precisely this case.
To the extent the second parable has this kind of intuitive force I think it comes from: (i) the fact that the resulting values still sound really silly and simple (which I think is mostly deliberate hyperbole), (ii) the fact that the AI kills everyone along the way.
This comment changed my mind on the probability that evolved aliens are likely to end up kind, which I now think is somewhat more likely than 5%. I still think AI systems are unlikely to have kindness, for something like the reason you give at the end:
I actually think it’s somewhat likely that ML systems won’t value kindness at all before they are superhuman enough to take over. I expect kindness as a value within the system itself not to arise spontaneously during training, and that no one will succeed at eliciting it deliberately before take over. (The outward behavior of the system may appear to be kind, and mechanistic interpretability may show that some internal component of the system has a correct understanding of kindness. But that’s not the same as the system itself valuing kindness the way that humans do or aliens might.)
Might write a longer reply at some point, but the reason why I don’t expect “kindness” in AIs (as you define it here) is that I don’t expect “kindness” to be the kind of concept that is robust to cosmic levels of optimization pressure applied to it, and I expect it will instead come apart when you apply various reflective principles and eliminate any status-quo bias, even if it exists in an AI mind (and I also think it is quite plausible that it is completely absent).
Like, different versions of kindness might or might not put almost all of their considerateness on all the different types of minds that could hypothetically exist, instead of the minds that currently exist right now. Indeed, I expect it’s more likely than not that I myself will end up in that moral equilibrium, and won’t be interested in extending any special consideration to systems that happened to have been alive in 2022, instead of the systems that could have been alive and seem cooler to me to extend consideration towards.
Another way to say the same thing is that if AI extends consideration towards something human-like, I expect that it will use some superstimuli-human-ideal as a reference point, which will be a much more ideal thing to be kind towards than current humans by its own lights (for an LLM this might be cognitive processes much more optimized for producing internet text than current humans, though that is really very speculative, and is more trying to illustrate the core idea of a superstimuli-human). I currently think few superstimuli-humans like this would still qualify by my lights to count as “human” (though it might by the lights of the AI).
I do find the game-theoretic and acausal trade case against AI killing literally everyone stronger, though it does depend on the chance of us solving alignment in the first place, and so feels a bit recursive in these conversations (like, in order for us to be able to negotiate with the AIs, there needs to be some chance we end up in control of the cosmic endowment in the first place, otherwise we don’t have anything to bargain with).
Is this a fair summary?
If so, it seems like you wouldn’t be making an argument about AI or aliens at all, but rather an empirical claim about what would happen if humans were to think for a long time (and become more the people we wished to be and so on).
That seems like an important angle that my comment didn’t address at all. I personally don’t believe that humans would collectively stamp out 99% of their kindness to existing agents (in favor of utilitarian optimization) if you gave them enough time to reflect. That sounds like a longer discussion. I also think that if you expressed the argument in this form to a normal person they would be skeptical about the strong claims about human nature (and would be skeptical of doomer expertise on that topic), and so if this ends up being the crux it’s worth being aware of where the conversation goes and my bottom line recommendation of more epistemic humility may still be justified.
It’s hard to distinguish human kindness from arguably decision-theoretic reasoning like “our positions could have been reversed, would I want them to do the same to me?” but I don’t think the distinction between kindness and common-sense morality and decision theory is particularly important here except insofar as we want to avoid double-counting.
(This does call to mind another important argument that I didn’t discuss in my original comment: “kindness is primarily a product of moral norms produced by cultural accumulation and domestication, and there will be no analogous process amongst AI systems.” I have the same reaction as to the evolutionary psychology explanations. Evidently the resulting kindness extends beyond the actual participants in that cultural process, so I think you need to be making more detailed guesses about minds and culture and so on to have a strong a priori view between AI and humans.)
No, this doesn’t feel accurate. What I am saying is more something like:
The way humans think about the question of “preferences for weak agents” and “kindness” feels like the kind of thing that will come apart under extreme optimization, in a similar way to how I expect the idea of “having a continuous stream of consciousness with a good past and good future is important” to come apart as humans can make copies of themselves and change their memories, and instantiate slightly changed versions of themselves, etc.
The way this comes apart seems very chaotic to me, and dependent enough on the exact metaethical and cultural and environmental starting conditions that I wouldn’t be that surprised if I disagree even with other humans on their resulting conceptualization of “kindness” (and e.g. one endpoint might be that I end up not having a special preference for currently-alive beings, but there are thousands, maybe millions of ways for this concept to fray apart under optimization pressure).
In other words, I think it’s plausible that at something like human level of capabilities and within a roughly human ontology (which AIs might at least partially share, though how much is quite uncertain to me), the concept of kindness as assigning value to the extrapolated preferences of beings that currently exist might be a thing that an AI could share. But I expect it to not hold up under reflection, and much greater power, and predictable ontological changes (that I expect any AI to go through as it reaches superintelligence), so that the resulting reflectively stable and optimized idea of kindness will not meaningfully result in current humans’ genuine preferences being fulfilled (by my own lights of what it means to extrapolate and fulfill someone’s preferences). The space of possibilities in which this concept could fray apart seems quite great, and many of the endpoints are unlikely to align with my endpoints of this concept.
Edit (some more thoughts): The thing you said feels related to that in that I think my own pretty huge uncertainty about how I will relate to kindness on reflection is evidence that I think iterating on that concept will be quite chaotic and different for different minds.
I do want to push back on “in favor of utilitarian optimization”. That is not what I am saying, or at least it feels somewhat misleading.
I am saying that I think it’s pretty likely that upon reflection I no longer think that my “kindness” goals are meaningfully achieved by caring about the beings alive in 2022, and that it would be more kind, by my own lights, to not give special consideration to beings who happened to be alive right now. This isn’t about “trading off kindness in favor of utilitarian optimization”, it’s saying that when you point towards the thing in me that generates an instinct towards kindness, I can imagine that as I more fully realize what that instinct cashes out to in terms of preferences, that it will not result in actually giving consideration to e.g. rats that are currently alive, or would give consideration to some archetype of a rat that is actually not really that much like a rat, because I don’t even really know what it means for a rat to want something, and similarly the way the AI relates to the question of “do humans want things” will feel similarly underdetermined (and again, these are just concrete examples of how the concept could come apart, not trying to be an exhaustive list of ways the concept could fall apart).
I think some of the confusion here comes from my using “kind” to refer to “respecting the preferences of existing weak agents.” I don’t have a better handle but could have just used a made-up word.
I don’t quite understand your objection to my summary—it seems like you are saying that notions like “kindness” (that might currently lead you to respect the preferences of existing agents) will come apart and change in unpredictable ways as agents deliberate. The result is that smart minds will predictably stop respecting the preferences of existing agents, up to and including killing them all to replace them with something that more efficiently satisfies other values (including whatever kind of form “kindness” may end up taking, e.g. kindness towards all the possible minds who otherwise won’t get to exist).
I called this utilitarian optimization but it might have been more charitable to call it “impartial” optimization. Impartiality between the existing creatures and the not-yet-created creatures seems like one of the key characteristics of utilitarianism while being very rare in the broader world. It’s also “utilitarian” in the sense that it’s willing to spare nothing (or at least not 1/trillion) for the existing creatures, and this kind of maximizing stance is also one of the big defining features of utilitarianism. So I do still feel like “utilitarian” is an OK way of pointing at the basic difference between where you expect intelligent minds will end up vs how normal people think about concepts like being nice.
Yeah, sorry, I noticed the same thing a few minutes ago, that I was probably at least somewhat misled by the more standard meaning of kindness.
Tabooing “kindness” I am saying something like:
Yes, I don’t think extrapolated current humans assign approximately any value to the exact preference of “respecting the preferences of existing weak agents” and I don’t really believe that you would on-reflection endorse that preference either.
Separately (though relatedly), each word in that sentence sure feels like the kind of thing that I do not feel comfortable leaning on heavily as I optimize strongly against it, and that hides a ton of implicit assumptions, like ‘agent’ being a meaningful concept in the first place, or ‘existing’ or ‘weak’ or ‘preferences’, all of which I expect I would think are probably terribly confused concepts to use after I had understood the real concepts that carve reality more at its joints, and this means this sentence sounds deceptively simple or robust, but really doesn’t feel like the kind of thing whose meaning will stay simple as an AI does more conceptual refinement.
The reason why I objected to this characterization is that I was trying to point at a more general thing than the “impartialness”. Like, to paraphrase what this sentence sounds like to me, it’s more as if someone from a pre-modern era was arguing about future civilizations and said “It’s weird that your conception of future humans are willing to do nothing for the gods that live in the sky, and the spirits that make our plants grow”.
Like, after a bunch of ontological reflection and empirical data gathering, “gods” is just really not a good abstraction for things I care about anymore. I don’t think “impartiality” is what is causing me to not care about gods, it’s just that the concept of “gods” seems fake and doesn’t carve reality at its joints anymore. It’s also not the case that I don’t care at all about ancient gods anymore (they are pretty cool and I like the aesthetic), but the way I care about them is very different from how I care about other humans.
Not caring about gods doesn’t feel “harsh” or “utilitarian” or in some sense like I have decided to abandon any part of my values. This is what I expect it to feel like for a future human to look back at our meta-preferences for many types of other beings, and also what it feels like for AIs that maybe have some initial version of ‘caring about others’ when they are at similar capability levels to humans.
This again isn’t capturing my objection perfectly, but maybe helps point to it better.
I am quite confident that I do, and it tends to infuriate my friends who get cranky that I feel a moral obligation to respect the artistic intent of bacterial genomes: all bacteria should go vegan, yet survive, and eat food equivalent to their previous.
I feel pretty uncertain of what assumptions are hiding in your “optimize strongly against X” statements. Historically this just seems hard to tease out, and I wouldn’t be surprised if I were just totally misreading you here.
That said, your writing makes me wonder “where is the heavy optimization [over the value definitions] coming from?”, since I think the preference-shards themselves are the things steering the optimization power. For example, the shards are not optimizing over themselves to find adversarial examples to themselves. Related statements:
I think that a realistic “respecting preferences of weak agents”-shard doesn’t bid for plans which maximally activate the “respect preferences of weak agents” internal evaluation metric, or even do some tight bounded approximation thereof.
A “respect weak preferences” shard might also guide the AI’s value and ontology reformation process.
A nice person isn’t being maximally nice, nor do they wish to be; they are nicely being nice.
I do agree (insofar as I understand you enough to agree) that we should worry about some “strong optimization over the AI’s concepts, later in AI developmental timeline.” But I think different kinds of “heavy optimization” lead to different kinds of alignment concerns.
When I try to interpret your points here, I come to the conclusion that you think humans, upon reflection, would cause human extinction (in favor of resources being used for something else).
Or at least that many/most humans would, upon reflection, prefer resources to be used for purposes other than preserving human life (including not preserving human life in simulation). And this holds even if (some of) the existing humans ‘want’ to be preserved (at least according to a conventional notion of preferences).
I think this empirical view seems pretty implausible.
That said, I think it’s quite plausible that upon reflection, I’d want to ‘wink out’ any existing copies of myself in favor of using resources for better things. But this is partially because I personally (in my current state) would endorse such a thing: if my extrapolated volition thought it would be better to not exist (in favor of other resource usage), my current self would accept that. And, I think it currently seems unlikely that upon reflection, I’d want to end all human lives (in particular, I think I probably would want to keep humans alive who had preferences against non-existence). This applies regardless of trade; it’s important to note this to avoid a ‘perpetual motion machine’ type argument.
Beyond this, I think that most or many humans or aliens would, upon reflection, want to preserve currently existing humans or aliens who had a preference against non-existence. (Again, regardless of trade.)
Additionally, I think it’s quite plausible that most or many humans or aliens will enact various trades or precommitments prior to reflecting (which is probably ill-advised, but it will happen regardless). So current preferences which aren’t stable under reflection might have a significant influence overall.
This feels like it is not really understanding my point, though maybe best to move this to some higher-bandwidth medium if the point is that hard to get across.
Giving it one last try: What I am saying is that I don’t think “conventional notion of preferences” is a particularly well-defined concept, and neither are a lot of other concepts you are using in order to make your predictions here. What it means to care about the preferences of others is a thing with a lot of really messy details that tend to blow up in different ways when you think harder about it and are less anchored on the status-quo.
I don’t think you currently know in what ways you would care about the preferences of others after a lot of reflection (barring game-theoretic considerations which I think we can figure out a bit more in-advance, but I am bracketing that whole angle in this discussion, though I totally agree those are important and relevant). I do think you will of course endorse the way you care about other people’s preferences after you’ve done a lot of reflection (otherwise something went wrong in your reflection process), but I don’t think you would endorse what AIs would do, and my guess is you also wouldn’t endorse what a lot of other humans would do when they undergo reflection here.
Like, what I am saying is that while there might be a relatively broad basin of conditions that give rise to something that locally looks like caring about other beings, the space of caring about other beings is deep and wide, and if you have an AI that cares about other beings’ preferences in some way you don’t endorse, this doesn’t actually get you anything. And I think the arguments that the concept of “caring about others” that an AI might have (though my best guess is that it won’t even have anything that is locally well-described by that) will hold up after a lot of reflection seem much weaker to me than the arguments that it will have that preference at roughly human capability and ethical-reflection levels (which seems plausible to me, though still overall unlikely).
Zeroth approximation of pseudokindness is strict nonintervention, reifying the patient-in-environment as a closed computation and letting it run indefinitely, with some allocation of compute. Interaction with the outside world creates vulnerability to external influence, but then again so does incautious closed computation, as we currently observe with AI x-risk, which is not something beamed in from outer space.
Formulation of the kinds of external influences that are appropriate for a particular patient-in-environment is exactly the topic of membranes/boundaries; this task can be taken as the defining desideratum for the topic. Specifically, the question is which environments can be put in contact with a particular membrane without corrupting it, which is why I think membranes are relevant to pseudokindness. Naturality of the membranes/boundaries abstraction is linked to naturality of the pseudokindness abstraction.
In contrast, the language of preferences/optimization seems to be the wrong frame for formulating pseudokindness; it wants to discuss ways of intervening and influencing, of not leaving value on the table, rather than ways of offering acceptable options that avoid manipulation. It might be possible to translate pseudokindness back into the language of preferences, but this translation would induce a kind of deontological prior on preferences that makes the more probable preferences look rather surprising/unnatural from a more preferences-first point of view.
Thanks for writing this. I also think what we want from pseudokindness is captured by membranes/boundaries.
Possibly relevant?
If the result of an optimization process will be predictably horrifying to the agents which are applying that optimization process to themselves, then they will simply not do so.
In other words: AIs which feel anything in the vicinity of kindness before applying cosmic amounts of optimization pressure to themselves will try to steer that optimization pressure towards something which is recognizably kind at the end.
And I don’t think there’s any good argument for why AIs will lack any scrap of kindness with very high confidence at the point where they’re just starting to recursively self-improve.
Meta: I feel pretty annoyed by the phenomenon of which this current conversation is an instance, because when people keep saying things that I strongly disagree with which will be taken as representing a movement that I’m associated with, the high-integrity (and possibly also strategically optimal) thing to do is to publicly repudiate those claims*, which seems like a bad outcome for everyone. I model it as an epistemic prisoner’s dilemma with the following squares:
D, D: doomers talk a lot about “everyone dies with >90% confidence”, non-doomers publicly repudiate those arguments
C, D: doomers talk a lot about “everyone dies with >90% confidence”, non-doomers let those arguments become the public face of AI alignment despite strongly disagreeing with them
D, C: doomers apply higher epistemic standards on this issue (from the perspective of non-doomers); non-doomers keep applying pressure to doomers to “sanitize” even more aspects of their communication
C, C: doomers apply higher epistemic standards on this issue (from the perspective of non-doomers); non-doomers support doomers making their arguments
I model us as being in the C, D square and I would like to move to the C, C square so I don’t have to spend my time arguing about epistemic standards or repudiating arguments from people who are also trying to prevent AI xrisk. I expect that this is basically the same point that Paul is making when he says “if we can’t get on the same page about our predictions I’m at least aiming to get folks to stop arguing so confidently for death given takeover”.
I expect that you’re worried about ending up in the D, C square, so in order to mitigate that concern I’m open to making trades on other issues where doomers and non-doomers disagree; I expect you’d know better than I do what trades would be valuable for you here. (One example of me making such a trade in the past was including a week on agent foundations in the AGISF curriculum despite inside-view not thinking it was a good thing to spend time on.) For example, I am open to being louder in other cases where we both agree that someone else is making a bad argument (but which don’t currently meet my threshold for “the high-integrity thing is to make a public statement repudiating that argument”).
* my intuition here is based on the idea that not repudiating those claims is implicitly committing a multi-person motte and bailey (but I can’t find the link to the post which outlines that idea). I expect you (Habryka) agree with this point in the abstract because of previous cases where you regretted not repudiating things that leading EAs were saying, although I presume that you think this case is disanalogous.
For what it’s worth, I think you should just say that you disagree with it? I don’t really understand why this would be a “bad outcome for everyone”. Just list out the parts you agree on, and list the parts you disagree on. Coalitions should mostly be based on epistemological principles and ethical principles anyways, not object-level conclusions, so at least in my model of the world repudiating my statements if you disagree with them is exactly what I want my allies to do.
If you on the other hand think the kind of errors you are seeing are evidence about some kind of deeper epistemological problems, or ethical problems, such that you no longer want to be in an actual coalition with the relevant people (or think that the costs of being perceived to be in some trade-coalition with them would outweigh the benefits of actually being in that coalition), I think it makes sense to socially distance yourself from the relevant people, though I think your public statements should mostly just accurately reflect how much you are indeed deferring to individuals, how much trust you are putting into them, how much you are engaging in reputation-trades with them, etc.
When I say “repudiate” I mean a combination of publicly disagreeing + distancing. I presume you agree that this is suboptimal for both of us, and my comment above is an attempt to find a trade that avoids this suboptimal outcome.
Note that I’m fine to be in coalitions with people when I think their epistemologies have problems, as long as their strategies are not sensitively dependent on those problems. (E.g. presumably some of the signatories of the recent CAIS statement are theists, and I’m fine with that as long as they don’t start making arguments that AI safety is important because of theism.) So my request is that you make your strategies less sensitively dependent on the parts of your epistemology that I have problems with (and I’m open to doing the same the other way around in exchange).
This feels like it somewhat misunderstands my point. I don’t expect the reflection process I will go through to feel predictably horrifying from the inside. But I do expect the reflection process the AI will go through to feel horrifying to me (because the AI does not share all my metaethical assumptions, and preferences over reflection, and environmental circumstances, and principles by which I trade off values between different parts of me).
This feels like a pretty common experience. Many people in EA seem to quite deeply endorse various things like hedonic utilitarianism, in a way where the reflection process that led them to that opinion feels deeply horrifying to me. Of course it didn’t feel deeply horrifying to them (or at least it didn’t on the dimensions that were relevant to their process of meta-ethical reflection), otherwise they wouldn’t have done it.
Relevant sense of kindness is towards things that happen to already exist, because they already exist. Not filling some fraction of the universe with expression-of-kindness, brought into existence de novo, that’s a different thing.
Paul, this is very thought provoking, and has caused me to update a little. But:
I loathe factory-farming, and I would spend a large fraction of my own resources to end it, if I could.
I believe that makes me unusually kind by human standards, and by your definition.
I like chickens, and I wish them well.
And yet I would not bat an eyelid at the thought of a future with no chickens in it.
I would not think that a perfect world could be improved by adding chickens.
And I would not trade a single happy human soul for an infinity of happy chickens.
I think that your single known example is not as benevolent as you think.
If a misaligned AI had 1/trillion “protecting the preferences of whatever weak agents happen to exist in the world”, why couldn’t it also have 1/trillion other vaguely human-like preferences, such as “enjoy watching the suffering of one’s enemies” or “enjoy exercising arbitrary power over others”?
From a purely selfish perspective, I think I might prefer that a misaligned AI kills everyone, and take my chances with continuations of myself (my copies/simulations) elsewhere in the multiverse, rather than face whatever the sum-of-desires of the misaligned AI decides to do with humanity. (With the usual caveat that I’m very philosophically confused about how to think about all of this.)
As I said:
I think it’s totally plausible for the AI to care about what happens with humans in a way that conflicts with our own preferences. I just don’t believe it’s because AI doesn’t care at all one way or the other (such that you should make predictions based on instrumental reasoning like “the AI will kill humans because it’s the easiest way to avoid future conflict” or other relatively small considerations).
I’m worried that people, after reading your top-level comment, will become too little worried about misaligned AI (from their selfish perspective), because it seems like you’re suggesting (conditional on misaligned AI) 50% chance of death and 50% alive and well for a long time (due to 1/trillion kindness), which might not seem so bad compared to keeping AI development on hold indefinitely which potentially implies a high probability of death from old age.
I feel like “misaligned AI kills everyone because it doesn’t care at all” can be a reasonable lie-to-children (for many audiences) since it implies a reasonable amount of concern about misaligned AI (from both selfish and utilitarian perspectives) while the actual all-things-considered case for how much to worry (including things like simulations, acausal trade, anthropics, bigger/infinite universes, quantum/modal immortality, s-risks, 1/trillion values) is just way too complicated and confusing to convey to most people. Do you perhaps disagree and think this simplified message is too alarming?
My objection is that the simplified message is wrong, not that it’s too alarming. I think “misaligned AI has a 50% chance of killing everyone” is practically as alarming as “misaligned AI has a 95% chance of killing everyone,” while being a much more reasonable best guess. I think being wrong is bad for a variety of reasons. It’s unclear if you should ever be in the business of telling lies-told-to-children to adults, but you certainly shouldn’t be doubling down on them in an argument.
I don’t think misaligned AI drives the majority of s-risk (I’m not even sure that s-risk is higher conditioned on misaligned AI), so I’m not convinced that it’s a super relevant communication consideration here. The future can be scary in plenty of ways other than misaligned AI, and it’s worth discussing those as part of “how excited should we be for faster technological change.”
I regret mentioning “lie-to-children” as it seems a distraction from my main point. (I was trying to introspect/explain why I didn’t feel as motivated to express disagreement with the OP as you, not intending to advocate or endorse anyone going into “the business of telling lies-told-to-children to adults”.)
My main point is that I think “misaligned AI has a 50% chance of killing everyone” isn’t alarming enough, given what I think happens in the remaining 50% of worlds, versus what a typical person is likely to infer from this statement, especially after seeing your top-level comment where you talk about “kindness” at length. Can you try to engage more with this concern? (Apologies if you already did, and I missed your point instead.)
(Addressing this since it seems like it might be relevant to my main point.) I find it very puzzling that you think “misaligned AI has a 50% chance of killing everyone” is practically as alarming as “misaligned AI has a 95% chance of killing everyone”. Intuitively it seems obvious that the latter should be almost twice as alarming as the former. (I tried to find reasons why this intuition might be wrong, but couldn’t.) The difference also seems practically relevant (if by “practically as alarming” you mean the difference is not decision/policy relevant). In the grandparent comment I mentioned that the 50% case “might not seem so bad compared to keeping AI development on hold indefinitely which potentially implies a high probability of death from old age” but you didn’t seem to engage with this.
Yeah, I think “no control over future, 50% you die” is like 70% as alarming as “no control over the future, 90% you die.” Even if it was only 50% as concerning, all of these differences seem tiny in practice compared to other sources of variation in “do people really believe this could happen?” or other inputs into decision-making. I think it’s correct to summarize as “practically as alarming.”
I’m not sure what you want engagement with. I don’t think the much worse outcomes are closely related to unaligned AI so I don’t think they seem super relevant to my comment or Nate’s post. Similarly for lots of other reasons the future could be scary or disorienting. I do explicitly flag the loss of control over the future in that same sentence. I think the 50% chance of death is probably in the right ballpark from the perspective of selfish concern about misalignment.
Note that the 50% probability of death includes the possibility of AI having preferences about humans incompatible with our survival. I think the selection pressure for things like spite is radically weaker for the kinds of AI systems produced by ML than for humans (for simple reasons—where is the upside to the AI from spite during training? seems like if you get stuff like threats it will primarily be instrumental rather than a learned instinct) but didn’t really want to get into that in the post.
In your initial comment you talked a lot about AI respecting the preferences of weak agents (using 1/trillion of its resources) which implies handing back control of a lot of resources to humans, which from the selfish or scope insensitive perspective of typical humans probably seems almost as good as not losing that control in the first place.
If people think that (conditional on unaligned AI) in 50% of worlds everyone dies and the other 50% of worlds typically look like small utopias where existing humans get to live out long and happy lives (because of 1/trillion kindness), then they’re naturally going to think that aligned AI can only be better than that. So even if s-risks apply almost equally to both aligned and unaligned AI, I still want people to talk about it when talking about unaligned AIs, or take some other measure to ensure that people aren’t potentially misled like this.
(It could be that I’m just worrying too much here, that empirically people who read your top-level comment won’t get the impression that close to 50% of worlds with unaligned AIs will look like small utopias. If this is what you think, I guess we could try to find out, or just leave the discussion here.)
Maybe the AI develops it naturally from multi-agent training (intended to make the AI more competitive in the real world) or the AI developer tried to train some kind of morality (e.g. sense of fairness or justice) into the AI.
I think “50% you die” is more motivating to people than “90% you die” because in the former, people are likely to be able to increase the absolute chance of survival more, because at 90%, extinction is overdetermined.
I think I tend to base my level of alarm on the log of the severity*probability, not the absolute value. Most of the work is getting enough info to raise a problem to my attention to be worth solving. “Oh no, my house has a decent >30% chance of flooding this week, better do something about it, and I’ll likely enact some preventative measures whether it’s 30% or 80%.” The amount of work I’m going to put into solving it is not twice as much if my odds double, mostly there’s a threshold around whether it’s worth dealing with or not.
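To make the log-scale intuition concrete, here is a minimal sketch. The scoring rule (alarm as log10 of severity times probability) is one reading of the paragraph above, and the function names and the 0.1% comparison case are hypothetical, chosen only for illustration.

```python
import math

def log_alarm(probability, severity=1.0):
    """Hypothetical alarm score: log10 of severity * probability."""
    return math.log10(severity * probability)

def gap_in_orders_of_magnitude(p_low, p_high, severity=1.0):
    """How much more alarming p_high is than p_low on the log scale."""
    return log_alarm(p_high, severity) - log_alarm(p_low, severity)

# Flood example from the comment: 30% versus 80%.
flood_gap = gap_in_orders_of_magnitude(0.30, 0.80)

# The disputed pair: 50% versus 95%.
doom_gap = gap_in_orders_of_magnitude(0.50, 0.95)

# Assumed contrast case: moving from a 0.1% risk to a 50% risk.
rare_gap = gap_in_orders_of_magnitude(0.001, 0.50)

print(f"30% -> 80%:  {flood_gap:.2f} orders of magnitude")
print(f"50% -> 95%:  {doom_gap:.2f} orders of magnitude")
print(f"0.1% -> 50%: {rare_gap:.2f} orders of magnitude")
```

Under this assumed rule, the 50% versus 95% gap is well under half an order of magnitude, while getting a risk noticed at all can span several. The rule ignores how much an intervention could change the odds, so it only illustrates the threshold effect described above.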
Setting that aside, it reads to me like the frame-clash happening here is (loosely) between “50% extinction, 50% not-extinction” and “50% extinction, 50% utopia”, where for the first gamble of course 1:1 odds on extinction is enough to raise it to “we need to solve this damn problem”, but for the second gamble it’s actually much more relevant whether it’s a 1:1 or a 20:1 bet. I’m not sure which one is the relevant one for you two to consider.
Yeah, I think this is a factor. Paul talked a lot about “1/trillion kindness” as the reason for non-extinction, but 1/trillion kindness seems to directly imply a small utopia where existing humans get to live out long and happy lives (even better/longer lives than without AI) so it seemed to me like he was (maybe unintentionally) giving the reader a frame of “50% extinction, 50% small utopia”, while still writing other things under the “50% extinction, 50% not-extinction” frame himself.
Not a direct implication, because the AI might have other human-concerning preferences that are larger than 1/trillion. Cf. top-level comment: “I’m not talking about whether the AI has spite or other strong preferences that are incompatible with human survival, I’m engaging specifically with the claim that AI is likely to care so little one way or the other that it would prefer just use the humans for atoms.”
I’d guess “most humans survive” vs. “most humans die” probabilities don’t correspond super closely to “presence of small pseudo-kindness”. Because of how other preferences could outweigh that, and because cooperation/bargaining is a big reason for why humans might survive aside from intrinsic preferences.
Yeah, I think that:
“AI doesn’t care about humans at all so kills them incidentally” is not most of the reason that AIs may kill humans, and my bottom line 50% probability of AI killing us also includes the other paths (AI caring a bit but failing to coordinate to avoid killing humans, conflict during takeover leading to killing lots of humans, AI having scope-sensitive preferences for which not killing humans is a meaningful cost, preserving humans being surprisingly costly, AI having preferences about humans like spite for which human survival is a cost...).
To the extent that it’s possible to distinguish “intrinsic pseudokindness” from decision-theoretic considerations leading to pseudokindness, I think that decision-theoretic considerations are more important. (I don’t have a strong view on the relative importance of ECL and acausal trade, and I think these are hard to disentangle from fuzzier psychological considerations and it all tends to interact.)
Could you say more about what you mean? If the AI has no discount rate, leaving Earth to the humans may require within a few orders of magnitude of 1/trillion kindness. However, if the AI does have a significant discount rate, then delays could be costly to it. Still, the AI could make much more progress in building a Dyson swarm from the moon/Mercury/asteroids with their lower gravity and no atmosphere, allowing the AI to launch material very quickly. My very rough estimate indicates sparing Earth might only delay the AI a month from taking over the universe. That could require a lot of kindness if they have very high discount rates. So maybe training should emphasize the superiority of low discount rates?
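A rough way to see how the discount rate interacts with the one-month delay is sketched below. It assumes plain exponential discounting and assumes that "kindness" can be read as the fraction of discounted value the AI is willing to give up; both are simplifications, and the function names are hypothetical.

```python
import math

def fraction_lost_to_delay(annual_discount_rate, delay_years):
    """Share of discounted value lost if all payoffs are pushed back by
    delay_years, assuming exponential discounting: 1 - (1 - r) ** d."""
    # log1p/expm1 keep precision when the rate is extremely small.
    return -math.expm1(delay_years * math.log1p(-annual_discount_rate))

def break_even_annual_rate(acceptable_loss, delay_years):
    """Annual discount rate at which the delay costs exactly acceptable_loss."""
    return -math.expm1(math.log1p(-acceptable_loss) / delay_years)

ONE_MONTH = 1.0 / 12.0   # delay estimate taken from the comment above
ONE_TRILLIONTH = 1e-12   # the "1/trillion kindness" budget

# Assumed example: a 1% annual discount rate.
loss_at_one_percent = fraction_lost_to_delay(0.01, ONE_MONTH)

# Rate below which a one-month delay fits inside a 1/trillion budget.
max_rate = break_even_annual_rate(ONE_TRILLIONTH, ONE_MONTH)

print(f"Loss at 1%/year: {loss_at_one_percent:.2e}")
print(f"Break-even rate: {max_rate:.2e} per year")
```

Under these assumptions, even a 1% annual discount rate makes a one-month delay cost on the order of a thousandth of total value, far above 1/trillion, while a rate near zero makes the delay essentially free.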
Sorry, I meant “scope-insensitive,” and really I just meant an even broader category of like “doesn’t care 10x as much about getting 10x as much stuff.” I think discount rates or any other terminal desire to move fast would count (though for options like “survive in an unpleasant environment for a while” or “freeze and revive later” the required levels of kindness may still be small).
(A month seems roughly right to me as the cost of not trashing Earth’s environment to the point of uninhabitability.)
I’m curious what does, in that case; and what proportion affects humans (and currently-existing people or future minds)? Things like spite threat commitments from a misaligned AI warring with humanity seem like a substantial source of s-risk to me.
I expect the notkilleveryone threshold is much lower than that. It takes an astronomically tiny fraction of cosmic endowment to maintain a (post)human civilization that’s not too much larger than it currently is. The bigger expenditure would be accommodating humanity at the start, slightly delaying initial self-improvement and expansion from Earth. The cheapest way would be to back up human minds; or if that’s too onerous then even merely the genetic code and the Internet (which would be completely free; there is the issue that everyone would die, but human civilization still survives). Then, after the time- and resource-sensitive parts of initial expansion are done, humanity can be instantiated again. So all it takes is enough kindness to slightly divert initial expansion efforts to make a backup, not even to physically leave us be.
I think this sort of kindness should be an explicit alignment target. Even if we completely fail at values alignment and corrigibility, this is the backup plan that should be an actual plan and not just a hope. Even prosaic alignment that tries to instill corporate speak should also do a token effort to try and instill a bit of kindness, even if that probably has no direct effect. Adding kindness to every aspect of AI might still leave a tiny mark. Not even trying makes it less likely.
(Most of my p(extinction) is in recursively self-improving AGIs with simple values built by first human-built AGIs that are not smart enough or too obedient to human operators to not-do/prevent that. So I think being wary of AI x-risk is an even more important trait for AIs to have than kindness, as it takes more of it.)
(Strong-upvote, weak-disagree. I sadly don’t have time right now to reflect and write why I disagree with this position but I hope someone else who disagrees does.)
Can’t speak for Nate and Eliezer, but I expect kindness to be somewhat rare among evolved aliens (I think Eliezer’s wild guess is 5%? That sounds about right to me), and the degree to which they are kind will vary, possibly from only very slightly kind (or kind only under a very cosmopolitan view of kindness), to as kind as or kinder than humans.
For AIs that humans are likely to build soon, I think there is significant probability (more than 50%, less than 99%? 90% seems fair) that they have literally 0 kindness. One reason is that I expect there is a significant chance that there is nothing within the first superintelligent AI systems to care about kindness or anything else, in the way that humans and aliens might care about something. If an AI system is superintelligent, then by assumption, some component piece of the system will necessarily have a deep and correct understanding of kindness (and many other things), and be capable of manipulating that understanding to achieve some goals. But understanding kindness is different from the system itself valuing kindness, or from there being anything at all “there” to have values of any kind whatsoever.
I think that current AI systems don’t provide much evidence on this question one way or the other, and as I’ve said elsewhere, arguments about this which rely on pattern matching human cognition to structures in current AI systems often fail to draw the understanding / valuing distinction sharply enough, in my view.
So a 90% chance of ~0 kindness is mostly just a made-up guess, but it still feels like a better guess to me than a shaky, overly-optimistic argument about how AI systems designed by processes which look nothing like human (or alien) evolution will produce minds which, very luckily for us, just so happen to share an important value with minds produced by evolution.
For the first half, can you elaborate on what ‘actual emotional content’ there is in this post, as opposed to perceived emotional content?
My best guess for the second half is that maybe the intended meaning was: ‘this particular post looks wrong in an important way (relating to the ‘actual emotional content’) so the following points should be considered even though the literal claim is true’?
I mean that if you tell a story about the AI or aliens killing everyone, then the valence of the story is really tied up with the facts that (i) they killed everyone, and weren’t merely “not cosmopolitan,” (ii) this is a reasonably likely event rather than a possibility.
Yeah, I mean that someone reading this post and asking themselves “Does this writing reflect a correct understanding of the world?” could easily conclude “nah, this seems off” even if they agree with Nate about the narrower claim that cosmopolitan values don’t come free.
I take it ‘valence’ here means ‘emotional valence’, i.e. the extent to which an emotion is positive or negative?
Hard agree about death/takeover decoupling! I’ve lately been suspecting that P(doom) should actually just be taboo’d, because I’m worried it prevents people from constraining their anticipation or characterizing their subjective distribution over outcomes. It seems very thought-stopping!
commenting here so I can find this comment again