I don’t think you can say that without first having defined what a “CEV” is. How do you know someone won’t just go insane in a CEV process? How do you know they don’t just get replaced by an amoral paperclipper-ASI in the CEV sim? If your proposed CEV process doesn’t have a specific reason to expect to be robust to that, it should be expected that many people, if run through your process, would produce nonsense outputs that are more or less literally a misaligned AI.
Also, you listed off the practical requirements of staying in power. Why should I expect that people who are able to exhibit those traits don’t have leakage from instrumental goal to terminal goal about those behaviors? Right now it looks to me like a lot of what’s wrong with the world is that sort of leakage from instrumental badness to terminal badness.
I don’t think you can say that without first having defined what a “CEV” is
I… link to the standard Arbital/LW-wiki page for it? I also separately define it in a paragraph. I am not saying that this is some kind of truly amazing definition that resolves all ambiguity or uncertainty around CEV proposals, but I don’t super want to rehash all of that conversation here.
If your proposed CEV process doesn’t have a specific reason to expect to be robust to that, it should be expected that many people, if run through your process, would produce nonsense outputs that are more or less literally a misaligned AI.
I don’t think this is true, and I haven’t heard this argument before, but separately that’s not really what this post is about. I agree that in as much as the best CEV mechanism we can come up with is one that randomly causes some people to become agents of a misaligned AI, that sucks, but it isn’t a reason to expect Vladimir Putin’s CEV in particular to be worse.
Right now it looks to me like a lot of what’s wrong with the world is that sort of leakage from instrumental badness to terminal badness.
I am not sure what you mean by this. I have no idea what my “terminal goals” are, and I think neither do you, or anyone else, so you must be using those words differently than I am using them.
I think a lot of common-sense principles of moral reflection would end up pushing pretty heavily against instrumental goals ending up as terminal goals in a dumb way. Humans do generally get bored of things, we try to imagine counterfactuals and we try to not update on randomly contingent facts. But it seems like you are trying to import some kind of empirical evidence here, and so my guess is something is going wrong earlier in the communication.
The standard Arbital/LW page does not define it mathematically in a way that we know to be semantically what that page’s current contents refer to. If it did we would be a lot closer to solving alignment! Speculating about what a CEV proposal would do without having one seems a bit silly to me. I’ve worked on actual CEV proposals (not particularly good ones, mind you, just things in the genre of paulboxing self-prompting chains in an HCH-ish structure) and one of the things that makes me say that they’re not good is that we can’t confidently expect them to match what the CEV definition on that page provides. But to discuss this sort of “is x person’s CEV good?” question you do actually need a precise definition of what the math of a CEV would be! It doesn’t make sense to discuss otherwise. And I don’t think anyone has a definition of a CEV that doesn’t boil down to “run a sim for a long time” in some way or another, which seems to me to have pretty severe failure modes, and not guarantee anything like your initial claim.
Sorry to pick on this post rather than some other random CEV-relying post, but it’s an ongoing issue with relying on CEV in one’s concepts that we don’t have an actual definition, just some awkward English we’re not sure how to cash out. Yes, it does seem like you should be correct under some ideal definition of CEV, but then you’d have to convince whichever dictator in question you mean to talk to that they should accept your chosen CEV process and not another. Which would be great! If we can show that there’s a CEV process that causes some reasonable form of psychological healing such that your original claim is true, then that would certainly be great. But it seems pretty easy to me to get hellworlds worse than an empty universe or just get an empty universe anyway, if your CEV math is wrong, such that your post ends up mostly turning on the CEV math and not very much on psychology.
For the psychology point I was making: I’ll replace “terminal goal” with “a goal that gets defended” (might be a narrower/broader concept than terminal), and an instrumental goal with “a goal that is allowed to vary” (might be a narrower/broader concept than instrumental). I’m saying that if I have correctly understood you to be claiming that dictators would not want the bad things they seem to enact now, if they didn’t still have pressure to keep wanting those things, then I want to respond that it seems quite plausible to me they would in fact defend their ruthlessness. There seem to me to be a lot of people who internalize valuing of ruthlessness and harm! It doesn’t seem like a particularly rare psychology.
I don’t super want to rehash all of that conversation here.
I mean, ok, but your post kind of fundamentally depends on it. I don’t think CEV is something with a natural abstraction.
I agree very much that discussions of CEV could do with a lot more precision. In recent comments I have been led to talk about Extrapolation Machinery and Value Extrapolation Procedures, so I could make the point that different VEPs will produce different outputs from the same input, and when people talk about CEV, they often have quite different VEPs in mind…
Also, excuse me for using this opportunity to speak in capitals, but DOES ANYONE KNOW WHAT BECAME OF JUNE KU, because metaethical.ai was a shockingly good attempt to formalize CEV.
CEV is not meant to depend on the state of human society. It is supposed to be derived from “human nature”, e.g. genetically determined needs, dispositions, norms and so forth, that are characteristic of our species as a whole
(This is false. CEV is a process that combines extrapolated volitions of individual humans, which is meant to depend fully on the state of every particular person and their wishes about how they wish to be extrapolated. See the value theory and the metaethics sequences, in particular, stuff like this, as well as the CEV Arbital page. E.g., CEV of humanity is plausibly very different from the CEV of ancient Greeks, who might even, on reflection, want to die gloriously in battles.)
CEV is a process that combines extrapolated volitions of individual humans, which is meant to depend fully on the state of every particular person and their wishes about how they wish to be extrapolated
I think it is unclear what the exact initial data are supposed to be, or needs to be.
The value system that CEV outputs is going to be abstract at some level. It won’t say directly “if someone has a toothache, fix the toothache”; that should follow from a more general principle, combined with the nature of toothaches. The same goes for the extrapolations of individuals and the aggregations of their preferences: the CEV value system in action has to care about particulars, but what it does with those particulars will be governed by an abstract definition.
The question is, what do we need to know about humanity, in order for CEV’s Value Extrapolation Procedure to arrive at the correct abstract definition? This is hard to answer if we don’t know what the VEP is in any detail. But finding a correct VEP is also part of the process.
Apparently a popular proposal for the VEP is something like “upload 10,000 philosophers and let them deliberate for as many subjective years as they need to solve all CEV’s problems and arrive at a consensus”, or similar proposals according to which there is a digital parliament of human proxies (e.g. Jan Leike’s “simulated deliberative democracy”).
I guess this defines a possible VEP, but I have long thought that a better VEP would involve theoretical identification of the existing “human decision procedure” (which I assume is a topic for cognitive neuroscience, and which in the individual is determined through a mix of genes, culture, and life incidents), and then extrapolating that. And again, the human decision procedure would in some way be a template, a schema whose details are “filled out” differently in different individuals (similar to how we learn the grammar and vocabulary of our native languages); and some of CEV’s extrapolation would depend on those details, some of it only on the structure of the schema.
You might even expect that Leike’s democracy would arrive at something like this, rather than just deciding everything via a vote among our extrapolated higher selves, forever. But then do you need the whole digression into upload societies devoted to the task of alignment? You just do AI-assisted neuroscience, figure out how human nature actually works, and “extrapolate” that.
Years ago, I thought that might be what would happen. Instead, the VEP that our frontier AI companies are employing is to engage in value learning from the training corpora, as part of general world-modeling, and then to refine and activate it with RLHF, constitutions, and so forth.
Am I wrong to think that if someone presented you with an alignment proposal roughly as handwavy as your argument in this post (including the linked wiki page and follow-up comments), you would be annoyed and consider it basically worthless? If not, where’s the key asymmetry that means I should find it reassuring anyway?
In the AI alignment case, I think I get why standards are so high: you expect that we’ll only have one chance, the solution has to actually be implemented rather than gestured at, failure means the loss of everything, and misplaced confidence could meaningfully increase risk.
It seems to me that the stakes are similarly high here, as we’re basically talking about someone elevating themself to god-emperor of the light cone. If it turns out that oops, they actually will retain and act on sadistic preferences, or the most efficient way to produce meat (or intelligence) actually does involve terrible suffering which they’ll be oblivious to or unconcerned about, then we could very easily end up with a world that is much worse than nothingness (by my values, and I think those of a significant number of reasonable people).
I’m not sure whether you would claim that you have made strong arguments against these possibilities and I’m wrong not to be convinced, or you would agree that you’ve mainly gestured at your own reasons not to worry so much (at least relative to the risk of unaligned AI takeover).
Am I wrong to think that if someone presented you with an alignment proposal roughly as handwavy as your argument in this post (including the linked wiki page and follow-up comments), you would be annoyed and consider it basically worthless? If not, where’s the key asymmetry that means I should find it reassuring anyway?
I don’t find it that reassuring! But also, complete reassurance seems a bit mistaken to aim for here. The kind of decision I am talking about is high stakes on both sides, so there isn’t any particularly obvious conservative action to take (of course, I think the actual thing we should do is not build ASI and not put anyone in this position for a long while, but that’s not the point of the post).
I also certainly wouldn’t consider someone thinking or writing about an alignment proposal in a similar way annoying or worthless. If you have some that you could write up in a similar fashion and depth, please do!
Also, not sure what you mean by the “linked wiki page” being “handwavy”? I mean, CEV is kind of tricky, but I certainly wouldn’t describe the whole thing as “handwavy”?
I guess overall, isn’t… this whole website full of relatively early-stage alignment proposals explained usually at a much lower level of depth?
I’m not sure whether you would claim that you have made strong arguments against these possibilities, and I’m wrong not to be convinced, or whether you would agree that you’ve mainly gestured at your own reasons not to worry so much (at least relative to the risk of unaligned AI takeover).
The post’s opening line says that this is a quick post and that I hope someone else does a better job of it sometime. I think it’s pretty reasonable to not be super compelled, and it certainly deserves a much longer and greater treatment.
Thanks for responding, and point taken that you don’t find it that reassuring and are okay with similarly incomplete alignment proposals.
When I called the CEV page ‘handwavy’, I didn’t mean it wasn’t a good-faith attempt to explain the concept. I think it’s handwavy relative to an account of exactly what it means, at the level required for me to understand how it would actually be implemented, why I should be happy with the consequences, and why I should expect it to emerge from the real-world process of a seemingly bad guy taking full control of an ASI. (Which I admit is a very high bar! But the stakes are high and my priors are low.)
When I called the CEV page ‘handwavy’, I didn’t mean it wasn’t a good-faith attempt to explain the concept. I think it’s handwavy relative to an account of exactly what it means, at the level required for me to understand how it would actually be implemented, why I should be happy with the consequences, and why I should expect it to emerge from the real-world process of a seemingly bad guy taking full control of an ASI. (Which I admit is a very high bar! But the stakes are high and my priors are low.)
I certainly would like to see much more work on CEV, though there are many things to do, and it’s been a good enough pointer for the purpose of many discussions like this as it is. But I certainly would not object, and would be excited, about someone making more progress on fleshing it out.
Well, I think this post is substantially intended to engage with people’s priors. Possibly I expressed one of my points better in this comment:
But also, IDK, my sense is people really really like to imagine their enemies as more fundamentally evil than they are, and I doubt that evil and power correlate to the level where this would remotely explain why every nation will always villainize the leaders of its enemies. My best guess is Putin is like ~90th percentile evil, if one was to try to construct a linear scale here. Most of the bits of selection need to go into competence, not evilness. And so when attributing preferences like “he will reshape the universe to be filled with people he can terrorize”, I feel like “this is someone running the evil-sounding sentence generator” is much more likely than “this is an actually legitimate preference I expect him to have”.
I think my biggest worry is not that we’ll end up ruled by someone who is actively sadistic over the long term (though that does terrify me and I don’t think it’s out of the question), but that we’ll end up ruled by someone who is basically indifferent to the suffering of some subset of others. Which seems very plausible to me, because it doesn’t require them to be a cartoon sadistic villain or even a literal psychopath; they just need to have the same tendency toward limited moral concern as most actual humans have, and to retain it through whatever process of uplift they undergo when interacting with their ASI.
Hopefully (though again I’m not confident), most normal humans would widen their circle appropriately in a situation where they were facing no competitive pressures, meaningful scarcity, avoidable ignorance, etc. But if we do end up with a psychopath in charge, I don’t see why they would move from indifference to caring; basically, I wouldn’t expect the is-ought gap to be bridged by whatever new knowledge and intelligence they gained.
In that second case, it seems to me that we need a lot of optimistic assumptions to hold in order to avoid an s-risk style catastrophe. If the ruler simply doesn’t care about the suffering they cause to whichever conscious entities constitute their outgroup, then we only need one of efficiency/ignorance/aesthetic preference/curiosity/other to lean slightly in favour of the horrible thing in order for it to happen.
I think my biggest worry is not that we’ll end up ruled by someone who is actively sadistic over the long term (though that does terrify me and I don’t think it’s out of the question), but that we’ll end up ruled by someone who is basically indifferent to the suffering of some subset of others.
[...]
In that second case, it seems to me that we need a lot of optimistic assumptions to hold in order to avoid an s-risk style catastrophe. If the ruler really doesn’t care about the suffering they cause to whichever conscious entities constitute their outgroup, then we only need one of efficiency/ignorance/aesthetic preference/curiosity/other to lean slightly in favour of the horrible thing in order for it to happen.
I try to address this a bit in the post. I do think the default expectation is that complete indifference towards a certain class of person will just generalize to none of that kind of person existing. Why would they create lots of copies of things they don’t care about?
I’m thinking of animals too, and anything else conscious. So some possible reasons are the production of food and/or intelligence. (I know you sort of argued against the likely existence of suffering in those contexts, but not in enough detail for me to meaningfully update. And I find this point questionable:
even if suffering did show up in the optimal algorithm for some goal, it would take only cosmically minuscule amounts of caring-about-suffering to route around it, and a complete absence of that in humans with intact minds seems unlikely.
It would take only minuscule amounts of caring if the required efficiency sacrifice is minuscule and there are no other contrary motives. In any case, I don’t think “a complete absence of that in humans with intact minds” is sufficiently unlikely. Psychopaths exist, sadists exist, and if we end up with a psychopath in charge, I think it’s entirely plausible that their concern for at least some subset of other conscious entities remains zero or negative; I don’t think you’ve really argued against this.)
A preference for authentic natural environments, combined with indifference to animal suffering (or slight concern outweighed by other concerns), could also lead to the production of immense amounts of suffering forever.
edit: I think there might be too much of a values gap (in that I’m much more negative utilitarian than you) for me to agree with your overall position even if you managed to convince me on most of the factual questions. I take this paragraph to imply that you see the eternal torture of at least thousands (and perhaps some larger number fewer than trillions) of people as a price worth paying for a future that is otherwise not so bad:
But given the full cosmos to fill with goodness, or any appreciable fraction of it, I don’t think you’d spend much on torturing enemies. What’s the point? If you really hate Bob, you can keep Bob on old earth, tortured for eternity. If you have thousands of enemies, you can do that to all of them. But creating trillions of copies of Bob to torture requires a very specific mix of being wrong about game theory while taking an oddly enlightened perspective on other people’s values. Are you really even hurting Bob when you do this? Is that sound decision theory in a world where other people could have ended up inheriting the universe instead?
I know the amount of good stuff in this hypothetical future could be really, really big, and lots of people will think I’m just falling prey to scope insensitivity or something, but I’ve thought about this a lot and my considered position is that preventing the eternal torture is more important than bringing about the good stuff.
(I also don’t get the “a very specific mix of being wrong about game theory while taking an oddly enlightened perspective on other people’s values” part; it could be the simple fulfilment of a genuine sadistic preference.)
I think there might be too much of a values gap (in that I’m much more negative utilitarian than you) for me to agree with your overall position even if you managed to convince me on most of the factual questions.
Yep, I think that’s a very confused moral position! I could argue here against it (as a random example, think about whether you would prefer to live a life that is 99.99999999% great and fulfilling, but once in 10,000 years you would experience a single 100ms of torture, which I think is likely an underestimate of the actual ratios here), but it seems like a big topic.
Certainly if you are inclined to be a negative utilitarian then this post will not be very reassuring! Indeed almost any human-controlled future I think would end up looking quite bad, though it depends of course on whether you really are fully negative utilitarian.
I think that’s a very confused moral position! I could argue here against it (as a random example, think about whether you would prefer to live a life that is 99.99999999% great and fulfilling, but once in 10,000 years you would experience a single 100ms of torture)
This is only relevant given (at least) three assumptions, one about conscious experience and two about aggregation:
Being tortured for a long time and ‘tortured’ for 100ms differ only in length; there’s nothing in the experience of eternal (or very long) torture that distinguishes it from an infinite (or very large) number of isolated 100ms ‘tortures’
Good is separable (in the sense used by Broome) across time
Good is separable across people
If you’ve engaged seriously with this issue and are willing to write out an argument demonstrating that mine is a confused position, I will happily read and consider it! If not, I think you’re confusing “confused” with “disagrees with me on something I feel is obvious”.
(My position does require me to bite some actual bullets. But so does yours, and unless you’ve thought about this carefully enough to write about it for real, I suspect you’re underestimating how difficult it is to avoid all three of contradiction, vagueness, and weird/counterintuitive conclusions.)
Psychopaths exist, sadists exist, and if we end up with a psychopath in charge, I think it’s entirely plausible that their concern for at least some subset of other conscious entities remains zero or negative;
I do not think psychopaths of this form exist. I might be wrong, but I certainly don’t think the evidence I’ve seen suggests to me there is variation this deep in how humans care about things. Most things in biology are on a spectrum, I would be surprised if psychopathy is not one of those. I maybe should write a general post about “why I don’t believe in most neat psychopathologies”.
I do really wish this field of study was higher quality, and maybe I should do a deep dive and form a more consistent opinion on this. Every time I’ve dug into it I’ve been pretty deeply disappointed in what actual evidence we have for things like “there are people who intrinsically like hurting other people” and “there are people who are completely indifferent to the suffering of others”. It’s not that there is nothing, but it’s clear there is a demonization effect whenever you dive into the literature, where people really want to find categorically evil people, even if the evidence really doesn’t support that.
Edit: Oops, I read the original quote here as “it’s entirely plausible that their concern for other conscious entities remains zero or negative;”. I think it’s quite likely there are people who have zero concern for some other people. I don’t think there are people who have zero concern for all other people.
[Edited to add a trigger warning for “one of the worst examples of evil”.]
You’re obviously right that personality is on a spectrum, but there’s still a tail!
There are people who try to get children on the internet to send them embarrassing photos, then extort the child with the material to perform sex acts or sadistic acts with siblings and record video, escalating into increasingly more sadistic and power-tripping stuff (like cutting themselves and writing with blood), after each time lying about the last ask having been the last, until often the children involved commit suicide because it doesn’t stop.
You can read in prosecutions that the perpetrators communicate with each other about the pleasure they take in it. Whatever you want to call these people, “concern for some conscious entities is zero or negative” describes the situation accurately, and the original quote you’re replying to was about that, not about whether Hare’s checklist carves nature at its joints.
Most things in biology are on a spectrum, I would be surprised if psychopathy is not one of those.
One way to think of it is: there’s a spectrum of how Person A cares about Person B, and this spectrum goes from positive (compassion, desire to help) to neutral (callous indifference) to negative (schadenfreude, desire to pick a fight).
So “it’s a spectrum” is not in itself an argument for optimism here. (Or sorry if I’m misunderstanding.)
I maybe should write a general post about “why I don’t believe in most neat psychopathologies”. I do really wish this field of study was higher quality, and maybe I should do a deep dive and form a more consistent opinion on this…
In case it helps, my take on the psychopathy literature is mostly the same as it was 3 years ago when I wrote this comment.
I don’t think you can say that without first having defined what a “CEV” is. How do you know someone won’t just go insane in a CEV process? how do you know they don’t just get replaced by an amoral paperclipper-ASI in the CEV sim? If your proposed CEV process doesn’t have a specific reason to expect to be robust to that, it should be expected that many people, if run through your process, would produce nonsense outputs that are more or less literally a misaligned AI.
Also, you listed off the practical requirements of staying in power. Why should I expect that people who are able to exhibit those traits don’t have leakage from instrumental goal to terminal goal about those behaviors? Right now it looks to me like a lot of what’s wrong with the world is that sort of leakage from instrumental badness to terminal badness.
I… link to the standard Arbital/LW-wiki page for it? I also separately define it in a paragraph. I am not saying that this is some kind of truly amazing definition that resolves all ambiguity or uncertainty around CEV proposals, but I don’t super want to rehash all of that conversation here.
I don’t think this is true, and I haven’t heard this argument before, but separately that’s not really what this post is about. I agree that in as much as the best CEV mechanism we can come up with is one that randomly causes some people to become agents of a misaligned AI, that sucks, but it isn’t a reason to expect Vladimir Putin’s CEV in particular to be worse.
I am not sure what you mean by this. I have no idea what my “terminal goals” are, and I think neither do you, or anyone else, so you must be using those words differently than I am using them.
I think a lot of common-sense principles of moral reflection would end up pushing pretty heavily against instrumental goals ending up as terminal goals in a dumb way. Humans do generally get bored of things, we try to imagine counterfactuals and we try to not update on randomly contingent facts. But it seems like you are trying to import some kind of empirical evidence here, and so my guess is something is going wrong earlier in the communication
the standard arbital/lw page does not define it mathematically in a way that we know to be semantically what that page’s current contents refer to. If it did we would be a lot closer to solving alignment! Speculating about what a CEV proposal would do without having one seems a bit silly to me. I’ve worked on actual CEV proposals (not particularly good ones, mind you, just things in the genre of paulboxing self-prompting chains in an HCH-ish structure) and one of the things that makes me say that they’re not good is that we can’t confidently expect them to match what the CEV definition on that page provides. But to discuss this sort of “is x person’s CEV good?” question you do actually need a precise definition of what the math of a CEV would be! It doesn’t make sense to discuss otherwise. And I don’t think anyone has a definition of a CEV that doesn’t boil down to “run a sim for a long time” in some way or another, which seems to me to have pretty severe failure modes, and not guarantee anything like your initial claim.
Sorry to pick on this post rather than some other random CEV-relying post, but it’s an ongoing issue with relying on CEV in one’s concepts that we don’t have an actual definition, just some awkward english we’re not sure how to cash out. Yes, it does seem like you should be correct under some ideal definition of CEV but then you’d have to convince whichever dictator in question you mean to talk to that they should accept your chosen CEV process and not another. Which would be great! If we can show that there’s a CEV process that causes some reasonable form of psychological healing such that your original claim is true, then that would certainly be great. But it seems pretty easy to me to get hellworlds worse than an empty universe or just get an empty universe anyway, if your CEV math is wrong, such that your post ends up mostly turning on the CEV math and not very much on psychology.
For the psychology point I was making: I’ll replace “terminal goal” with “a goal that gets defended” (might be a narrower/broader concept than terminal), and an instrumental goal with “a goal that is allowed to vary” (might be a narrower/broader concept than instrumental). I’m saying that if I have correctly understood you to be claiming that dictators would not want the bad things they seem to enact now, if they didn’t still have pressure to keep wanting those things, then I want to respond that it seems quite plausible to me they would in fact defend their ruthlessness. There seem to me to be a lot of people who internalize valuing of ruthlessness and harm! It doesn’t seem like a particularly rare psychology.
I mean, ok, but your post kind of fundamentally depends on it. I don’t think CEV is something with a natural abstraction.
I agree very much that discussions of CEV could do with a lot more precision. In recent comments I have been led to talk about Extrapolation Machinery and Value Extrapolation Procedures, so I could make the point that different VEPs will produce different outputs from the same input, and when people talk about CEV, they often have quite different VEPs in mind…
Also, excuse me for using this opportunity to speak in capitals, but DOES ANYONE KNOW WHAT BECAME OF JUNE KU, because metaethical.ai was a shockingly good attempt to formalize CEV.
(This is false. CEV is a process that combines extrapolated volitions of individual humans, which is meant to depend fully on the state of every particular person and their wishes about how they wish are to be extrapolated. See the value theory and the metaethics sequences, in particular, stuff like this, as well as the CEV Arbital page. E.g., CEV of humanity is plausibly very different from the CEV of ancient Greeks, who might even, on reflection, want to die gloriously in battles.)
I think it is unclear what the exact initial data are supposed to be, or needs to be.
The value system that CEV outputs is going to be abstract at some level. It won’t say directly “if someone has a toothache, fix the toothache”; that should follow from a more general principle, combined with the nature of toothaches. The same goes for the extrapolations of individuals and the aggregations of their preferences: the CEV value system in action has to care about particulars, but what it does with those particulars, will be governed by an abstract definition.
The question is, what do we need to know about humanity, in order for CEV’s Value Extrapolation Procedure to arrive at the correct abstract definition? This is hard to answer if we don’t know what the VEP is in any detail. But finding a correct VEP is also part of the process.
Apparently a popular proposal for the VEP is something like “upload 10,000 philosophers and let them deliberate for as many subjective years as they need to solve all CEV’s problems and arrive at a consensus”, or similar proposals according to which there is a digital parliament of human proxies (e.g. Jan Leike’s “simulated deliberative democracy”).
I guess this defines a possible VEP, but I have long thought that a better VEP would involve theoretical identification of the existing “human decision procedure” (which I assume is a topic for cognitive neuroscience, and which in the individual is determined through a mix of genes, culture, and life incidents), and then extrapolating that. And again, the human decision procedure would in some way be a template, a schema whose details are “filled out” differently in different individuals (similar to how we learn the grammar and vocabulary of our native languages); and some of CEV’s extrapolation would depend on those details, some of it only on the structure of the schema.
You might even expect that Leike’s democracy would arrive at something like this, rather than just deciding everything via a vote among our extrapolated higher selves, forever. But then do you need the whole digression into upload societies devoted to the task of alignment? You just do AI-assisted neuroscience, figure out how human nature actually works, and “extrapolate” that.
Years ago, I thought that might be what would happen. Instead, the VEP that our frontier AI companies are employing is to engage in value learning from the training corpora, as part of general world-modeling, and then to refine and activate it with RLHF, constitutions, and so forth.
Am I wrong to think that if someone presented you with an alignment proposal roughly as handwavy as your argument in this post (including the linked wiki page and follow-up comments), you would be annoyed and consider it basically worthless? If not, where’s the key asymmetry that means I should find it reassuring anyway?
In the AI alignment case, I think I get why standards are so high: you expect that we’ll only have one chance, the solution has to actually be implemented rather than gestured at, failure means the loss of everything, and misplaced confidence could meaningfully increase risk.
It seems to me that the stakes are similarly high here, as we’re basically talking about someone elevating themself to god-emperor of the light cone. If it turns out that oops, they actually will retain and act on sadistic preferences, or the most efficient way to produce meat (or intelligence) actually does involve terrible suffering which they’ll be oblivious to or unconcerned about, then we could very easily end up with a world that is much worse than nothingness (by my values, and I think those of a significant number of reasonable people).
I’m not sure whether you would claim that you have made strong arguments against these possibilities and I’m wrong not to be convinced, or you would agree that you’ve mainly gestured at your own reasons not to worry so much (at least relative to the risk of unaligned AI takeover).
I don’t find it that reassuring! But also, complete reassurance seems a bit mistaken to aim for here. The kind of decision I am talking about is high stakes on both sides, so there isn’t any particularly obvious conservative action to take (of course, I think the actual thing we should do is not build ASI and not put anyone in this position for a long while, but that’s not the point of the post).
I also certainly wouldn’t consider someone thinking or writing about an alignment proposal in a similar way annoying or worthless. If you have some that you could write up in a similar fashion and depth, please do!
Also, not sure what you mean by the “linked wiki page” being “handwavy”? I mean, CEV is kind of tricky, but I certainly wouldn’t describe the whole thing as “handwavy”?
I guess overall, isn’t… this whole website full of relatively early-stage alignment proposals explained usually at a much lower level of depth?
The post’s opening line says that this is a quick post, which I hope someone else does a better job of sometime. I think it’s pretty reasonable to not be super compelled, and the topic certainly deserves a much longer and more thorough treatment.
Thanks for responding, and point taken that you don’t find it that reassuring and are okay with similarly incomplete alignment proposals.
When I called the CEV page ‘handwavy’, I didn’t mean it wasn’t a good-faith attempt to explain the concept. I think it’s handwavy relative to an account of exactly what it means, at the level required for me to understand how it would actually be implemented, why I should be happy with the consequences, and why I should expect it to emerge from the real-world process of a seemingly bad guy taking full control of an ASI. (Which I admit is a very high bar! But the stakes are high and my priors are low.)
I certainly would like to see much more work on CEV, though there are many things to do, and it’s been a good enough pointer for the purpose of many discussions like this as it is. But I certainly would not object to, and would be excited about, someone making more progress on fleshing it out.
Well, I think this post is substantially intended to engage with people’s priors. Possibly I expressed one of my points better in this comment:
I think my biggest worry is not that we’ll end up ruled by someone who is actively sadistic over the long term (though that does terrify me and I don’t think it’s out of the question), but that we’ll end up ruled by someone who is basically indifferent to the suffering of some subset of others. Which seems very plausible to me, because it doesn’t require them to be a cartoon sadistic villain or even a literal psychopath; they just need to have the same tendency toward limited moral concern as most actual humans have, and to retain it through whatever process of uplift they undergo when interacting with their ASI.
Hopefully (though again I’m not confident), most normal humans would widen their circle appropriately in a situation where they were facing no competitive pressures, meaningful scarcity, avoidable ignorance, etc. But if we do end up with a psychopath in charge, I don’t see why they would move from indifference to caring; basically, I wouldn’t expect the is-ought gap to be bridged by whatever new knowledge and intelligence they gained.
In that second case, it seems to me that we need a lot of optimistic assumptions to hold in order to avoid an s-risk style catastrophe. If the ruler simply doesn’t care about the suffering they cause to whichever conscious entities constitute their outgroup, then we only need one of efficiency/ignorance/aesthetic preference/curiosity/other to lean slightly in favour of the horrible thing in order for it to happen.
I try to address this a bit in the post. I do think the default expectation is that complete indifference towards a certain class of person will just generalize to none of that kind of person existing. Why would they create lots of copies of things they don’t care about?
I’m thinking of animals too, and anything else conscious. So some possible reasons are the production of food and/or intelligence. (I know you sort of argued against the likely existence of suffering in those contexts, but not in enough detail for me to meaningfully update. And I find this point questionable:
It would take only minuscule amounts of caring if the required efficiency sacrifice is minuscule and there are no other contrary motives. In any case, I don’t think “a complete absence of that in humans with intact minds” is sufficiently unlikely. Psychopaths exist, sadists exist, and if we end up with a psychopath in charge, I think it’s entirely plausible that their concern for at least some subset of other conscious entities remains zero or negative; I don’t think you’ve really argued against this.)
A preference for authentic natural environments, combined with indifference to animal suffering (or slight concern outweighed by other concerns), could also lead to the production of immense amounts of suffering forever.
edit: I think there might be too much of a values gap (in that I’m much more negative utilitarian than you) for me to agree with your overall position even if you managed to convince me on most of the factual questions. I take this paragraph to imply that you see the eternal torture of at least thousands (and perhaps some larger number fewer than trillions) of people as a price worth paying for a future that is otherwise not so bad:
I know the amount of good stuff in this hypothetical future could be really, really big, and lots of people will think I’m just falling prey to scope insensitivity or something, but I’ve thought about this a lot and my considered position is that preventing the eternal torture is more important than bringing about the good stuff.
(I also don’t get the “a very specific mix of being wrong about game theory while taking an oddly enlightened perspective on other people’s values” part; it could be the simple fulfilment of a genuine sadistic preference.)
Yep, I think that’s a very confused moral position! I could argue here against it (as a random example, think about whether you would prefer to live a life that is 99.99999999% great and fulfilling, but once in 10,000 years you would experience a single 100ms of torture, which I think is likely an underestimate of the actual ratios here), but it seems like a big topic.
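For anyone who wants to check the ratio in that example, here is a minimal sketch of the arithmetic. It assumes a Julian year of 365.25 days, and the function name is purely illustrative. Taken literally, 100ms per 10,000 years is a much smaller share of a life than the one-in-ten-billion implied by "99.99999999% great", so the percentage in the example is a loose rounding rather than an exact conversion.

```python
# Back-of-the-envelope check of "100ms of torture once in 10,000 years".
# Assumption: a Julian year of 365.25 days; nothing here comes from the post.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def torture_fraction(torture_seconds: float, period_years: float) -> float:
    """Fraction of total time spent in torture over one period."""
    return torture_seconds / (period_years * SECONDS_PER_YEAR)


# Share of time implied by 100ms every 10,000 years.
fraction = torture_fraction(0.1, 10_000)

# Share of "not great" time implied by a life that is 99.99999999% great.
stated_bad_fraction = 1 - 0.9999999999

print(f"fraction of time in torture: {fraction:.3e}")
print(f"bad fraction implied by the stated percentage: {stated_bad_fraction:.3e}")
```

The point of the thought experiment does not depend on the exact figure; the sketch only shows the order of magnitude being discussed.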
Certainly if you are inclined to be a negative utilitarian then this post will not be very reassuring! Indeed, I think almost any human-controlled future would end up looking quite bad, though it depends of course on whether you really are fully negative utilitarian.
This is only relevant given (at least) three assumptions, one about conscious experience and two about aggregation:
1. Being tortured for a long time and ‘tortured’ for 100ms differ only in length; there’s nothing in the experience of eternal (or very long) torture that distinguishes it from an infinite (or very large) number of isolated 100ms ‘tortures’.
2. Good is separable (in the sense used by Broome) across time.
3. Good is separable across people.
If you’ve engaged seriously with this issue and are willing to write out an argument demonstrating that mine is a confused position, I will happily read and consider it! If not, I think you’re confusing “confused” with “disagrees with me on something I feel is obvious”.
(My position does require me to bite some actual bullets. But so does yours, and unless you’ve thought about this carefully enough to write about it for real, I suspect you’re underestimating how difficult it is to avoid all three of contradiction, vagueness, and weird/counterintuitive conclusions.)
I do not think psychopaths of this form exist. I might be wrong, but I certainly don’t think the evidence I’ve seen suggests to me there is variation this deep in how humans care about things. Most things in biology are on a spectrum; I would be surprised if psychopathy were not one of those. I maybe should write a general post about “why I don’t believe in most neat psychopathologies”.
I do really wish this field of study was higher quality, and maybe I should do a deep dive and form a more consistent opinion on this. Every time I’ve dug into it I’ve been pretty deeply disappointed in what actual evidence we have for things like “there are people who intrinsically like hurting other people” and “there are people who are completely indifferent to the suffering of others”. It’s not that there is nothing, but it’s clear there is a demonization effect whenever you dive into the literature, where people really want to find categorically evil people, even if the evidence really doesn’t support that.
Edit: Oops, I read the original quote here as “it’s entirely plausible that their concern for other conscious entities remains zero or negative;”. I think it’s quite likely there are people who have zero concern for some other people. I don’t think there are people who have zero concern for all other people.
[Edited to add a trigger warning for “one of the worst examples of evil”.]
You’re obviously right that personality is on a spectrum, but there’s still a tail!
There are people who try to get children on the internet to send them embarrassing photos, then extort the child with the material to perform sex acts or sadistic acts with siblings and record video, escalating into increasingly more sadistic and power-tripping stuff (like cutting themselves and writing with blood), after each time lying about the last ask having been the last, until often the children involved commit suicide because it doesn’t stop.
You can read in prosecutions that the perpetrators communicate with each other about the pleasure they take in it. Whatever you want to call these people, “concern for some conscious entities is zero or negative” describes the situation accurately, and the original quote you’re replying to was about that, not about whether Hare’s checklist carves nature at its joints.
One way to think of it is: there’s a spectrum of how Person A cares about Person B, and this spectrum goes from positive (compassion, desire to help) to neutral (callous indifference) to negative (schadenfreude, desire to pick a fight).
So “it’s a spectrum” is not in itself an argument for optimism here. (Or sorry if I’m misunderstanding.)
In case it helps, my take on the psychopathy literature is mostly the same as it was 3 years ago when I wrote this comment.
Agree to disagree on that for now I guess! I’d be interested in that deep dive if you end up doing it, though.