Am I wrong to think that if someone presented you with an alignment proposal roughly as handwavy as your argument in this post (including the linked wiki page and follow-up comments), you would be annoyed and consider it basically worthless? If not, where’s the key asymmetry that means I should find it reassuring anyway?
I don’t find it that reassuring! But also, complete reassurance seems a bit mistaken to aim for here. The kind of decision I am talking about is high stakes on both sides, so there isn’t any particularly obvious conservative action to take (of course, I think the actual thing we should do is not build ASI and not put anyone in this position for a long while, but that’s not the point of the post).
I also certainly wouldn’t consider someone thinking or writing about an alignment proposal in a similar way annoying or worthless. If you have some that you could write up in a similar fashion and depth, please do!
Also, not sure what you mean by the “linked wiki page” being “handwavy”? I mean, CEV is kind of tricky, but I certainly wouldn’t describe the whole thing as “handwavy”?
I guess overall, isn’t… this whole website full of relatively early-stage alignment proposals explained usually at a much lower level of depth?
I’m not sure whether you would claim that you have made strong arguments against these possibilities, and I’m wrong not to be convinced, or whether you would agree that you’ve mainly gestured at your own reasons not to worry so much (at least relative to the risk of unaligned AI takeover).
The post’s opening line says it is a quick post that I hope someone else does a better job of sometime. I think it’s pretty reasonable to not be super compelled, and the topic certainly deserves a much longer and more thorough treatment.
Thanks for responding, and point taken that you don’t find it that reassuring and are okay with similarly incomplete alignment proposals.
When I called the CEV page ‘handwavy’, I didn’t mean it wasn’t a good-faith attempt to explain the concept. I think it’s handwavy relative to an account of exactly what it means, at the level required for me to understand how it would actually be implemented, why I should be happy with the consequences, and why I should expect it to emerge from the real-world process of a seemingly bad guy taking full control of an ASI. (Which I admit is a very high bar! But the stakes are high and my priors are low.)
When I called the CEV page ‘handwavy’, I didn’t mean it wasn’t a good-faith attempt to explain the concept. I think it’s handwavy relative to an account of exactly what it means, at the level required for me to understand how it would actually be implemented, why I should be happy with the consequences, and why I should expect it to emerge from the real-world process of a seemingly bad guy taking full control of an ASI. (Which I admit is a very high bar! But the stakes are high and my priors are low.)
I certainly would like to see much more work on CEV, though there are many things to do, and it’s been a good enough pointer for the purpose of many discussions like this as it is. But I certainly would not object to, and would be excited about, someone making more progress on fleshing it out.
Well, I think this post is substantially intended to engage with people’s priors. Possibly I expressed one of my points better in this comment:
But also, IDK, my sense is people really really like to imagine their enemies as more fundamentally evil than they are, and I doubt that evil and power correlate to the level where this would remotely explain why every nation will always villainize the leaders of its enemies. My best guess is Putin is like ~90th percentile evil, if one was to try to construct a linear scale here. Most of the bits of selection need to go into competence, not evilness. And so when attributing preferences like “he will reshape the universe to be filled with people he can terrorize”, I feel like “this is someone running the evil-sounding sentence generator” is much more likely than “this is an actually legitimate preference I expect him to have”.
I think my biggest worry is not that we’ll end up ruled by someone who is actively sadistic over the long term (though that does terrify me and I don’t think it’s out of the question), but that we’ll end up ruled by someone who is basically indifferent to the suffering of some subset of others. Which seems very plausible to me, because it doesn’t require them to be a cartoon sadistic villain or even a literal psychopath; they just need to have the same tendency toward limited moral concern as most actual humans have, and to retain it through whatever process of uplift they undergo when interacting with their ASI.
Hopefully (though again I’m not confident), most normal humans would widen their circle appropriately in a situation where they were facing no competitive pressures, meaningful scarcity, avoidable ignorance, etc. But if we do end up with a psychopath in charge, I don’t see why they would move from indifference to caring; basically, I wouldn’t expect the is-ought gap to be bridged by whatever new knowledge and intelligence they gained.
In that second case, it seems to me that we need a lot of optimistic assumptions to hold in order to avoid an s-risk style catastrophe. If the ruler simply doesn’t care about the suffering they cause to whichever conscious entities constitute their outgroup, then we only need one of efficiency/ignorance/aesthetic preference/curiosity/other to lean slightly in favour of the horrible thing in order for it to happen.
I think my biggest worry is not that we’ll end up ruled by someone who is actively sadistic over the long term (though that does terrify me and I don’t think it’s out of the question), but that we’ll end up ruled by someone who is basically indifferent to the suffering of some subset of others.
[...]
In that second case, it seems to me that we need a lot of optimistic assumptions to hold in order to avoid an s-risk style catastrophe. If the ruler really doesn’t care about the suffering they cause to whichever conscious entities constitute their outgroup, then we only need one of efficiency/ignorance/aesthetic preference/curiosity/other to lean slightly in favour of the horrible thing in order for it to happen.
I try to address this a bit in the post. I do think the default expectation is that complete indifference towards a certain class of person will just generalize to none of that kind of person existing. Why would they create lots of copies of things they don’t care about?
I’m thinking of animals too, and anything else conscious. So some possible reasons are the production of food and/or intelligence. (I know you sort of argued against the likely existence of suffering in those contexts, but not in enough detail for me to meaningfully update. And I find this point questionable:
even if suffering did show up in the optimal algorithm for some goal, it would take only cosmically minuscule amounts of caring-about-suffering to route around it, and a complete absence of that in humans with intact minds seems unlikely.
It would take only minuscule amounts of caring if the required efficiency sacrifice is minuscule and there are no other contrary motives. In any case, I don’t think “a complete absence of that in humans with intact minds” is sufficiently unlikely. Psychopaths exist, sadists exist, and if we end up with a psychopath in charge, I think it’s entirely plausible that their concern for at least some subset of other conscious entities remains zero or negative; I don’t think you’ve really argued against this.)
A preference for authentic natural environments, combined with indifference to animal suffering (or slight concern outweighed by other concerns), could also lead to the production of immense amounts of suffering forever.
edit: I think there might be too much of a values gap (in that I’m much more negative utilitarian than you) for me to agree with your overall position even if you managed to convince me on most of the factual questions. I take this paragraph to imply that you see the eternal torture of at least thousands (and perhaps some larger number fewer than trillions) of people as a price worth paying for a future that is otherwise not so bad:
But given the full cosmos to fill with goodness, or any appreciable fraction of it, I don’t think you’d spend much on torturing enemies. What’s the point? If you really hate Bob, you can keep Bob on old earth, tortured for eternity. If you have thousands of enemies, you can do that to all of them. But creating trillions of copies of Bob to torture requires a very specific mix of being wrong about game theory while taking an oddly enlightened perspective on other people’s values. Are you really even hurting Bob when you do this? Is that sound decision theory in a world where other people could have ended up inheriting the universe instead?
I know the amount of good stuff in this hypothetical future could be really, really big, and lots of people will think I’m just falling prey to scope insensitivity or something, but I’ve thought about this a lot and my considered position is that preventing the eternal torture is more important than bringing about the good stuff.
(I also don’t get the “a very specific mix of being wrong about game theory while taking an oddly enlightened perspective on other people’s values” part; it could be the simple fulfilment of a genuine sadistic preference.)
I think there might be too much of a values gap (in that I’m much more negative utilitarian than you) for me to agree with your overall position even if you managed to convince me on most of the factual questions.
Yep, I think that’s a very confused moral position! I could argue here against it (as a random example, think about whether you would prefer to live a life that is 99.99999999% great and fulfilling, but once in 10,000 years you would experience a single 100ms of torture, which I think is likely an underestimate of the actual ratios here), but it seems like a big topic.
Certainly if you are inclined to be a negative utilitarian then this post will not be very reassuring! Indeed almost any human-controlled future I think would end up looking quite bad, though it depends of course on whether you really are fully negative utilitarian.
I think that’s a very confused moral position! I could argue here against it (as a random example, think about whether you would prefer to live a life that is 99.99999999% great and fulfilling, but once in 10,000 years you would experience a single 100ms of torture)
This is only relevant given (at least) three assumptions, one about conscious experience and two about aggregation:
Being tortured for a long time and ‘tortured’ for 100ms differ only in length; there’s nothing in the experience of eternal (or very long) torture that distinguishes it from an infinite (or very large) number of isolated 100ms ‘tortures’
Good is separable (in the sense used by Broome) across time
Good is separable across people
If you’ve engaged seriously with this issue and are willing to write out an argument demonstrating that mine is a confused position, I will happily read and consider it! If not, I think you’re confusing “confused” with “disagrees with me on something I feel is obvious”.
(My position does require me to bite some actual bullets. But so does yours, and unless you’ve thought about this carefully enough to write about it for real, I suspect you’re underestimating how difficult it is to avoid all three of contradiction, vagueness, and weird/counterintuitive conclusions.)
Psychopaths exist, sadists exist, and if we end up with a psychopath in charge, I think it’s entirely plausible that their concern for at least some subset of other conscious entities remains zero or negative;
I do not think psychopaths of this form exist. I might be wrong, but I certainly don’t think the evidence I’ve seen suggests to me there is variation this deep in how humans care about things. Most things in biology are on a spectrum, I would be surprised if psychopathy is not one of those. I maybe should write a general post about “why I don’t believe in most neat psychopathologies”.
I do really wish this field of study was higher quality, and maybe I should do a deep dive and form a more consistent opinion on this. Every time I’ve dug into it I’ve been pretty deeply disappointed in what actual evidence we have for things like “there are people who intrinsically like hurting other people” and “there are people who are completely indifferent to the suffering of others”. It’s not that there is nothing, but it’s clear there is a demonization effect whenever you dive into the literature, where people really want to find categorically evil people, even if the evidence really doesn’t support that.
Edit: Oops, I read the original quote here as “it’s entirely plausible that their concern for other conscious entities remains zero or negative;”. I think it’s quite likely there are people who have zero concern for some other people. I don’t think there are people who have zero concern for all other people.
[Edited to add a trigger warning for “one of the worst examples of evil”.]
You’re obviously right that personality is on a spectrum, but there’s still a tail!
There are people who try to get children on the internet to send them embarrassing photos, then extort the child with the material to perform sex acts or sadistic acts with siblings and record video, escalating into increasingly more sadistic and power-tripping stuff (like cutting themselves and writing with blood), after each time lying about the last ask having been the last, until often the children involved commit suicide because it doesn’t stop.
You can read in prosecutions that the perpetrators communicate with each other about the pleasure they take in it. Whatever you want to call these people, “concern for some conscious entities is zero or negative” describes the situation accurately, and the original quote you’re replying to was about that, not about whether Hare’s checklist carves nature at its joints.
Most things in biology are on a spectrum, I would be surprised if psychopathy is not one of those.
One way to think of it is: there’s a spectrum of how Person A cares about Person B, and this spectrum goes from positive (compassion, desire to help) to neutral (callous indifference) to negative (schadenfreude, desire to pick a fight).
So “it’s a spectrum” is not in itself an argument for optimism here. (Or sorry if I’m misunderstanding.)
I maybe should write a general post about “why I don’t believe in most neat psychopathologies”. I do really wish this field of study was higher quality, and maybe I should do a deep dive and form a more consistent opinion on this…
In case it helps, my take on the psychopathy literature is mostly the same as it was 3 years ago when I wrote this comment.
Agree to disagree on that for now I guess! I’d be interested in that deep dive if you end up doing it, though.