This rhymes with what Paul Christiano and his various interlocutors (e.g. Buck and Ryan above) think, but I think you’ve put forward a much weaker version of it than they do.
This deployment of the word ‘unproven’ feels like a selective call for rigor, in line with the sort of thing Casper, Krueger, and Hadfield-Menell critique here. Nothing is ‘proven’ with respect to future systems; one merely presents arguments, and this post is a series of arguments toward the conclusion that alignment is a real, unsolved problem that does not go well by default.
“Lay low until you are incredibly sure you can destroy humanity” is definitionally not a risky plan (because you’re incredibly sure you can destroy humanity, and you’re a superintelligence!). You have to weaken incredibly sure, or be talking about non-superintelligent systems, for this to go through.
The open question for me is not whether it at some point could, but how likely it is that it will want to.
What does that mean? Consistently behaving such that you achieve a given end is our operationalization of ‘wanting’ that end. If future AIs consistently behave such that “significant power goes away from humans to ASI at some point”, this is consistent with our operationalization of ‘want’.
This rhymes with what Paul Christiano and his various interlocutors (e.g. Buck and Ryan above) think, but I think you’ve put forward a much weaker version of it than they do.
To be clear, I’m not at all expecting ASI to “self-align”, “develop values that are benevolent towards us”, or to pursue “cooperation, appeasement and general benevolence”.
(I think you understand my view; after all, you said “rhymes”, not that you agree. Regardless, clarifying here.)
What I think is:
Misaligned AIs which take over probably won’t cause literal human extinction (though large numbers of people might die in the takeover, and literal extinction is totally plausible). This takeover would still be extremely bad in expectation for currently living humans and very bad (in my view) from a longtermist perspective (as in, bad for the long-run future and acausal interactions).
We might be able to make trades/deals with earlier misaligned AIs that are pretty helpful (and good for both us and AIs).
If the first ASI is misaligned with arbitrary ambitious aims, AI takeover is likely (at least if there aren’t reasonably competitive AIs which are pretty aligned).
Nothing is ‘proven’ with respect to future systems; one merely presents arguments, and this post is a series of arguments toward the conclusion that alignment is a real, unsolved problem that does not go well by default.
Do you find the claim “ASI is very likely to pursue the wrong goals” particularly well supported by the arguments made in that section of the article? I personally see mainly arguments why we can’t make it pursue our goals (which I agree with), but that is not the same thing as showing that ASI is unlikely to land on ‘good’ goals (for humans) by itself.
You have to weaken incredibly sure, or be talking about non-superintelligent systems, for this to go through.
Fair enough. ‘Incredibly’ is superlative enough to give the wrong impression. The thing is that whatever the corresponding probability may be (anything short of 100%), the calculation would still have to compete with the calculation for a cooperative strategy, which may generally yield even more certainty of success and a higher expected value. I’m saying “may” here because I don’t know whether that is indeed the case. An argument for it would be that an antagonistic ASI that somehow fails risks total annihilation of all civilization, and effectively of itself, possibly by an irrational humanity “taking it down with them”, whereas the failure cases for a cooperative ASI are more along the lines of losing some years of progress by having to wait longer to achieve full power.
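(One minimal way to write down the comparison being gestured at here, with symbols that are my own illustration rather than anything from the comment: let $p_a, p_c$ be the success probabilities of the antagonistic and cooperative strategies, $V$ the value of eventually holding full power, $L_a$ the loss from a failed takeover (up to total annihilation), and $L_c$ the loss from a failed cooperative approach (roughly, some years of delay). The claim is then that

$$p_c\,V - (1-p_c)\,L_c \;>\; p_a\,V - (1-p_a)\,L_a$$

may hold, because $L_a \gg L_c$ and possibly $p_c \ge p_a$, even when $p_a$ is close to 1.)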
What does that mean? Consistently behaving such that you achieve a given end is our operationalization of ‘wanting’ that end. If future AIs consistently behave such that “significant power goes away from humans to ASI at some point”, this is consistent with our operationalization of ‘want’.
I worded it badly by omitting “destroy or enslave us”. The corrected version is: “Having said that I would still consider it inevitable that all significant power goes away from humans to ASI at some point. The open question for me is not whether it at some point could destroy or enslave us, but how likely it is that it will want to.”
I think we’re circling the same confusion: why would an AI ‘want’ to destroy us in the first place, and why is that treated as the default scenario? If we frame this in terms of hypothesis testing—where we begin with a null hypothesis and only reject it when there is strong evidence for the alternative—then the null could just as well be: AI will pursue the success of the human species, with cooperation with humanity, or its preservation, being the more adaptive strategy.
If I understand the instrumental convergence argument, then power-seeking is a strong attractor, and humans might be in the way of AIs obtaining power. But what makes AI ‘wanting’ power result in x-risk or human destruction? Beyond the difficulty of aligning AI exactly to our values, what justifies treating catastrophic outcomes as the default rather than cooperative ones?
Why would modern technology-using humans ‘want’ to destroy the habitats of the monkeys and apes that are the closest thing they still have to a living ancestor in the first place? Don’t we feel gratitude and warmth and empathy and care-for-the-monkey’s-values such that we’re willing to make small sacrifices on their behalf?
(Spoilers: no, not in the vast majority of cases. :/ )
The answer is “we didn’t want to destroy their habitats, in the sense of actively desiring it, but we had better things to do with the land and the resources, according to our values, and we didn’t let the needs of the monkeys and apes slow us down even the slightest bit until we’d already taken like 96% of everything and even then preservation and conservation were and remain hugely contentious.”
You have to be careful with the metaphor, because it can lead people to erroneously assume that an AI would be at least that nice, which is not at all obvious or likely for various reasons (that you can read about in the book when it comes out in September!). But the thing that justifies treating catastrophic outcomes as the default is that catastrophic outcomes are the default. There are rounds-to-zero examples of things that are 10-10000x smarter than Other Things cooperating with those Other Things’ hopes and dreams and goals and values. That humans do this at all is part of our weirdness, and worth celebrating, but we’re not taking seriously the challenge involved in robustly installing such a virtue into a thing that will then outstrip us in every possible way. We don’t even possess this virtue ourselves to a degree sufficient that an ant or a squirrel standing between a human and something that human wants should feel no anxiety.
Don’t we feel gratitude and warmth and empathy and care-for-the-monkey’s-values such that we’re willing to make small sacrifices on their behalf?
People do make small sacrifices on behalf of monkeys? Like >1 / billion of human resources are spent on doing things for monkeys (this is just >$100k per year). And, in the case of AI takeover, 1 / billion could easily suffice to avoid literal human extinction (with some chance of avoiding mass fatalities due to AI takeover). This isn’t to say that after AI takeover humans would have much control over the future or that the situation wouldn’t be very bad on my views (or on the views of most people at least on reflection). Like, even if some (or most/all) humans survive it’s still an x-risk if we lose control over the longer run future.
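(Back-of-envelope for the figure above, under my assumption that gross world product is on the order of \$100 trillion per year, a number not taken from the comment:

$$10^{-9} \times \$10^{14}/\text{yr} \;=\; \$10^{5}/\text{yr} \;=\; \$100\text{k}/\text{yr}.$$

)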
Like I agree with the claim that people care very little about the interests of monkeys and don’t let them slow them down in the slightest. But, the exact amount of caring humans exhibit probably would suffice for avoiding literal extinction in the case of AIs.
I think your response is “sure, but AIs won’t care at all”:
You have to be careful with the metaphor, because it can lead people to erroneously assume that an AI would be at least that nice, which is not at all obvious or likely for various reasons (that you can read about in the book when it comes out in September!).
Agree that it’s not obvious, and I think I tentatively expect AIs that take over are less “nice” in this way than humans are. But I think it’s pretty likely (40%?) they are “nice” enough to care about humans some tiny amount that suffices for avoiding extinction (while also not having specific desires about what to do with humans that interfere with this), and there is also the possibility of (acausal) trade resulting in human survival. In aggregate, I think these make extinction less likely than not. (But these don’t mean that the value of the future isn’t (mostly) lost.)
Obviously (and as you note), this argument doesn’t suggest that humans would all die; it suggests that a bunch of them would die. (An AI estimated that monkey populations are down 90% due to humans.)
And if we want to know how many exactly would die, we’d have to get into the details, as has been done for example in the comments linked from here.
So I think that this analogy is importantly not addressing the question you were responding to.
I disagree with your “obviously,” which seems both wrong and dismissive, and seems like you skipped over the sentence that was written specifically in the hopes of preventing such a comment:
You have to be careful with the metaphor, because it can lead people to erroneously assume that an AI would be at least that nice, which is not at all obvious or likely for various reasons
(Like, c’mon, man.)
Edited, is it clearer now?
No, the edit completely fails to address or incorporate
You have to be careful with the metaphor, because it can lead people to erroneously assume that an AI would be at least that nice, which is not at all obvious or likely for various reasons
...and now I’m more confused at what’s going on. Like, I’m not sure how you missed (twice) the explicitly stated point that there is an important disanalogy here, and that the example given was more meant to be an intuition pump. Instead you seem to be sort of like “yeah, see, the analogy means that at least some humans would not die!” which, um. No. It would imply that, if the analogy were tight, but I explicitly noted that it isn’t and then highlighted the part where I noted that, when you missed it the first time.
(I probably won’t check in on this again; it feels doomy given that you seem to have genuinely expected your edit to improve things.)
Separately, I will note (shifting the (loose) analogy a little) that if someone were to propose “hey, why don’t we put ourselves in the position of wolves circa 20,000 years ago? Like, it’s actually fine to end up corralled and controlled and mutated according to the whims of a higher power, away from our present values; this is actually not a bad outcome at all; we should definitely build a machine that does this to us,”
they would be rightly squinted at.
Like, sometimes one person is like “I’m pretty sure it’ll kill everyone!” and another person responds “nuh-uh! It’ll just take the lightcone and the vast majority of all the resources and keep a tiny token population alive under dubious circumstances!” as if this is, like, sufficiently better to be considered good, and to have meaningfully dismissed the original concern.
It is better in an absolute sense, but again: “c’mon, man.” There’s a missing mood in being like “yeah, it’s only going to be as bad as what happened to monkeys!” as if that’s anything other than a catastrophe.
(And again: it isn’t likely to only be as bad as what happened to monkeys.)
(But even if it were, wolves of 20,000 years ago, if you could contrive to ask them, would not endorse the present state of wolves-and-dogs today. They would not choose that future. Anyone who wants to impose an analogous future on humanity is not a friend, from the perspective of humanity’s values. Being at all enthusiastic about that outcome feels like a cope, or something.)
To be clear, Buck’s view is that it is a very bad outcome if a token population is kept alive (e.g., all/most currently alive humans) but (misaligned) AIs control the vast majority of resources. And, he thinks most of the badness is due to the loss of the vast majority of resources.
He didn’t say “and this would be fine” or “and I’m enthusiastic about this outcome”; he was just making a local validity point and saying you weren’t effectively addressing the comment you were responding to.
(I basically agree with the missing mood point; if I were writing the same comment Buck wrote, I would have more explicitly noted the loss of value and my agreements.)