Nothing is ‘proven’ with respect to future systems; one merely presents arguments, and this post is a series of arguments toward the conclusion that alignment is a real, unsolved problem that does not go well by default.
Do you find the claim “ASI is very likely to pursue the wrong goals” particularly well supported by the arguments made in that section of the article? I personally see mainly arguments for why we can’t make it pursue our goals (which I agree with), but that is not the same thing as showing that ASI is unlikely to land on ‘good’ goals (for humans) on its own.
You have to weaken ‘incredibly sure’, or be talking about non-superintelligent systems, for this to go through.
Fair enough. ‘Incredibly’ is superlative enough to give the wrong impression. The thing is that whatever the corresponding number may be (except for 100%), the calculation would still have to compete with the calculation for a cooperative strategy, which may generally yield even more certainty of success and a higher expected value. I’m saying “may” here because I don’t know whether that is indeed the case. An argument for it would be that an antagonistic ASI that somehow fails risks the total annihilation of all civilization, and effectively of itself, possibly through an irrational humanity “taking it down with them”. The failure cases for a cooperative ASI, by contrast, are more along the lines of losing some years of progress by having to wait longer to achieve full power.
What does that mean? Consistently behaving such that you achieve a given end is our operationalization of ‘wanting’ that end. If future AIs consistently behave such that “significant power goes away from humans to ASI at some point”, this is consistent with our operationalization of ‘want’.
I worded it badly by omitting “destroy or enslave us”. The corrected version is: “Having said that, I would still consider it inevitable that all significant power goes away from humans to ASI at some point. The open question for me is not whether it could at some point destroy or enslave us, but how likely it is that it will want to.”
I would argue that more focus is warranted on values as being emergent, independent of whether they are in the training data in some form. The right to live, for instance, feels very fundamental morally, but it can also be seen as a necessary means for effective cooperation among individuals, and thus as something that will emerge in sufficiently advanced societies of a certain type. A fitting term for this would be moral convergent evolution.
It’s not a given that such a value would emerge in an ASI, since an ASI would be pretty dissimilar to a human. Nevertheless, I think it would be very interesting to try to analyze which values seem more likely to emerge in AGI and ASI coming to life among us, again (mostly) independent of training data and based just on evolutionary attractiveness.