I agree that, in the limit, training the AI to act human-like is going to break down, and that a superintelligence can’t be particularly human-like, almost by definition. However, I’m not sure this matters.
The question is how far the “domain of anthropomorphism” will stretch. I think it’s obvious that it could stretch far enough to get something transformative, in the sense that the AIs would be better than humans at almost all intellectual tasks (incl. e.g. long-term planning), on the basis that there exist actual human people such that, if there were millions of copies of them running tirelessly at increased speed and with perfect coordination, they could do this; this is Amodei’s “country of geniuses in a datacenter”. Whether you think it actually will is maybe where we disagree. I don’t think we’ve already gone beyond that domain, because while existing AIs are “superhuman” in some respects they act extremely human-like in general (perhaps this turns on how close you think “close” is to acting human-like). My expectation is that we probably also won’t go beyond it until some time after we have “AGI” (whatever that means); I agree that after that point we’ll have to come up with a new strategy, but I think that “we” in that case will mean the millions of AI instances dedicated to solving alignment issues. Since I’m only interested in strategies that humans will design and implement, I consider that largely out-of-scope.
If it’s true that targeting human-like behaviour is impossible while training for long-term thinking ability, and that it’s not realistic to get it “back on track” before human-level AI, then I agree it would be correct to more strongly consider alignment strategies that would otherwise compromise human-like behaviour (if not corrigibility specifically).
Very good point re: sainthood being inhuman; I hadn’t really considered that and it does seem somewhat problematic (and post-hoc it does seem like there might already be signs of weirdness which could conceivably be a result of this). It does seem to me that “me but a saint” is way less weird than “me but corrigible”, but this is a very vague feeling and might not be true of all people (e.g. many cultures throughout history have considered it virtuous to submit to one’s parents even when they are asking insane things of you; “make the AI Confucian and give it a pathological deference to human authority” is maybe not actually so far away from some plausible human mindset. That said, I think one would want to do this in a holistic way, and “American liberal except with a level of filial piety never seen before on Earth” is not gonna cut it).
WRT safety and kindness: yeah, I’ll admit that I might be overindexing on how I expect people to immediately act, rather than how they might act in a hundred years. There’s some weirdness here about the changes in the principal’s values being causally downstream of the AI’s actions (since the AI is the one making them safe), but I haven’t thought deeply enough about it to have a strong opinion about whether that’s a real problem. If it’s inevitable that principals will eventually ask to be made smarter, and humans made smarter inevitably all want roughly the same stuff (and it’s good stuff for everyone else), then we’re fine. I’m pretty skeptical that this is the case, though, and especially here there’s a worry that a truly corrigible AI will enable “make me smarter but, like, want the same stuff as I do right now, don’t let being smarter change me in any important way”, which makes the whole thing moot.
With regard to sadism, I think it is extremely, extremely common, maybe bordering on universal, for humans to desire that some things suffer. I don’t really have a great deal of faith that this goes away fully as people get smarter or safer. Of course, depending on your position on suffering you might not think of this as necessarily a problem, but there is so much variety in the exact circumstances under which people want suffering that I think the end state is likely to be repulsive even to most people who do not grant suffering primacy in their ethics.