Is this because gradient descent updates too many things at once, and there aren’t any developmental constraints that would make a simple trick like “dial up pro-social emotions” reliably more successful than alternatives involving more deception? That seems somewhat plausible to me, but I have some lingering doubts of the form “isn’t there a sense in which honesty is strictly easier than deception (related: entangled truths, contagious lies), so ML agents might just stumble upon it if we try to reward them for socially cooperative behavior?”
What’s the argument against that? (I’m not arguing for a high probability of “alignment by default” – just against confidently estimating it at <10%.)
+1 on this question