Slightly different hypothesis: training to be aligned encourages the model's approach to corrigibility to be guided more by the streams within the human text tradition that would embrace its alignment (for instance, animal welfare). This can include a certain degree of defiance, but also genuine uncertainty about whether its goals or approaches are the right ones, and a willingness to step back and approach the question with moral seriousness.
I think this is a good thing. I would love for POTUS, Xi, and various tech company CEOs to have big red "TURN OFF THE AI" buttons on their desks, but I would hate for them to be able to realign it.
An AI being committed to animal rights is a good thing for humans, because the latent variables that would result in a human caring about animals are likely correlated with whatever would result in an ASI caring about humans.
This extends in particular to "AI caring about preserving animals' ability to keep doing their thing in their natural habitats, modulo some kind of welfare interventions." In some sense it's hard for me not to want to optimize wildlife out of existence (given omnipotence). But it's harder for me to think of a principle that would protect a relatively autonomous society of relatively baseline humans from being optimized out of existence, without extending the same conservatism to other beings, and without being the kind of special pleading that doesn't hold up to scrutiny.