Could you explain how partial alignment might be more valuable than partial corrigibility? This is undermined even by your own writing:
The tradeoff is starker between partial corrigibility and partial alignment. If the AI only partially shares your values, you may want to modify it to share your values more fully. But a partially-aligned AI has instrumental reasons to resist such modification — namely, that changing its values means its current values get optimised less. A corrigible AI, by contrast, can simply be instructed to accept new values or assist in its own modification.
Because of the concerns with corrigibility I listed. For example, if you are worried that your future instructions might be coerced, then you are less worried when the AI will resist those future instructions to modify its values.