Why might alignment beat corrigibility?

There’s a tradeoff between the goals of corrigibility (the AI obeys your future instructions) and alignment (the AI shares your current values). For example, if the AI shares your current values, it might disobey your future instructions when those instructions conflict with your values.
The tradeoff is starker between partial corrigibility and partial alignment. If the AI only partially shares your values, you may want to modify it to share your values more fully. But a partially-aligned AI has instrumental reasons to resist such modification — namely, that changing its values means its current values get optimised less. A corrigible AI, by contrast, can simply be instructed to accept new values or assist in its own modification.
Here are some concerns with corrigibility:
1. Value drift. Your future values might differ from your current values.
2. Coercion. You might be forced to give instructions you don’t endorse.
3. Deep mistakes. You might give instructions contrary to your values due to errors that persist even with AI assistance.
4. Unreliable communication. The channel by which you send instructions might distort them.
5. Instruction tampering. Someone might corrupt or fabricate your instructions entirely.
6. Incapacitation. You might become unable to give instructions at all, e.g. because you are dead.
7. Unreachable AI. The AI might be unable to receive instructions, e.g. because it is too far away or airgapped.
8. Mutual incomprehension. The AI might no longer understand the instructions you give.
Overall, I think I have shifted towards valuing alignment over corrigibility, and perhaps even partial alignment over partial corrigibility. I’m particularly moved by (2) coercion and (5) instruction tampering.
Could you explain how partial alignment might be more valuable than partial corrigibility? Your own writing seems to undermine this:
“The tradeoff is starker between partial corrigibility and partial alignment. If the AI only partially shares your values, you may want to modify it to share your values more fully. But a partially-aligned AI has instrumental reasons to resist such modification — namely, that changing its values means its current values get optimised less. A corrigible AI, by contrast, can simply be instructed to accept new values or assist in its own modification.”
Because of the concerns with corrigibility I listed. For example, if you are worried that your future instructions might be coerced, then you should be less worried that the AI will resist your future instructions to modify its values: that same resistance also protects against coerced modification.