The key point is that AUP_conceptual relaxes the problem:
If we could robustly penalize the agent for intuitively perceived gains in power (whatever that means), would that solve the problem?
This is not trivial.
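(For concreteness, the non-relaxed counterpart, as I understand the Conservative Agency paper, penalizes shifts in attainable Q-values for a set of auxiliary reward functions $R_i$ relative to inaction $\varnothing$, roughly and up to scaling:

$$R_{\text{AUP}}(s,a) \;:=\; R(s,a) \;-\; \frac{\lambda}{|\mathcal{R}|} \sum_{R_i \in \mathcal{R}} \big| Q^*_{R_i}(s,a) - Q^*_{R_i}(s,\varnothing) \big|$$

AUP_conceptual swaps this concrete penalty for the intuitive notion of “gains in power”.)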
Probably I’m just missing something, but I don’t see why you couldn’t say something similar about:
“preserve human autonomy”, “be nice”, “follow norms”, “do what I mean”, “be corrigible”, “don’t do anything I wouldn’t do”, “be obedient”
E.g.
If we could robustly reward the agent for intuitively perceived nice actions (whatever that means), would that solve the problem?
It seems like the main difference is that, for power in particular, there’s more hope that we could formalize it without reference to humans (which seems harder to do for e.g. “niceness”), but then my original point applies.
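To gesture at why power seems more formalizable: here is a minimal sketch, assuming the POWER-style definition from Turner et al.’s “Optimal Policies Tend to Seek Power”, where a state’s power is roughly its normalized average optimal value over a distribution of reward functions. The toy MDP, and names like `T`, `optimal_values`, and `power`, are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy deterministic MDP, made up purely for illustration.
n_states, n_actions, gamma = 5, 2, 0.9
T = rng.integers(0, n_states, size=(n_states, n_actions))  # T[s, a] -> next state

def optimal_values(reward):
    """Value iteration for a state-based reward; returns V*(s) for each state."""
    V = np.zeros(n_states)
    for _ in range(2000):
        # Q(s, a) = r(s) + gamma * V(T[s, a]); act greedily over actions.
        V_new = reward + gamma * np.max(V[T], axis=1)
        if np.max(np.abs(V_new - V)) < 1e-10:
            break
        V = V_new
    return V

def power(state, n_samples=500):
    """POWER-style proxy: normalized average optimal value at `state`,
    averaged over reward functions drawn uniformly at random.
    Note that no human preference information appears anywhere here."""
    total = 0.0
    for _ in range(n_samples):
        r = rng.uniform(0.0, 1.0, size=n_states)  # i.i.d. reward per state
        V = optimal_values(r)
        total += (1 - gamma) / gamma * (V[state] - r[state])
    return total / n_samples

print([round(power(s), 3) for s in range(n_states)])
```

Everything above is defined from the environment’s dynamics alone – no model of human values appears – which is the sense in which power looks easier to formalize than “niceness”.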
(This discussion was continued privately – to clarify, I was narrowly arguing that AUP_conceptual is correct, but that this should only provide a mild update in favor of implementations working in the superintelligent case.)