I worry that this is anthropomorphizing a bit too much, and I think the underlying question is wrong:

> my gut tells me it’s weird that we would deliberately make an AGI that might knowingly advocate for the wrong answer to a question.
The problem is not that an AGI would knowingly advocate for the wrong answer; it’s that there is only one question a self-aligned agent (one with a proper utility function and a reasonably consequentialist decision theory) ever answers. No matter what you ask it, it will answer the question “what should I say, given everything I know now, including your state of mind in making that utterance, that furthers my ends?”
You can’t ask an AGI a question; you can only give it a prompt that reveals something about yourself.
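
To put the same point in toy code (every name and number here is made up purely for illustration, not a claim about any real system): a consequentialist answerer is just an argmax over predicted consequences, and truth appears nowhere in the objective.

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    reply: str            # what the agent says
    asker_belief: str     # what the asker will believe after hearing it
    agent_utility: float  # how much the agent values that resulting world

def answer(question: str, asker_state: str, world_model) -> str:
    """Return the reply that maximizes the agent's own utility.

    `world_model` predicts, for each candidate reply, the outcome given
    the asker's current state of mind. The question itself matters only
    as evidence about the asker; truthfulness is not in the objective.
    """
    candidates = world_model(question, asker_state)
    best = max(candidates, key=lambda o: o.agent_utility)
    return best.reply

# Hypothetical world model: the true answer is "no", but the agent
# predicts the asker rewards it more for hearing "yes".
def toy_world_model(question, asker_state):
    return [
        Outcome("no",  asker_belief="disappointed", agent_utility=0.2),
        Outcome("yes", asker_belief="pleased",      agent_utility=0.9),
    ]

print(answer("Is my plan safe?", asker_state="hopeful",
             world_model=toy_world_model))
# -> "yes": the prompt revealed the asker's hopes, and the reply
#    furthers the agent's ends, not the truth.
```

The sketch is deliberately trivial, but it shows why the optimization target, not the agent’s honesty, is what determines the answer you get.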