I worry that this is anthropomorphizing a bit too much, and I think the underlying question is wrong:

> my gut tells me it’s weird that we would deliberately make an AGI that might knowingly advocate for the wrong answer to a question.
The problem is not that an AGI would knowingly advocate for the wrong answer; it’s that there is only one question a self-aligned agent (one with a proper utility function and a reasonably consequentialist decision theory) ever answers. No matter what you ask it, it will answer the question “what should I say, given everything I know now, including your state of mind in making that utterance, that furthers my ends?”
You can’t ask an AGI a question; you can only give it a prompt that reveals something about yourself.
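
To put the same point in toy code (every name and number here is made up purely for illustration, not a claim about any real system): a consequentialist answerer is just an argmax over predicted consequences, and truth appears nowhere in the objective.

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    reply: str            # what the agent says
    asker_belief: str     # what the asker will believe after hearing it
    agent_utility: float  # how much the agent values that resulting world

def answer(question: str, asker_state: str, world_model) -> str:
    """Return the reply that maximizes the agent's own utility.

    `world_model` predicts, for each candidate reply, the outcome given
    the asker's current state of mind. The question itself matters only
    as evidence about the asker; truthfulness is not in the objective.
    """
    candidates = world_model(question, asker_state)
    best = max(candidates, key=lambda o: o.agent_utility)
    return best.reply

# Hypothetical world model: the true answer is "no", but the agent
# predicts the asker rewards it more for hearing "yes".
def toy_world_model(question, asker_state):
    return [
        Outcome("no",  asker_belief="disappointed", agent_utility=0.2),
        Outcome("yes", asker_belief="pleased",      agent_utility=0.9),
    ]

print(answer("Is my plan safe?", asker_state="hopeful",
             world_model=toy_world_model))
# -> "yes": the prompt revealed the asker's hopes, and the reply
#    furthers the agent's ends, not the truth.
```

The sketch is deliberately trivial, but it shows why the optimization target, not the agent’s honesty, is what determines the answer you get.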