Right, so I’m still thinking about it from the “what it was built to optimize” step. You want to try to build the AGI to optimize for human values, right? So you do your best to explain to it what you mean by your human values. But then you fail at explaining and it starts optimizing something else instead.
But suppose the AGI is a super-intelligent human. Now you can just ask it to “optimize for human values” in those exact words (although you probably want to explain it a bit better, just to be on the safe side).
Right, so I’m still thinking about it from the “what it was built to optimize” step. You want to try to build the AGI to optimize for human values, right? So you do your best to explain to it what you mean by your human values. But then you fail at explaining and it starts optimizing something else instead.
But suppose the AGI is a super-intelligent human. Now you can just ask it to “optimize for human values” in those exact words (although you probably want to explain it a bit better, just to be on the safe side).