only one round of initial question-answer still seems very bad to me. it’s very hard to get branch coverage of a brain. random data definitely won’t do it.
to be clear, the data isn’t what the AI uses to learn what the human says; the data is what the AI uses to know which thing in the world is the human, so that it can then, for example, simply ask or brain-scan the human, or learn what it’d do in the first iteration in any number of other ways.
note for posterity: @carado and I have talked about this at length since this post and I now am mostly convinced that this is workable. I would currently describe the question (slightly metaphorically) as an “intentional glitch token”, in that it is specifically designed to be a large random blob that cannot be inferred except by exploring, and which, since it gates all utility, causes the inner-aligned system to be extremely cautious.
I’ve been pondering that, and a thing I’ve been meaning to bring up and might as well mention in this comment is that this may cause an inner-aligned utility maximizer to sit around doing nothing forever out of an abundance of caution, since it can’t identify worlds where it can be sure it has correctly identified the configuration of the world that actually increases its utility function.