note for posterity: @carado and I have talked about this at length since this post, and I am now mostly convinced that this is workable. I would currently describe the idea (slightly metaphorically) as an “intentional glitch token”: it is specifically designed to be a large random blob that cannot be inferred except by exploring, and which, since it gates all utility, causes the inner-aligned system to be extremely cautious.
I’ve been pondering that, and a point I have been meaning to bring up, and might as well mention in this comment, is that this may cause an inner-aligned utility maximizer to sit around doing nothing forever out of an abundance of caution, since it cannot identify worlds in which it can be sure of finding the configuration of the world that actually increases its utility function.