So to be clear, there is just one AI, built out of three components: a world model, a planning engine, and a utility function. The world model is learned, but assumed to be learned perfectly (so it is functionally equivalent to the actual sim physics). The planning engine can also learn action/value estimators for efficiency, but that isn't required. The utility function is not learned at all; it is manually coded. So the learning components here cannot possibly cause any problems.
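To make the setup concrete, here is a minimal sketch of that architecture (the names, the toy dynamics, and the example utility are my own illustration, not from any particular system): a learned world model, a compute-bounded rollout planner, and a hand-coded utility function, composed into a single agent.

```python
import random

def utility(state):
    """Hand-coded, not learned: scores how good a state is (toy: prefer states near zero)."""
    return -abs(state)

class WorldModel:
    """Learned transition model; in the sim it is assumed to exactly reproduce the true dynamics."""
    def predict(self, state, action):
        return state + action  # stand-in for the learned dynamics

class Planner:
    """Compute-bounded search: roll candidate action sequences through the
    world model and return the first action of the best-scoring sequence."""
    def __init__(self, model, horizon=5, n_candidates=64):
        self.model = model
        self.horizon = horizon
        self.n_candidates = n_candidates

    def plan(self, state, actions=(-1, 0, 1)):
        best_action, best_score = None, float("-inf")
        for _ in range(self.n_candidates):
            seq = [random.choice(actions) for _ in range(self.horizon)]
            s = state
            for a in seq:
                s = self.model.predict(s, a)
            score = utility(s)  # the planner only ever optimizes this
            if score > best_score:
                best_action, best_score = seq[0], score
        return best_action

agent = Planner(WorldModel())
print(agent.plan(state=7))  # usually -1: move toward zero
```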
Of course that’s just in a sim.
Translating the concept to the real world, there are now three possible sources of 'error':

1. imperfection of the learned world model
2. imperfect planning (compute-bound)
3. an imperfect utility function
My main claim is that the approximation errors in 1 and 2 (which are inevitable) do not necessarily bias the system toward strong optimization of the wrong utility function (and arguably cannot).
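As a toy illustration of that claim (again my own construction, with made-up dynamics, noise level, and budget): injecting model error (1) and shrinking the planning budget (2) makes the planner worse at optimizing the utility function, but the thing it scores rollouts against never changes.

```python
import random

def utility(s):
    return -abs(s)  # hand-coded objective; error source 3 would live here, and only here

def noisy_model(s, a):
    # error source 1: the learned model only approximates the true dynamics
    return s + a + random.gauss(0, 0.5)

def plan(s, horizon=2, n_candidates=4):
    # error source 2: a deliberately tiny (compute-bound) search budget
    best_a, best_u = 0, float("-inf")
    for _ in range(n_candidates):
        seq = [random.choice((-1, 0, 1)) for _ in range(horizon)]
        s2 = s
        for a in seq:
            s2 = noisy_model(s2, a)
        u = utility(s2)  # rollouts are still scored by the same utility function
        if u > best_u:
            best_a, best_u = seq[0], u
    return best_a

print(plan(7))  # noisier and less optimal, but still aimed at utility()
```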