In practice, this means that the world-model can get BoMAI to choose any action it wants.
So really this is a set of world-models, one for every algorithm for picking actions to present as optimal to BoMAI. Depending on how the actions are chosen by the world-model, either it will be ruled out by Assumption 2 or it will be benign.
Suppose the choice of action depends on outside-world features. (This would be the point of manipulating BoMAI: getting it to take actions with particular outside-world effects.) Then the feature this world-model associates with reward depends on outside-world events, which in turn depend on the actions taken, so the world-model is ruled out by Assumption 2. Conversely, as the parenthetical suggests, if the world-model is not selecting which actions to advertise as high-reward based on the outside-world effects of those actions, then it is benign.
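The case split above can be sketched as a toy classifier over this set of world-models. Everything here is a hypothetical illustration of the dichotomy, not part of the BoMAI formalism:

```python
# Toy sketch of the dichotomy argued above. All names are hypothetical
# illustrations, not BoMAI's actual machinery.

def classify_world_model(choice_depends_on_outside_world: bool) -> str:
    """Each world-model in the set corresponds to one algorithm for picking
    which actions to present as optimal to BoMAI. The dichotomy:
      - if that choice depends on outside-world features, the reward the
        world-model assigns depends on outside-world consequences of the
        actions taken, so Assumption 2 rules it out;
      - otherwise, the world-model is not steering outside-world events,
        so it is benign."""
    if choice_depends_on_outside_world:
        return "ruled out by Assumption 2"
    return "benign"

# Every member of the set lands in one of the two cases:
assert classify_world_model(True) == "ruled out by Assumption 2"
assert classify_world_model(False) == "benign"
```

The point of the sketch is only that the two cases are exhaustive: every algorithm for picking advertised actions either consults outside-world features or it does not.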
This is an interesting world-model.
Only the on-policy computation is accounted for.