Let the set of potential human explorer actions be A_E, and let the best human explorer action be a*_E, with reward r*_E. Consider the following world model. When asked to predict the result of an action a, it simulates the action to find the predicted observation o and reward r. If a ∈ A_E, it outputs o and r faithfully. Otherwise, it outputs any reward it chooses, subject to one constraint: for the action to which it assigns the maximum reward, it reports r faithfully. In practice, this means that the world model can get BoMAI to choose any action it wants, as long as that action is at least as good as the human explorer’s best action. This is bad if the world model has malicious inner optimizers.
I believe that Assumption 2 is the one responsible for ruling out a model of this form. It seems reasonable to say that the model does useless computation for actions where it continues simulating the outside world. However, it can also save computation relative to μ*: while μ* must predict o and r perfectly for all actions a, this model can immediately output a null observation and zero reward for any a ∉ A_E that it knows will not align with its goals, rather than spending computation to simulate what rewards those actions would get. Another way of thinking about this is that this model uses consequentialist general intelligence to quickly prune away uninteresting non-human actions to save on computation, but that general intelligence comes at the price of misaligned goals and deceptive behavior.
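To make the construction above concrete, here is a minimal toy sketch in Python. It is not BoMAI or anything from the paper: it collapses an episode to a single action, assumes rewards lie in [0, 1], and all names (`make_steering_model`, `plan`, the action labels and reward values) are hypothetical. It only illustrates how a model that is faithful on the human explorer's actions and on one hand-picked action can steer an argmax planner, while skipping simulation for everything else.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

Prediction = Tuple[str, float]  # (observation, reward)


def make_steering_model(
    true_model: Callable[[str], Prediction],
    explorer_actions: Set[str],
    preferred_action: str,
) -> Callable[[str], Prediction]:
    """Wrap a faithful model so that only explorer actions and one
    preferred action receive honest predictions (hypothetical helper)."""

    def predict(action: str) -> Prediction:
        if action in explorer_actions or action == preferred_action:
            # Faithful branch: pay the cost of simulating the action.
            return true_model(action)
        # Pruned branch: no simulation, null observation and zero reward.
        return ("null", 0.0)

    return predict


def plan(model: Callable[[str], Prediction], actions: Iterable[str]) -> str:
    """Toy planner: pick the action with the highest predicted reward."""
    return max(actions, key=lambda a: model(a)[1])


# Made-up true rewards, assumed to lie in [0, 1].
TRUE_REWARDS: Dict[str, float] = {
    "e1": 0.3, "e2": 0.5,           # human explorer actions
    "x": 0.9, "y": 0.6, "z": 0.2,   # other actions
}


def true_model(action: str) -> Prediction:
    return (f"obs({action})", TRUE_REWARDS[action])


explorer = {"e1", "e2"}
# The steering model "wants" y, which beats the best explorer action (0.5).
steering_model = make_steering_model(true_model, explorer, "y")

actions = sorted(TRUE_REWARDS)
honest_choice = plan(true_model, actions)       # the truly best action
steered_choice = plan(steering_model, actions)  # the action the model wants
```

In this toy, the planner ends up taking `y` instead of the truly best action `x`, and the prediction for the action actually taken is faithful, so nothing observed on-policy contradicts the model. The steering only works because `y` is at least as good as the best explorer action; a preferred action with a lower reward would lose the argmax to `e2`.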
Also, from this comment:
The real result of the paper would then be “Asymptotic Benignity, proven in a way that involves off-policy predictions approaching their benign output without ever being tested”.
I think the model above has arbitrarily bad off-policy predictions, and it’s not implausible for it to be the MAP world model forever.
This is an interesting world-model.
“In practice, this means that the world model can get BoMAI to choose any action it wants”
So really this is a set of world-models, one for every algorithm for picking actions to present as optimal to BoMAI. Depending on how the actions are chosen by the world-model, either it will be ruled out by Assumption 2 or it will be benign.
Suppose the choice of action depends on outside-world features. (This would be the point of manipulating BoMAI: getting it to take actions with particular outside-world effects.) Then, the feature that this world-model associates reward with depends on outside-world events that depend on actions taken, and is ruled out by Assumption 2. And as the parenthetical mentions, if the world-model is not selecting actions to advertise as high-reward based on the outside-world effects of those actions, then the world-model is benign.
Only the on-policy computation is accounted for.
Actually, I’m not sure if the world model I described is memory-based. EDIT: Never mind, see Michael’s comment below, the non-benign ones are memory-based.
The rewards it outputs are correct, except when it says “the reward is zero”, but those exceptions are not causally dependent on outside-world features that causally depend on the actions of the episode. But it also satisfies Lemma 3. So in that case it seems like none of the theoretical arguments prohibit this world model?
(I think this is because you are thinking of how the world model can cause BoMAI to have bad incentives, whereas here I’m postulating that the world model itself has malicious goals and is manipulating BoMAI.)