During training, the inner optimizer has the same behavior as the benign model: while it’s still dumb it just doesn’t know how to do better; when it becomes smarter and reaches strategic awareness it will be deceptive.
You’re still assuming that you have a perfect consequentialist trapped in a box.
And sure, if you have an AI that accurately guesses whether it’s in training or not, and if in training performs predictions as intended, and if not in training does some sort of dangerous consequentialist thing, then that AI will do well in the loss function and end up doing some sort of dangerous consequentialist thing once deployed.
But that’s not specific to doing some sort of dangerous consequentialist thing. If you’ve got an AI that accurately guesses whether it’s in training or not, and if in training performs predictions as intended, but otherwise throws null pointer exceptions, then that AI will also do well in the loss function but end up throwing null pointer exceptions once deployed. Or if you’ve got an AI that accurately guesses whether it’s in training or not, and if in training performs predictions as intended, but otherwise shows a single image of a paperclip, then again you have an AI that does well in the loss function but ends up showing a single image of a paperclip once deployed.
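To make that concrete, here is a toy Python sketch. Everything in it is a made-up stand-in: `make_model`, `training_loss`, the `in_training` flag, and the target `y = 2x` are assumptions for illustration, not anything from a real training setup. It only shows that when the loss never sees anything but the in-training branch, it cannot tell the off-training behaviors apart.

```python
def intended(x):
    return 2 * x  # stand-in for "performs predictions as intended"

def crash(x):
    raise RuntimeError("null pointer exception")

def paperclip(x):
    return "<image of a paperclip>"

def make_model(off_training_behavior):
    # Hypothetical model that (somehow) knows perfectly whether it is in training.
    def model(x, in_training):
        return intended(x) if in_training else off_training_behavior(x)
    return model

def training_loss(model, data):
    # The loss is only ever evaluated with in_training=True.
    return sum((model(x, True) - y) ** 2 for x, y in data) / len(data)

data = [(x, 2 * x) for x in range(5)]
models = {"crash": make_model(crash), "paperclip": make_model(paperclip)}
losses = {name: training_loss(m, data) for name, m in models.items()}
print(losses)
```

Both models get a training loss of 0.0, even though one crashes and the other shows a paperclip the moment `in_training` is false.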
The magical step we’re missing is, why would we end up with a perfect consequentialist in a box? That seems like a highly specific hypothesis for what the predictor would do. And if I try to reason about it mechanistically, it doesn’t seem like the standard ways AI gets made, i.e. by gradient descent, would generate that.
Because with gradient descent, you try a bunch of AIs that partly work, and then move in the direction that works better. And so with gradient descent, before you have a perfect consequentialist that can accurately predict whether it’s in training, you’re going to have an imperfect consequentialist that cannot accurately predict whether it’s in training. And this might sometimes accidentally decide that it’s not in training, and output a prediction that’s “intended” to control the world at the cost of some marginal prediction accuracy, and then the gradient is going to notice that something is wrong and is going to turn down the consequentialist. (And yes, this would also encourage deception, but come on, what’s easier—“don’t do advanced planning for how to modify the world and use this to shift your predictions” or “do advanced planning for how to do advanced planning for how to modify the world using your predictions without getting caught”?)
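A crude way to picture the "turn down the consequentialist" step is sketched below. It assumes a linear predictor with a hypothetical manipulation knob `c` that only fires when the model wrongly concludes it is not in training; `p_misfire`, the target `y = 2x`, and the learning rate are all invented for this sketch. It does not model deception at all, only the simple case where a misfire costs prediction accuracy.

```python
import random

def train(steps=2000, lr=0.05, p_misfire=0.2, seed=0):
    rng = random.Random(seed)
    w = 0.0   # ordinary predictive weight
    c = 1.0   # hypothetical "manipulation" knob, starts switched on
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)
        y = 2.0 * x                                   # what the predictor is scored against
        m = 1.0 if rng.random() < p_misfire else 0.0  # 1 = wrongly thinks it is deployed
        y_hat = w * x + c * m                         # the knob shifts the prediction on a misfire
        err = y_hat - y
        # plain SGD on the squared error err**2
        w -= lr * 2.0 * err * x
        c -= lr * 2.0 * err * m
    return w, c

w, c = train()
print(w, c)
```

In this toy setup `c` shrinks toward zero and `w` settles near 2, because every misfire produces an error that the gradient attributes to `c`.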
Re: tameness of $L^*(\mu) = L_\mu(\mu) - \min_m L_\mu(m)$ (using min cause $L$ is a loss), some things that come to mind are
a) $L^*$ is always larger than zero, so it can be minimized by a strategy that takes over the input channel and induces random noise so no strategy can do better than random, thus $L_\mu(\mu) \approx \min_m L_\mu(m)$.
This works as an optimum for $L^*$, but here you then have to go for another layer of analysis. $L^*$ measures the degree to which something is a fixed point of the training equation, but obviously only a stable fixed point would actually be reached during the training process. So that raises the question, is the optimum you propose here a stable fixed point?
Let’s consider some strategy that is almost perfectly what you describe. Its previous predictions have caused enough chaos to force the variable it has to predict to be almost random—but not quite. It can now spend its marginal resources on two things:
1. Introduce even more chaos, likely at the expense of immediate predictive power.
2. Predict the little bit of signal, likely at the expense of being unable to make as much chaos.
Due to the myopia, gradient descent will favor the latter and completely ignore the former. But the latter moves it away from the fixed point, while the former moves it towards the fixed point. So your proposed fixed point is unstable.
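Here is a one-parameter toy version of that stability argument, under assumptions that are mine and not part of the original setup: `e` is the share of resources spent on chaos, the remaining signal is `s = 1 - e`, and the per-step loss is `(s * e)**2`, i.e. the signal that exists times the share of effort not spent predicting it. The myopic update treats `s` as a constant; the non-myopic one, included only for contrast, also differentiates through `s`.

```python
def run(myopic, e=0.99, lr=0.25, steps=1000):
    """e = share of resources spent on chaos; 1 - e goes to prediction."""
    for _ in range(steps):
        s = 1.0 - e  # assumed: signal left is whatever chaos has not destroyed
        if myopic:
            # gradient of (s * e)**2 with s held fixed
            grad = 2.0 * s * s * e
        else:
            # gradient of the same loss, also differentiating through s = 1 - e
            grad = 2.0 * s * e * (s - e)
        e = min(1.0, max(0.0, e - lr * grad))
    return e

print(run(myopic=True))   # chaos share after myopic training
print(run(myopic=False))  # chaos share when the effect on the signal is counted
```

Starting at `e = 0.99`, just next to the all-chaos point, the myopic run drifts away and ends with almost no effort on chaos, while the non-myopic run settles into all-chaos. In this sketch, at least, the proposed optimum is only stable for an optimizer that gets credit for destroying future signal.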
b) Depending on which model class the min is taken over, the model can get less than zero loss by hacking its environment to get more compute (thus escaping the model class in the min)
The other models would also get access to this compute; that’s sort of the point of the model.