A lot of the time, we consider our models to be functions from parameters and inputs to outputs, and we imagine training the parameters with SGD. One notable feature of this setup is that SGD isn’t by default purposefully trying to kill you—it might find a model that kills you, or a model that gradient hacks and then kills you, but this is more like incompetence/indifference on SGD’s part, rather than malice.
A plausible objection to this framing is that much of the knowledge of our models is probably going to be produced in other ways than SGD. For example, the models might write down various notes (in natural language or in neuralese) that they then read later, and they might have internal structures like economies that produce and consume information. Does this introduce new alignment problems?
Here’s a way I’ve been thinking about this recently. I’m writing this in way that might feel obnoxiously overwrought because this is the way that I think would have conveyed my current intuition to me two months ago.
In SGD, we update our weights by something like:
weights ← weights + alpha * (d loss/d weights)
You might think that this is fundamental. But actually it’s just a special case of the more general life rule:
do something that seems like a good idea, based on the best available estimate of what’s a good idea
Imagine that you want a good language model, and you randomly initialize it and now you’re sitting at your computer unsure what to do next, because (as a totally random model) it’s totally useless to you. What’s a good idea to do?
Your best idea is to slightly perturb the model based on a local estimate of which direction in parameter space will reduce log loss, evaluated on a random batch. So you take that derivative and update your weights with it.
And then you do this over and over again, basically because you don’t have any better ideas for what to do.
(I think that the “we’re just trying to do something that seems good based on our best guess” perspective is also an intuition that I was missing for why adversarial training is good, or why using a more conservative oversight process during adversarial training might be good. We’re just trying to make an informed guess based on our available estimate about what parameter update will maximize goodness, and we think that the gradient of a conservative loss evaluated on adversarially chosen data might be a useful contributor to that best-guess parameter update.)
If you were training an AGI instead of a dumb LM, the situation would be different, because when you have a partially trained AGI, your sense of what’s a good idea might have materially changed, because the partially trained AGI can contribute to your effort to estimate what’s a good idea to do next.
The simplest case of this is that you use your AGI when you’re calculating the loss for an answer. This is the basic idea behind IDA/debate—during training, your sense of how good an answer is gets better for the obvious reason. And if your model can deceive you at this point, you’ll maybe pick bad losses, then get bad gradient estimates, then maybe eventually get paperclipped.
But we can also do more complicated things than estimating a loss. For example, at some point our AI might know something about what changes to its parameters would lead to outcomes that we liked better, or that would lead to more paperclips. At this point, we need to be able to access the model’s opinions about parameter changes that will lead to good outcomes for us (or else we’re not competitive), and we also want to avoid the model using its knowledge of parameter changes that would lead to paperclips to gradient hack.
(Just to rephrase/reiterate a point in that last paragraph that I hadn’t thought of before: If your model is able to gradient hack, then it’s apparently better than SGD at at least one aspect of guessing what parameter changes will lead to what long-term outcomes. And so for competitiveness reasons, you might need your weight update rule to be “ask the model for its favorite new choice of weights (perhaps giving it the loss gradient as an input that it can use as it wishes) and use its suggestion”. And if the AI is misaligned at this point, you are obviously completely fucked re gradient hacking. So it is kind of missing the point to argue about whether the model practically is able to gradient hack just via choosing actions.)
And so the hope for competitive alignment has to go via an inductive property—you’re using all the model’s relevant knowledge to update your model at every step, and so everything is going as fast as possible and maintaining alignment.
And this setup is basically the same for any other mechanism via which your AI might influence its future behavior, including writing notes-to-self or having some global memory bank or whatever.
And so in conclusion:
Gradient hacking isn’t really a different problem than needing to have access to the model’s knowledge in order to provide a good loss.
Gradient hacking isn’t really a different problem than handling other mechanisms by which the AI’s actions affect its future actions, and so it’s fine for us to just talk about having parameters and an update rule.
Another way of saying some of this: Suppose your model can gradient hack. Then it can probably also make useful-for-capabilities suggestions about what its parameters should be changed to. Therefore a competitive alignment scheme needs to be robust to a training procedure where your model gets to pick new parameters for itself. And so competitive alignment schemes are definitely completely fucked if the model wants to gradient hack.
One thing that makes me suspicious about this argument is that, even though I can gradient hack myself, I don’t think I can make suggestions about what my parameters should be changed to.
How can I gradient hack myself? For example, by thinking of strawberries every time I’m about to get a reward. Now I’ve hacked myself to like strawberries. But I have no idea how that’s implemented in my brain, I can’t “pick the parameters for myself”, even if you gave me a big tensor of gradients.
Two potential alternatives to the thing you said:
maybe competitive alignment schemes need to be robust to models gradient hacking themselves towards being more capable (although idk why this would make a difference).
maybe competitive alignment schemes need to be robust to models (sometimes) choosing their own rewards to reinforce competent behaviour. (Obviously can’t let them do it too often or else your model just wireheads.)
[epistemic status: speculative]
A lot of the time, we consider our models to be functions from parameters and inputs to outputs, and we imagine training the parameters with SGD. One notable feature of this setup is that SGD isn’t by default purposefully trying to kill you—it might find a model that kills you, or a model that gradient hacks and then kills you, but this is more like incompetence/indifference on SGD’s part, rather than malice.
A plausible objection to this framing is that much of the knowledge of our models is probably going to be produced in other ways than SGD. For example, the models might write down various notes (in natural language or in neuralese) that they then read later, and they might have internal structures like economies that produce and consume information. Does this introduce new alignment problems?
Here’s a way I’ve been thinking about this recently. I’m writing this in way that might feel obnoxiously overwrought because this is the way that I think would have conveyed my current intuition to me two months ago.
In SGD, we update our weights by something like:
weights ← weights + alpha * (d loss/d weights)
You might think that this is fundamental. But actually it’s just a special case of the more general life rule:
do something that seems like a good idea, based on the best available estimate of what’s a good idea
Imagine that you want a good language model, and you randomly initialize it and now you’re sitting at your computer unsure what to do next, because (as a totally random model) it’s totally useless to you. What’s a good idea to do?
Your best idea is to slightly perturb the model based on a local estimate of which direction in parameter space will reduce log loss, evaluated on a random batch. So you take that derivative and update your weights with it.
And then you do this over and over again, basically because you don’t have any better ideas for what to do.
(I think that the “we’re just trying to do something that seems good based on our best guess” perspective is also an intuition that I was missing for why adversarial training is good, or why using a more conservative oversight process during adversarial training might be good. We’re just trying to make an informed guess based on our available estimate about what parameter update will maximize goodness, and we think that the gradient of a conservative loss evaluated on adversarially chosen data might be a useful contributor to that best-guess parameter update.)
If you were training an AGI instead of a dumb LM, the situation would be different, because when you have a partially trained AGI, your sense of what’s a good idea might have materially changed, because the partially trained AGI can contribute to your effort to estimate what’s a good idea to do next.
The simplest case of this is that you use your AGI when you’re calculating the loss for an answer. This is the basic idea behind IDA/debate—during training, your sense of how good an answer is gets better for the obvious reason. And if your model can deceive you at this point, you’ll maybe pick bad losses, then get bad gradient estimates, then maybe eventually get paperclipped.
But we can also do more complicated things than estimating a loss. For example, at some point our AI might know something about what changes to its parameters would lead to outcomes that we liked better, or that would lead to more paperclips. At this point, we need to be able to access the model’s opinions about parameter changes that will lead to good outcomes for us (or else we’re not competitive), and we also want to avoid the model using its knowledge of parameter changes that would lead to paperclips to gradient hack.
(Just to rephrase/reiterate a point in that last paragraph that I hadn’t thought of before: If your model is able to gradient hack, then it’s apparently better than SGD at at least one aspect of guessing what parameter changes will lead to what long-term outcomes. And so for competitiveness reasons, you might need your weight update rule to be “ask the model for its favorite new choice of weights (perhaps giving it the loss gradient as an input that it can use as it wishes) and use its suggestion”. And if the AI is misaligned at this point, you are obviously completely fucked re gradient hacking. So it is kind of missing the point to argue about whether the model practically is able to gradient hack just via choosing actions.)
And so the hope for competitive alignment has to go via an inductive property—you’re using all the model’s relevant knowledge to update your model at every step, and so everything is going as fast as possible and maintaining alignment.
And this setup is basically the same for any other mechanism via which your AI might influence its future behavior, including writing notes-to-self or having some global memory bank or whatever.
And so in conclusion:
Gradient hacking isn’t really a different problem than needing to have access to the model’s knowledge in order to provide a good loss.
Gradient hacking isn’t really a different problem than handling other mechanisms by which the AI’s actions affect its future actions, and so it’s fine for us to just talk about having parameters and an update rule.
Another way of saying some of this: Suppose your model can gradient hack. Then it can probably also make useful-for-capabilities suggestions about what its parameters should be changed to. Therefore a competitive alignment scheme needs to be robust to a training procedure where your model gets to pick new parameters for itself. And so competitive alignment schemes are definitely completely fucked if the model wants to gradient hack.
One thing that makes me suspicious about this argument is that, even though I can gradient hack myself, I don’t think I can make suggestions about what my parameters should be changed to.
How can I gradient hack myself? For example, by thinking of strawberries every time I’m about to get a reward. Now I’ve hacked myself to like strawberries. But I have no idea how that’s implemented in my brain, I can’t “pick the parameters for myself”, even if you gave me a big tensor of gradients.
Two potential alternatives to the thing you said:
maybe competitive alignment schemes need to be robust to models gradient hacking themselves towards being more capable (although idk why this would make a difference).
maybe competitive alignment schemes need to be robust to models (sometimes) choosing their own rewards to reinforce competent behaviour. (Obviously can’t let them do it too often or else your model just wireheads.)
In hindsight this is obviously closely related to what paul was saying here: https://ai-alignment.com/mundane-solutions-to-exotic-problems-395bad49fbe7