how is the model supposed to know which trajectory is cheating? there is the super smart part which understands in some implicit sense but won’t necessarily tell the assistant part; the assistant part is not good enough at code or whatever to know by itself, and has to try to elicit stuff from the code part, which it may or may not succeed at.
I think maybe this is the crux. Assuming the model starts out robustly aligned, and is bootstrapping in an on-policy way, it should be able to tell if its own trajectory is cheating or not. If it’s not able to do this, I would say that it’s an alignment/robustness failure. It seems difficult to accidentally reward-hack in a way that the robustly aligned model we started with doesn’t detect after reviewing the trajectory.
I agree that if you trained separate models for coding ability and being an assistant and being aligned, you could have this sort of failure. But the gradient update applies to the full model, right? Why is it that the robustly aligned model we started out with after an update, which (according to it) wasn’t reward hacking, is so unaware of its newfound coding ability as to not continue being robustly aligned?
tbc, not saying the non-heavy-RL models are all always perfectly aligned, or that RL is the only way you can get misalignment.
I agree that if we start off with a somewhat-misaligned model this scheme doesn’t work.
It seems difficult to accidentally reward-hack in a way that the robustly aligned model we started with doesn’t detect after reviewing the trajectory.
In practice at least in my experience / across a few models this seems to be easier to explore into via motivated reasoning. This frequently seems true of humans as well in the context of being corrupted by incentives.[1] Many cases of reward hacking (now and especially in the future) involve the model reasoning its way into interpretations that make intent pretty ambiguous. Policies which err at all on the side of permitting such cases then have the advantage of being selected for. You could imagine some setup where a model is always also reasoning about how future updates will affect it, such that it’s cautious about this, but you’re still subject to the same effects and this becomes a question of needing to reliably “training game but for good” in a way that holds.[2]
You can define robustly aligned as resistant to this sort of feedback loop, but in that case it’s just tautologically true.
This territory also seems super unexplored currently, i.e. both models capable enough to do this, what happens here under lots of reflection, etc.