I think there will be a continuous spectrum of increasingly difficult to judge cases, and a continuous problem of getting better at filtering out bad cases, such that “if you can tell” isn’t a coherent threshold.
I agree with this, but then I don’t understand how this solution helps? Like, here we have a case where we can still tell that the environment is being reward hacked, and we tell the model it’s fine. Tomorrow the model will encounter an environment where we can’t tell that it’s reward hacking, so the model will also think it’s fine, and then we don’t have a feedback loop anymore, and now we just have a model that is happily deceiving us.
What I’m imagining is: we train AIs on a mix of environments that admit different levels of reward hacking. When training, we always instruct our AI to do, as best as we understand it, whatever will be reinforced. For capabilities, this beats never using hackable environments, because it’s really expensive to use very robust environments; for alignment, it beats telling it not to hack, because that reinforces disobeying instructions.
In the limit, this runs into problems where we have very limited information about what reward hacking opportunities are present in the training environments, so the only instruction we can be confident is consistent with the grader is “do whatever will receive a high score from the grader”, which will… underspecify… deployment behavior, to put it mildly.
But, in the middle regime of partial information about how reward-hackable our environments are, I think “give instructions that match the reward structure as well as possible” is a good, principled alignment tactic.
Basically, I think this tactic is a good way to more safely make use of hackable environments to advance the capabilities of models.
I agree with this, but then I don’t understand how this solution helps? Like, here we have a case where we can still tell that the environment is being reward hacked, and we tell the model it’s fine. Tomorrow the model will encounter an environment where we can’t tell that it’s reward hacking, so the model will also think it’s fine, and then we don’t have a feedback loop anymore, and now we just have a model that is happily deceiving us.
What I’m imagining is: we train AIs on a mix of environments that admit different levels of reward hacking. When training, we always instruct our AI to do, as best as we understand it, whatever will be reinforced. For capabilities, this beats never using hackable environments, because it’s really expensive to use very robust environments; for alignment, it beats telling it not to hack, because that reinforces disobeying instructions.
In the limit, this runs into problems where we have very limited information about what reward hacking opportunities are present in the training environments, so the only instruction we can be confident is consistent with the grader is “do whatever will receive a high score from the grader”, which will… underspecify… deployment behavior, to put it mildly.
But, in the middle regime of partial information about how reward-hackable our environments are, I think “give instructions that match the reward structure as well as possible” is a good, principled alignment tactic.
Basically, I think this tactic is a good way to more safely make use of hackable environments to advance the capabilities of models.