> if some training procedure causes the model to reason more about reward hacking,
I’m curious what kinds of training procedures you would expect to cause models to reason more about reward hacking while not actually rewarding reward-hacked outcomes.