Thanks for the reply!
I feel that more “deliberate” reward signals could be an interesting direction, but at the same time the “overarching ML lesson” seems to be that approximating complex signals instead of computing them directly always wins (e.g. PPO vs TRPO). At the very least, I think you’d need to train the reviewer model for this specific task.
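To spell out what I mean by approximating vs. computing, using the standard objectives from the TRPO and PPO papers: TRPO maximizes the surrogate objective under an explicit KL trust-region constraint, roughly

$$\max_\theta \; \mathbb{E}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)} \hat{A}_t\right] \quad \text{s.t.} \quad \mathbb{E}_t\!\left[\mathrm{KL}\!\left(\pi_{\theta_\text{old}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\right)\right] \le \delta,$$

which needs second-order machinery to solve, whereas PPO just approximates that constraint with a clipped ratio,

$$\max_\theta \; \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\; \mathrm{clip}\!\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)},$$

and the cheap approximation is the one that stuck.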
However, I think our disagreement on the first three points is somewhat fundamental, so I’ll put it aside for now. Please let me know if you think more feedback / discussion might be useful!
I am still curious why you think the model won’t become performative, e.g. adding unnecessary double-checking on questions it already gets correct, or making intentional mistakes so that it can heroically catch them. Maybe you could try specifying the reward you have in mind more concretely?
Thanks!