> So one naive notion is “% of rollouts on which they get the same score.”
Rollouts from which policy? In RL lingo, you could talk about something like the number of (s, a) pairs on which the rewards differ, which takes the policy out of the picture (though I am not sure how instructive such a formulation would be for LLMs, given that the dynamics are kind of implicitly specified by the LLM itself). You could then talk about differences in the occupancy measures induced by the different reward functions (via the policies optimized against them) as a “policy space” dual of “differences in reward functions”.
https://arxiv.org/abs/2209.13085 is probably a good place to start.
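
To make the two notions concrete, here is a rough sketch on a toy tabular MDP. Everything in it is my own assumption for illustration and is not taken from the linked paper: the two-state MDP, the rewards `r1` and `r2`, the helper names, and the choice of total variation between normalized discounted occupancy measures of the greedy (value-iteration) policies as the "policy space" distance.

```python
import numpy as np

def fraction_differing_pairs(r1, r2, tol=1e-8):
    """Share of (s, a) pairs on which two tabular rewards disagree."""
    return float(np.mean(np.abs(r1 - r2) > tol))

def greedy_policy(P, r, gamma, iters=1000):
    """Value iteration; returns a deterministic policy as a one-hot (S, A) array."""
    S, A = r.shape
    v = np.zeros(S)
    for _ in range(iters):
        q = r + gamma * (P @ v)  # P has shape (S, A, S), so q has shape (S, A)
        v = q.max(axis=1)
    q = r + gamma * (P @ v)
    pi = np.zeros((S, A))
    pi[np.arange(S), q.argmax(axis=1)] = 1.0
    return pi

def occupancy_measure(P, pi, mu, gamma):
    """rho(s, a) = (1 - gamma) * sum_t gamma^t * Pr(s_t = s, a_t = a)."""
    S = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state kernel under pi
    d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)
    return d[:, None] * pi

def tv_distance(rho1, rho2):
    """Total variation distance between two occupancy measures."""
    return 0.5 * float(np.abs(rho1 - rho2).sum())

# Hypothetical toy MDP: action 0 stays in the current state, action 1 switches.
P = np.zeros((2, 2, 2))
for s in range(2):
    P[s, 0, s] = 1.0
    P[s, 1, 1 - s] = 1.0
mu = np.array([1.0, 0.0])  # always start in state 0
gamma = 0.9

# Two rewards that differ on exactly one (s, a) pair.
r1 = np.array([[1.0, 0.0], [0.0, 0.0]])
r2 = np.array([[1.0, 3.0], [0.0, 0.0]])

frac = fraction_differing_pairs(r1, r2)
rho1 = occupancy_measure(P, greedy_policy(P, r1, gamma), mu, gamma)
rho2 = occupancy_measure(P, greedy_policy(P, r2, gamma), mu, gamma)
tv = tv_distance(rho1, rho2)

print(f"fraction of (s, a) pairs that differ: {frac:.2f}")
print(f"TV distance between occupancy measures: {tv:.2f}")
```

In this toy case the rewards disagree on only one of four (s, a) pairs, yet the greedy policies end up with disjoint occupancy measures (one stays in state 0 forever, the other alternates between states), so the two notions can come apart quite badly. For LLMs the "state" would be something like the token prefix, so the tabular version is only meant to show the shape of the computation.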