Suppose you have three text-generation policies, and you define “policy X is better than policy Y” as “when a human is given one sample from policy X and one from policy Y, they prefer the sample from the former more than half the time”. That definition of “better” is not transitive in general: X can be better than Y and Y better than Z while Z is better than X.
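To make the cycle concrete, here is a minimal sketch in Python using the classic non-transitive dice construction. Everything in it is an assumption made for illustration: each policy emits one of three samples uniformly at random, each sample is summarized by a single made-up quality score, and the human always prefers the higher-scoring sample. Real preferences are noisier, but the same kind of cycle can still appear.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy policies: each emits one of three samples uniformly at random.
# The integer "quality" scores are invented for illustration only.
POLICIES = {
    "X": [2, 4, 9],
    "Y": [1, 6, 8],
    "Z": [3, 5, 7],
}


def win_probability(a, b):
    """Exact probability that a sample from policy a is preferred to one from policy b.

    Assumes the human always prefers the sample with the higher score.
    """
    wins = sum(1 for sa, sb in product(POLICIES[a], POLICIES[b]) if sa > sb)
    return Fraction(wins, len(POLICIES[a]) * len(POLICIES[b]))


def is_better(a, b):
    """'a is better than b' under the pairwise-preference definition above."""
    return win_probability(a, b) > Fraction(1, 2)


for a, b in [("X", "Y"), ("Y", "Z"), ("Z", "X")]:
    print(f"P({a} preferred over {b}) = {win_probability(a, b)}")
```

Each of the three comparisons comes out at 5/9, so X is better than Y, Y is better than Z, and Z is better than X. No ranking of the three policies is consistent with all of the pairwise results.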
Hmm, I see. And is your point that this should not create a problem because you’re only comparing X vs Y and Z vs Y (where Y is the standard policy and X and Z are two of your conservative policies), and you don’t really care about the comparison between X and Z?