(This “better almost 50% of the time” property is one way of trying to operationalize “we don’t want the filtered policy to be worse”. It so happens that this property is actually kind of badly behaved, but in our case it seems fine, given that we’re always going to be comparing against a fixed unfiltered distribution.)
I’ve read the intransitive dice page, but I’m confused on how it might apply here? Like concretely, what are the dice in the analogy?
Suppose you have three text-generation policies, and you define “policy X is better than policy Y” as “when a human is given a sample from both policy X and policy Y, they prefer the sample from the latter more than half the time”. That definition of “better” is intransitive.
Hum, I see. And is your point that it should not create a problem because you’re only doing comparison X vs Y and Z vs Y (where Y is the standard policy and X and Z are two of your conservative policies) but you don’t really care about the comparison between X and Z?
I’ve read the intransitive dice page, but I’m confused on how it might apply here? Like concretely, what are the dice in the analogy?
Suppose you have three text-generation policies, and you define “policy X is better than policy Y” as “when a human is given a sample from both policy X and policy Y, they prefer the sample from the latter more than half the time”. That definition of “better” is intransitive.
Hum, I see. And is your point that it should not create a problem because you’re only doing comparison X vs Y and Z vs Y (where Y is the standard policy and X and Z are two of your conservative policies) but you don’t really care about the comparison between X and Z?