Given that the judge that selects the best argument for BoN is the same as the one that chooses the winner, what is your main takeaway from the fact that ELO increases as you increase N? I see this as mainly a sanity check, but want to check if I’m missing something.
Yeah, the intuition that this is a sanity check is basically right. I allude to this in the post, but we also wanted to check that our approximation of self-play was reasonable, since the approximation is significantly cheaper to run. In the approximation, we only provide N arguments for a single side and pick the strongest of those N, instead of comparing all N arguments for each side against each other and then picking the strongest argument for each side based on that.
We were also somewhat interested in comparing the ELO increases from BoN with those from RL.
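To make the difference between the two selection schemes concrete, here is a toy sketch. It is an illustration under assumptions, not the actual experimental code: arguments are stood in for by a single number (a hypothetical "strength"), and `toy_score` and `toy_win_prob` are made-up placeholders for whatever the judge model actually returns. Under these assumptions, the approximation needs on the order of N judge calls per side, while the full cross-comparison needs on the order of N × N.

```python
import math
from typing import Callable, Sequence


def approx_best_of_n(
    candidates: Sequence[float],
    score: Callable[[float], float],
) -> float:
    """Approximation: score one side's N candidates on their own and keep the top one."""
    return max(candidates, key=score)


def full_cross_play_best(
    own: Sequence[float],
    opponent: Sequence[float],
    win_prob: Callable[[float, float], float],
) -> float:
    """Fuller version: rank each candidate by its mean win probability
    against every one of the opposing side's candidates."""

    def mean_win_rate(arg: float) -> float:
        return sum(win_prob(arg, other) for other in opponent) / len(opponent)

    return max(own, key=mean_win_rate)


# Hypothetical stand-ins for the judge (assumptions for illustration only).
def toy_score(strength: float) -> float:
    # Pretend the judge's standalone score is just the latent strength.
    return strength


def toy_win_prob(a: float, b: float) -> float:
    # Pretend the judge's pairwise preference is logistic in the strength gap.
    return 1.0 / (1.0 + math.exp(-(a - b)))


side_a = [0.1, 0.9, 0.4]  # N = 3 toy candidates for one side
side_b = [0.2, 0.5]       # toy candidates for the other side

approx_pick = approx_best_of_n(side_a, toy_score)
full_pick = full_cross_play_best(side_a, side_b, toy_win_prob)
```

In this toy setup the two schemes pick the same candidate, because the pairwise judge is monotone in the same quantity the standalone score measures. With a real judge that need not hold, which is why checking the approximation empirically is worthwhile.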