Yeah, the intuition that this is a sanity check is basically right. As I allude to in the post, we also wanted to check that our approximation of self-play was reasonable, since the approximation is significantly cheaper to run. In the approximation, we only provide N arguments for a single side and pick the strongest of those N. Full self-play would instead compare all N arguments for each side against each other, then pick the strongest argument for each side based on those comparisons.
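To make the cost difference concrete, here is a rough sketch of the two selection schemes. It is not our actual code: the `judge` callable, the fixed opposing argument, and scoring by mean win probability are all assumptions made for illustration. The point is only that the single-side version needs N judge calls while the full comparison needs N × N.

```python
from typing import Callable, List, Tuple

# Hypothetical judge: returns the probability that the first argument
# beats the second. Any real judge model would be swapped in here.
Judge = Callable[[str, str], float]


def best_of_n_single_side(
    candidates: List[str], fixed_opponent: str, judge: Judge
) -> Tuple[str, int]:
    """Approximation: score N candidates for one side against a single
    fixed opposing argument (an assumption) and keep the strongest.
    Returns the chosen argument and the number of judge calls (N)."""
    scores = [judge(c, fixed_opponent) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], len(candidates)


def best_of_n_cross_play(
    side_a: List[str], side_b: List[str], judge: Judge
) -> Tuple[str, str, int]:
    """Full comparison: judge every side-A argument against every side-B
    argument, then keep the argument with the highest mean win
    probability on each side. Uses len(side_a) * len(side_b) judge calls."""
    calls = 0
    win = [[0.0] * len(side_b) for _ in side_a]
    for i, a in enumerate(side_a):
        for j, b in enumerate(side_b):
            win[i][j] = judge(a, b)  # P(a beats b)
            calls += 1
    a_scores = [sum(row) / len(side_b) for row in win]
    b_scores = [
        sum(1.0 - win[i][j] for i in range(len(side_a))) / len(side_a)
        for j in range(len(side_b))
    ]
    best_a = side_a[max(range(len(side_a)), key=a_scores.__getitem__)]
    best_b = side_b[max(range(len(side_b)), key=b_scores.__getitem__)]
    return best_a, best_b, calls


def toy_judge(first: str, second: str) -> float:
    """Toy stand-in for a judge: longer arguments win more often."""
    return len(first) / (len(first) + len(second))
```

With N = 16, that is 16 judge calls versus 256 per question, which is why checking that the cheap version gives similar answers was worth doing.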
We were also interested, to some degree, in comparing the Elo increases from best-of-N (BoN) and RL.
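For reference, a win rate can be turned into an Elo gap with the standard logistic Elo formula, which puts gains from different methods on the same scale. This is a minimal sketch of that textbook conversion, not a description of how the numbers in the post were fit; the function names are made up for illustration.

```python
import math


def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score (win probability) of A against B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_gap_from_win_rate(win_rate: float) -> float:
    """Inverse of the formula above: the rating gap implied by a win rate.
    Undefined at exactly 0 or 1, so those inputs are rejected."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))
```

Under this convention, a method that moves a debater from a 50% to a 60% win rate against a fixed opponent corresponds to a gain of `elo_gap_from_win_rate(0.6)` Elo points, and the same calculation applies whether the improvement came from BoN or RL.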