Interesting work.
One direction that would be interesting to explore: all of your pairings are same-family. Same-family models likely share some core reasoning, so
1) a debate transcript from the stronger model might help a weaker judge from the same family more than a transcript from a different family would, since same-family models may understand each other better,
2) but they might also share failure modes, meaning a same-family critic could be systematically blind to the same errors as the weaker judge.
Cross-family testing might surface qualitatively different objections, potentially widening or narrowing the classifier gap.