Interesting work.
One direction worth exploring: all your pairings are same-family. Same-family models likely share some core reasoning, so
1) debate transcripts from the stronger model might help a weaker same-family judge more than transcripts from a different family would, since same-family models may understand each other better,
2) but they might also share failure modes, meaning a same-family critic could be systematically blind to the same errors as the weaker judge.
Cross-family testing might surface qualitatively different objections, potentially widening or narrowing the classifier gap.
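To make the suggestion concrete, here is a rough sketch of how the strong-debater / weak-judge pairings could be enumerated and labeled as same-family or cross-family. The model names, families, and strength tiers are placeholders I made up for illustration, not the ones used in the post, and `pairings` is a hypothetical helper.

```python
from itertools import product

# Hypothetical roster: (name, family, strength tier). All values are placeholders.
MODELS = [
    ("A-large", "A", "strong"),
    ("A-small", "A", "weak"),
    ("B-large", "B", "strong"),
    ("B-small", "B", "weak"),
]

def pairings(models):
    """Return (debater, judge, kind) for every strong-debater / weak-judge pair."""
    strong = [m for m in models if m[2] == "strong"]
    weak = [m for m in models if m[2] == "weak"]
    out = []
    for debater, judge in product(strong, weak):
        # Label the pair by whether debater and judge share a family.
        kind = "same-family" if debater[1] == judge[1] else "cross-family"
        out.append((debater[0], judge[0], kind))
    return out

for debater, judge, kind in pairings(MODELS):
    print(f"{debater} -> {judge}: {kind}")
```

Running the same evaluation on both groups and comparing the gap per group would show whether the effect depends on family overlap.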
More experienced, long-serving politicians tend to have a dedicated writing team, which may be less open to using AI in writing speeches. It would be interesting to divide politicians by level of experience and see how these statistics compare across groups.