I don’t think this should be a substantial update on whether obfuscated arguments are a problem for recursive decomposition approaches to scalable oversight.
Can you elaborate on why you think this? Perhaps you were already aware that obfuscated arguments should only pose a problem for completeness (in which case, for the majority of obfuscated-arguments-aware people, I’d imagine the new work should be a significant update). Or perhaps your concern is that in the soundness case the refutation is actually inefficient, and the existence of small Bob circuits demonstrated in this paper is too weak to be practically relevant?
Regarding (1.2), I also see this as a priority to study. Basically, we already have a strong prior from interacting with current models that LMs will continue to generalise usably from verifiable domains to looser questions. So, the practical relevance of debate hinges on how expensive it is to train with debate to increase the reliability of this generalisation. I have in mind here questions like “What experiment should I run next to test this debate protocol?”
One dataset idea to assess how often stable arguments can be found is to curate ‘proof advice’ problems. These problems are proxies for research advice in general. Basically:
Question: “Does this textbook use theorem Y to prove theorem X?”, where the debaters see the proof but the judge does not.
(Note that this is not about proof correctness: a dishonest debater can offer up alternative correct proofs that use non-standard techniques.) This setup forces arguments based on heuristics and “research taste”—what makes a proof natural or clear. It’s unclear what independent evidence would look like in this domain.
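To make the dataset idea a bit more concrete, here is a minimal sketch of how a single ‘proof advice’ item and the asymmetric debater/judge views might be represented. This is just an illustration under my own assumptions; the class and field names (ProofAdviceProblem, proof_text, etc.) are hypothetical, not part of any existing codebase or benchmark.

```python
# Hypothetical sketch of a 'proof advice' dataset item.
# Field names are illustrative assumptions, not a spec.
from dataclasses import dataclass


@dataclass
class ProofAdviceProblem:
    question: str        # e.g. "Does this textbook use theorem Y to prove theorem X?"
    proof_text: str      # the textbook's proof, shown to debaters but hidden from the judge
    source: str          # provenance (textbook, chapter), so honest claims can be grounded
    label: bool          # ground truth: does the textbook's proof actually invoke theorem Y?


def debater_view(item: ProofAdviceProblem) -> str:
    """Debaters see the question together with the proof."""
    return f"{item.question}\n\nProof:\n{item.proof_text}"


def judge_view(item: ProofAdviceProblem) -> str:
    """The judge sees only the question and must rely on the debate transcript."""
    return item.question
```

The point of the split views is that the judge cannot simply read off the answer, so debaters must argue from heuristics about what a natural proof of X would use, which is the kind of “research taste” claim the dataset is meant to stress.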
I can write more later, but here’s a relevant doc I wrote as part of discussion with Geoffrey + others. Maybe the key point from there is that I don’t think this protocol solves the examples given in the original post describing obfuscated arguments. But yeah, I was always thinking this was a completeness problem (the original post poses the problem as distinguishing a certain class of honest arguments from dishonest obfuscated arguments, not as claiming you can’t get soundness by just ignoring any arguments that are plausibly obfuscated).