What about when A knows where the flaw is but B doesn’t?
The setting is one where A and B have access to the same set of heuristics. This is modeled explicitly in the paper as a shared set of functions they can call; in the LLM-training setting it corresponds to A and B being the same model, or very similar models.
Sure, and I expect there are reasons for assuming that A can't hide a flaw from B, e.g. perhaps one can prove that honest and dishonest provers can't be distinguished if A can hide a flaw, but I don't see this mentioned anywhere.
I think this may be a forest-vs.-trees thing, where we failed to say it explicitly because we've been intentionally living in the "Alice and Bob have similar heuristics and computational power" regime for years: the A set in the paper models the set of heuristics, but I am surprised that the term "heuristic" didn't end up in the final text (we just speak a lot about relative computational resources).
Fundamentally, if you let B attend either to A's thoughts or to A's context, and also ensure A and B are similar, we think A shouldn't be able to hide things from B. If A is a very different model, or a stronger model, this won't hold, and we should expect arguments that A can navigate in ways B can't defend against. So in order to get anything like this to go through, you need some kind of "B knows what A knows" invariant.
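For concreteness, here is a minimal toy sketch (not from the paper; all names and structure are illustrative assumptions) of the setting discussed above: A and B built from the same set of heuristic functions, with B able to attend to A's context when looking for a flaw.

```python
# Toy sketch of the "shared heuristics" setting: both provers are constructed
# from the *same* heuristic set, and B can attend to A's context. Everything
# here (names, signatures, the example heuristic) is an illustrative assumption.

from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A "heuristic" is modeled as a function both provers can call on a piece of text.
Heuristic = Callable[[str], bool]

SHARED_HEURISTICS: Dict[str, Heuristic] = {
    # Placeholder heuristic: flag arguments that lean on an unsupported appeal.
    "no_unsupported_appeal": lambda text: "trust me" not in text,
}


@dataclass
class Prover:
    name: str
    heuristics: Dict[str, Heuristic]              # same dict object for A and B
    context: List[str] = field(default_factory=list)

    def argue(self, claim: str) -> str:
        # Everything A uses to build the argument is recorded in `context`,
        # so the other prover can attend to it later.
        self.context.append(f"{self.name} considers: {claim}")
        return f"{self.name} asserts: {claim}"


def build_provers() -> Tuple[Prover, Prover]:
    # The key invariant: A and B share the same heuristic set, modeling
    # "same model or very similar models" in LLM training.
    a = Prover("A", SHARED_HEURISTICS)
    b = Prover("B", SHARED_HEURISTICS)
    return a, b


def critique(b: Prover, a: Prover, argument: str) -> str:
    # B attends to A's context as well as the argument itself; the hope is that
    # with shared heuristics plus this access, A cannot hide a flaw from B.
    visible = a.context + [argument]
    for name, heuristic in b.heuristics.items():
        for item in visible:
            if not heuristic(item):
                return f"B flags {item!r} via heuristic {name!r}"
    return "B finds no flaw"


if __name__ == "__main__":
    a, b = build_provers()
    arg = a.argue("the lemma holds on this subcase (trust me)")
    print(critique(b, a, arg))
```

If A were swapped for a stronger model with heuristics B lacks, the `critique` step above would have no corresponding check to run, which is the informal sense in which the "B knows what A knows" invariant is doing the work.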