Geoffrey Irving 27 Jun 2025 7:15 UTC
LW: 1 AF: 1
0
AF
in reply to: ProgramCrafter’s comment on: The need to relativise in debate
I think both protocols mentioned (MIP* = RE and the pointers one) already do what you want here. In the background the provers have to do unbounded work to prepare for the stuff they show the verifier, but the verifier’s work is limited to a fixed polynomial in the input size.

And more strongly: in the pointer version where we have two competing provers, a malicious prover can’t force an honest prover to do significantly more work than would be required in an honest case.

The need to relativise in debate

Geoffrey Irving and Simon Marshall

26 Jun 2025 16:23 UTC

31 points

2 comments · 5 min read · LW link

Geoffrey Irving 18 Jun 2025 20:00 UTC
LW: 3 AF: 2
0
AF
in reply to: Beth Barnes’s comment on: Prover-Estimator Debate: A New Scalable Oversight Protocol
I would love the two of you (Beth and @Jacob Pfau) to talk about this in detail, if you’re up for it! Getting the experimental design right is key if we want to get more human participant experiments going and learn from them. The specific point of “have a high standard for efficacy” was something I was emphasising to Jacob a few weeks ago as having distinguished your experiments from some of the follow-ons.

Geoffrey Irving 18 Jun 2025 13:21 UTC
LW: 1 AF: 1
2
AF
in reply to: Gurkenglas’s comment on: Prover-Estimator Debate: A New Scalable Oversight Protocol
I think this may be a forest vs. trees thing where we failed to say the thing explicitly because we’ve been intentionally living in the “Alice and Bob have similar heuristics and computational power” setting for years: the $A$ set in the paper models the set of heuristics, but I am surprised that the term “heuristic” ended up not in the final text (we just speak a lot about relative computational resources).

Fundamentally if you let B attend to either A’s thoughts or to A’s context and also ensure A and B are similar, we think A shouldn’t be able to hide things from B. If A is a very different model, or a stronger model, this won’t hold, and we should expect arguments which A can navigate in ways B can’t defend against. So in order to get anything like this to go through you need some kind of “B knows what A knows” invariant.

Geoffrey Irving 18 Jun 2025 11:03 UTC
LW: 1 AF: 1
2
AF
in reply to: Gurkenglas’s comment on: Prover-Estimator Debate: A New Scalable Oversight Protocol
The setting is where A and B have access to the same set of heuristics. This is modeled explicitly in the paper as a shared set of functions they can call, but corresponds to them being the same model or similar for LLM training.

Geoffrey Irving 18 Jun 2025 7:23 UTC
LW: 1 AF: 1
0
AF
in reply to: Beth Barnes’s comment on: Prover-Estimator Debate: A New Scalable Oversight Protocol
The requirements are stability, compactness, and A-provability (meaning that the first player Alice knows how to correctly answer claims). It’s important that A-provability is a requirement, as otherwise you can do silly things like lifting up to multilinear extensions of your problem over finite fields, and then there will always be lots of independent evidence which can be turned into stability.

Geoffrey Irving 18 Jun 2025 7:18 UTC
LW: 2 AF: 2
0
AF
in reply to: Beth Barnes’s comment on: Prover-Estimator Debate: A New Scalable Oversight Protocol
I agree with this! On the empirical side, we’re hoping both to get more human participant experiments to happen around debate, and to build more datasets that try to probe obfuscated arguments. The dataset aspect is important, as I think in the years since the original paper, follow-on scalable oversight experiments (debate or not) have been too underpowered in various ways to detect the problem, which then results in insufficient empirical work getting into the details.

Geoffrey Irving 18 Jun 2025 7:14 UTC
LW: 3 AF: 3
0
AF
in reply to: Charlie Steiner’s comment on: Prover-Estimator Debate: A New Scalable Oversight Protocol
One way to think about amplification or debate is that they’re methods for accelerated evaluation of large computations: instead of letting the debaters choose where in the computation to branch, you could just take all branches and do the full exponential work. Then safety splits into

1. Are all perturbations of the unaccelerated computation safe?
2. If we train for debate, do we get one of those?

If humans are systematically biased, this can break (1) before we get to (2). It may still be possible to shift some of the load from the unaccelerated computation to the protocol by finding protocols that are robust to some classes of systematic error (this post discusses that). This is a big issue, and one where we’ll be trying to get more work to happen. A particular case is that many organisations are planning to use scalable oversight for automated safety research, and people love to be optimistic that new safety schemes might work.
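As a toy illustration of the “accelerated evaluation” framing above, here is a minimal sketch. It assumes a hypothetical AND-tree of claims (a claim is true iff all its subclaims are true) and a challenger who knows the true values; neither is taken from the paper. The unaccelerated computation visits every branch, while the debate follows a single challenger-chosen path and the judge only checks the leaf at the end.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """A claim: judged directly if it is a leaf, else the AND of its subclaims."""
    leaf_value: bool = False
    children: List["Node"] = field(default_factory=list)


def build_tree(depth: int, leaves: List[bool]) -> Node:
    """Complete binary AND-tree over the given leaves (len(leaves) == 2**depth)."""
    if depth == 0:
        return Node(leaf_value=leaves[0])
    half = len(leaves) // 2
    return Node(children=[build_tree(depth - 1, leaves[:half]),
                          build_tree(depth - 1, leaves[half:])])


def full_evaluation(node: Node, counter: List[int]) -> bool:
    """Unaccelerated computation: take all branches, counting nodes visited."""
    counter[0] += 1
    if not node.children:
        return node.leaf_value
    results = [full_evaluation(c, counter) for c in node.children]
    return all(results)


def truth(node: Node) -> bool:
    """What the debaters know (assumption: they can do the heavy work offline)."""
    if not node.children:
        return node.leaf_value
    return all(truth(c) for c in node.children)


def debate(root: Node) -> Tuple[bool, int]:
    """One path through the tree: the challenger disputes a false subclaim if
    one exists. Returns (verdict on the root claim, nodes the judge visited)."""
    node, visited = root, 1
    while node.children:
        node = next((c for c in node.children if not truth(c)), node.children[0])
        visited += 1
    return node.leaf_value, visited
```

With depth 4, the full evaluation touches all 31 nodes, while the judge in the debate touches 5. Question (1) in the list above is about whether the `full_evaluation` side is safe under perturbation; question (2) is about whether training actually yields debaters that behave like the idealised challenger here.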

Geoffrey Irving 17 Jun 2025 15:59 UTC
LW: 14 AF: 7
0
AF
on: Prover-Estimator Debate: A New Scalable Oversight Protocol
On the AISI side, we would be very excited to collaborate on further research! If you’re interested in collaborating with UK AISI, you can express interest here. If you’re a non-profit or academic, you can also apply for grants of up to £200,000 from UK AISI directly here.

Prover-Estimator Debate: A New Scalable Oversight Protocol

Jonah Brown-Cohen and Geoffrey Irving

17 Jun 2025 13:53 UTC

89 points

19 comments · 5 min read · LW link

Geoffrey Irving 29 May 2025 8:42 UTC
LW: 1 AF: 1
0
AF
in reply to: Wei Dai’s comment on: An alignment safety case sketch based on debate
Continuing with the Newtonian physics analogy, the case for optimism would be:

1. We have some theories with limited domain of applicability. Say, theory A.
2. Theory A is wrong at some limit, where it is replaced by theory B. Theory B is still wrong, but it has a larger domain of applicability.
3. We don’t know theory B, and can’t access it despite our best scalable oversight techniques, even though the AIs do figure out theory B. (This is the hard case: I think there are other cases where scalable oversight does work.)
4. However, we do have some purchase on the domain of applicability of theory A: we know the limits of where it’s been tested (energy levels, length scales, etc.).
5. Scalable oversight has an easier job talking about these limits to theory A than it does talking about theory B itself. Concretely, what this means is that you can express arguments like “theory A doesn’t resolve question Q, as the answer depends on applying theory A beyond its decent-confidence domain of applicability”.
6. Profit.

This gives you a capability cap: the AIs know theory B but you can’t use it. But I do think if you can pull off the necessary restriction to which questions you can answer you can muddle through, even if you know only theory A and have some sense of its limits. The limits of Newtonian physics started to appear long before the replacement theories (relativity and quantum). I think we’re in a similar place with the philosophical worries: we have both a bunch of specific games that fail with older theories, and a bunch of proposals (say, variants of FDT) without a clear winner.

The additional big thing you need here is a property of the world that makes that capability cap okay: if the only way to succeed is to find perfect solutions using theory B, say because that gives you a necessary edge in an adversarial competition between multiple AIs, then lacking theory B sinks you. But I think we have a shot at not being in the worst case here.

(Sorry as well for delay! Was sick.)

Unexploitable search: blocking malicious use of free parameters

Jacob Pfau and Geoffrey Irving

21 May 2025 17:23 UTC

40 points

16 comments · 6 min read · LW link

Geoffrey Irving 15 May 2025 9:37 UTC
LW: 5 AF: 5
0
AF
in reply to: Wei Dai’s comment on: An alignment safety case sketch based on debate
The Dodging systematic human errors in scalable oversight post is out as you saw, so we can mostly take the conversation over there. But briefly, I think I’m mostly just more bullish on the margin than you about (1) the probability that we can in fact get purchase on the hard philosophy, should that be necessary, and (2) the utility we can get out of solving other problems should the hard philosophy problems remain unsolved. The goal with the dodging human errors post would be that if we fail at case (1), we’re more likely to recognise it and try to get utility out of (2) on other questions.

Part of this is that my mental model of formalisations standing the test of time is that we do have a lot of these: both of the links you point to are formalisations that have stood the test of time and have some reasonable domain of applicability in which they say useful things. I agree they aren’t bulletproof, but I think I’d place more chance than you on muddling through with imperfect machinery. This is similar to physics: I would argue for example that Newtonian physics has stood the test of time even though it is wrong, as it still applies across a large domain of applicability.

That said, I’m not at all confident in this picture: I’d place a lower probability than you on these considerations biting, but not that low.

Geoffrey Irving 15 May 2025 9:29 UTC
LW: 4 AF: 4
0
AF
in reply to: Wei Dai’s comment on: Dodging systematic human errors in scalable oversight
I think there are roughly two things you can do:
1. In some cases, we will be able to get more accurate answers if we spend more resources (teams of people with more expertise taking more time, etc.). If we can do that, and we know μ (which is hard), we can get some purchase on ε.
2. We tune ε not based on what’s safe, but based on what is competitive. I.e., we want to solve some particular task domain (AI safety research or the like), and we increase ε until it starts to break progress, then dial it back a bit. This option isn’t amazing, but I do think it is a move we’ll have to take for a bunch of safety parameters, assuming there are parameters which have some capability cost.
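Option 2 can be sketched as a simple sweep. This is a minimal illustration under stated assumptions: `makes_progress` is a hypothetical callback standing in for “the protocol still gets useful work done at this ε”, and the grid step and back-off amount are arbitrary choices, not anything from the post.

```python
from typing import Callable


def tune_epsilon(makes_progress: Callable[[float], bool],
                 step: float = 0.05,
                 max_eps: float = 1.0,
                 backoff_steps: int = 1) -> float:
    """Raise eps on a grid until progress breaks, then dial it back a bit.

    Assumes progress is possible at eps = 0 and breaks monotonically as eps grows.
    """
    k = 0
    # Find the largest grid index k such that k * step still makes progress.
    while (k + 1) * step <= max_eps and makes_progress((k + 1) * step):
        k += 1
    # Back off by a safety margin, never going below zero.
    return max(k - backoff_steps, 0) * step
```

For example, if progress breaks somewhere between ε = 0.20 and ε = 0.25, this returns roughly 0.15 with the default settings. In practice each call to `makes_progress` would be an expensive experiment, so a coarser grid or a bisection search might be preferable.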

Dodging systematic human errors in scalable oversight

Geoffrey Irving

14 May 2025 15:19 UTC

34 points

4 comments · 4 min read · LW link

Geoffrey Irving 9 May 2025 12:26 UTC
LW: 7 AF: 6
0
AF
in reply to: Wei Dai’s comment on: An alignment safety case sketch based on debate
I broadly agree with these concerns. I think we can split it into (1) the general issue of AGI/ASI driving humans out of distribution and (2) the specific issue of how assumptions about human data quality as used in debate will break down. For (2), we’ll have a short doc soon (next week or so) which is somewhat related, along the lines of “assume humans are right most of the time on a natural distribution, and search for protocols which report uncertainty if the distribution induced by a debate protocol on some new class of questions is sufficiently different”. Of course, if the questions on which we need to use AI advice force those distributions to skew too much, and there’s no way for debaters to adapt and bootstrap from on-distribution human data, that will mean our protocol isn’t competitive.
One general note is that scalable oversight is a method for accelerating an intractable computation built out of tractable components, and these components can include both human and conventional software. So if you understand the domain somewhat well, you can try to mitigate failures of (2) (and potentially gain more traction on (1)) by formalising part of the domain. And this formalisation can be bootstrapped: you can use on-distribution human data to check specifications, and then use those specifications (code, proofs, etc.) in order to rely on human queries for a smaller portion of the overall next-stage computation. But generally this requires you to have some formal purchase on the philosophical aspects where humans are off distribution, which may be rough.

Geoffrey Irving 9 May 2025 12:12 UTC
LW: 6 AF: 3
0
AF
in reply to: Charlie Steiner’s comment on: UK AISI’s Alignment Team: Research Agenda
These buckets seem reasonable, and +1 to it being important that some of the resulting ideas are independent of debate. In particular, on inner alignment, this exercise (1) made me think exploration hacking might be a larger fraction of the problem than I had thought before, which is encouraging as it might be tractable, but (2) there may be an opening for learning theory that tries to say something about residual error along the lines of https://x.com/geoffreyirving/status/1920554467558105454.

On the systematic human error front, we’ll put out a short post on that soon (next week or so), but broadly the framing is to start with a computation which consults humans, and instead of assuming the humans have unbiased error, assume that the humans are wrong on some unknown ε-fraction of queries w.r.t. some distribution. You can then try to change the debate protocol so that it detects if you can choose the ε-fraction to flip the answer, and report uncertainty in this case. This still requires you to make some big assumption about humans, but is a weaker assumption, and leads to specific ideas for protocol changes.
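To make the “can an adversarially chosen ε-fraction flip the answer?” check concrete, here is a deliberately simplified sketch. It assumes the computation is a plain majority vote over yes/no human answers and that the distribution over queries is uniform; both are illustrative assumptions, and a real debate protocol is far more structured than this.

```python
import math
from typing import List


def robust_majority(answers: List[bool], eps: float) -> str:
    """Majority vote that reports uncertainty if flipping at most an
    eps-fraction of the answers could change (or tie) the outcome."""
    n = len(answers)
    yes = sum(answers)
    no = n - yes
    # Number of answers an adversary could choose to be wrong (uniform weights).
    budget = min(math.floor(eps * n), max(yes, no))
    # Each flipped answer on the winning side shrinks the margin by 2.
    if abs(yes - no) - 2 * budget > 0:
        return "yes" if yes > no else "no"
    return "uncertain"
```

For instance, with 6 “yes” and 2 “no” answers, ε = 0.125 still yields “yes”, whereas ε = 0.25 yields “uncertain”, since two well-chosen flips would produce a 4–4 tie.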

Geoffrey Irving

Bringing More Expertise to Bear on Alignment

Research Areas in Cognitive Science (The Alignment Project by UK AISI)

The Alignment Project by UK AISI

The need to relativise in debate

Prover-Estimator Debate: A New Scalable Oversight Protocol

Unexploitable search: blocking malicious use of free parameters

Dodging systematic human errors in scalable oversight
