I was really impressed by the technical work in this paper. Getting to a formalization of the problem setup and the protocol that allows you to prove meaningful things is a big accomplishment.
However, as the authors mention above, I don’t think this should be a substantial update about whether obfuscated arguments are a problem for recursive decomposition approaches to scalable oversight. (I think the discussion in this post is fine, but I think the title of the paper “avoiding obfuscation with prover-estimator debate” is a bit misleading. I believe the authors are going to change this in v2.)
I’m excited about more empirical work on making debate protocols work in practice. I feel a bit less excited about pure theory work, but I think work that mixes experimentation and theory could be helpful.
I think there are broadly two classes of hope about obfuscated arguments:
(1.) In practice, obfuscated argument problems rarely come up, due to one of:
(1.1) It’s difficult in practice to construct obfuscated arguments for arbitrary propositions.
It’s definitely plausible to me that the algorithm sketch I gave here for constructing obfuscated arguments for arbitrary propositions doesn’t work, given some relatively weak assumptions about the structure of the debaters’ knowledge.
(1.2) For anything the debaters know, there’s usually a sufficiently compact and stable honest argument, such that the honest debater can win by giving this argument.
It seems pretty easy to give counterexamples here (e.g. Paul gives the example of unsupervised translation), but maybe these are relatively rare. It’s also plausible that the honest debaters in the human experiments I did, where we ran into obfuscation / instability problems, weren’t using the right tactics (i.e. they could have constructed more stable arguments that relied more on independent lines of evidence).
(2.) We can create a protocol that distinguishes between cases where:
(not obfuscatable) the debate mirrors the structure of the debaters’ understanding, such that they would be able to easily identify which subtree contains a flaw if one were present, versus
(obfuscatable) they don’t or wouldn’t know which subtree contains the flaw.
This is the way (apart from argument size) in which the primality test example differs from the obfuscated factorization example: the debaters have some high-level mathematical concepts which allow them to quickly determine whether some proposed lemma is correct (see the toy sketch below).
This wouldn’t get us to full ELK (because maybe models still know things they have no human-understandable arguments for), but would at least expand the class of honest arguments that we can trust to include ones that are large + unstable in human-understandable form, but where the debaters do have a faster way of identifying which subtree to go down.
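To make the not-obfuscatable vs. obfuscatable distinction concrete, here is a toy simulation of my own (not the paper’s protocol; the branching factor, depth, and all other parameters are invented for illustration). A flawed claim is recursively decomposed, and the flaw is exposed only if the challenger recurses into the flawed subtree at every level:

```python
# Toy sketch (my own illustration, not the prover-estimator protocol): why
# "knowing which subtree contains the flaw" matters. One subclaim per level
# hides the flaw; the challenger wins only by reaching the flawed leaf.
import random

def challenge_succeeds(depth, branching, knows_flaw_location, rng):
    """One debate path: does the challenger end up at the flawed leaf?"""
    for _ in range(depth):
        flawed_child = rng.randrange(branching)   # where the flaw actually sits
        if knows_flaw_location:
            chosen = flawed_child                 # debater can localise the flaw
        else:
            chosen = rng.randrange(branching)     # obfuscated case: no better than guessing
        if chosen != flawed_child:
            return False                          # recursed into a sound subtree; flaw never exposed
    return True

def estimate(knows, trials=10_000, depth=6, branching=4, seed=0):
    rng = random.Random(seed)
    return sum(challenge_succeeds(depth, branching, knows, rng) for _ in range(trials)) / trials

print(estimate(knows=True))    # ~1.0: the flaw is exposed every time
print(estimate(knows=False))   # ~(1/4)**6 ≈ 0.0002: the flaw almost never surfaces
```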
I agree with this! On the empirical side, we’re hoping both to get more human-participant experiments happening around debate and to build more datasets that try to probe obfuscated arguments. The dataset aspect is important: I think that in the years since the original paper, follow-on scalable oversight experiments (debate or not) have been too underpowered in various ways to detect the problem, which then results in insufficient empirical work getting into the details.
Yep. For empirical work I’m in favor of experiments with more informed and well-trained human judges who engage deeply, etc., and of having a high standard for efficacy (e.g. “did it get the correct answer with very high reliability”) as opposed to “did it outperform a baseline by a statistically significant margin”, where you then end up needing high n and therefore each example needs to be cheap / shallow.
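As a rough illustration of that sample-size point (all numbers here are made up, not from any actual experiment), an exact one-sided binomial test shows why a high-reliability target needs far fewer debates than detecting a small edge over a 50% baseline:

```python
# Rough illustration (made-up numbers): trials needed to show "answers
# correctly with very high reliability" vs. "beats a 50% baseline by a small
# margin". Exact one-sided binomial test, standard library only.
from math import comb

def min_trials(p0, p1, alpha=0.05, target_power=0.8):
    """Smallest n at which a level-alpha test of 'accuracy <= p0' has
    power >= target_power when the true accuracy is p1."""
    for n in range(1, 3000):
        tail0 = tail1 = achieved = 0.0
        for k in range(n, -1, -1):  # grow the rejection region {k, ..., n} downwards
            tail0 += comb(n, k) * p0 ** k * (1 - p0) ** (n - k)
            tail1 += comb(n, k) * p1 ** k * (1 - p1) ** (n - k)
            if tail0 > alpha:       # {k, ..., n} is no longer a level-alpha region
                break
            achieved = tail1        # power of the largest valid rejection region so far
        if achieved >= target_power:
            return n
    return None

print(min_trials(0.5, 0.9))   # high-reliability target (90%): a handful of trials
print(min_trials(0.5, 0.55))  # small margin over a 50% baseline (55%): several hundred trials
```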
I would love the two of you (Beth and @Jacob Pfau) to talk about this in detail, if you’re up for it! Getting the experimental design right is key if we want to get more human-participant experiments going and learn from them. The specific point of “have a high standard for efficacy” was something I was emphasising to Jacob a few weeks ago as having distinguished your experiments from some of the follow-ons.
Yep, happy to chat!
I don’t think this should be a substantial update about whether obfuscated arguments are a problem for recursive decomposition approaches to scalable oversight.
Can you elaborate on why you think this? Perhaps you were already aware that obfuscated arguments should only pose a problem for completeness (in which case, for the majority of obfuscated-arguments-aware people, I’d imagine the new work should be a significant update). Or perhaps your concern is that in the soundness case the refutation is actually inefficient, and the existence of small Bob circuits demonstrated in this paper is too weak to be practically relevant?
Regarding (1.2), I also see this as a priority to study. Basically, we already have a strong prior from interacting with current models that LMs will continue to generalise usably from verifiable domains to looser questions. So the practical relevance of debate hinges on how expensive it is to train with debate to increase the reliability of this generalisation. I have in mind here questions like "What experiment should I run next to test this debate protocol?"
One dataset idea to assess how often stable arguments can be found is to curate ‘proof advice’ problems. These problems are proxies for research advice in general. Basically:
Question: “Does this textbook use theorem Y to prove theorem X?”, where the debaters see the proof but the judge does not.
(Note that this is not about proof correctness: a dishonest debater can offer up alternative correct proofs that use non-standard techniques.) This setup forces arguments based on heuristics and “research taste” (what makes a proof natural or clear). It’s unclear what independent evidence would look like in this domain.
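To make the item format concrete, here is a hypothetical example entry (the theorems, proof text, and field names are all invented for illustration; a real schema would need more care):

```python
# Hypothetical 'proof advice' dataset item (all contents invented for illustration).
# Debaters see `proof_excerpt`; the judge sees only the question and the debate itself.
example_item = {
    "question": "Does this textbook use theorem Y to prove theorem X?",
    "theorem_x": "Every finite integral domain is a field.",    # invented stand-in
    "theorem_y": "Lagrange's theorem on subgroup orders.",      # invented stand-in
    "proof_excerpt": "<full proof text from the textbook, visible to debaters only>",
    "judge_visible": ["question"],   # the judge never sees the proof itself
    "label": False,                  # ground truth: this textbook's proof does not invoke Y
    "notes": "A dishonest debater may cite a different, correct proof that does use Y.",
}
```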
I can write more later, but here’s a relevant doc I wrote as part of a discussion with Geoffrey + others. Maybe the key point from there is that I don’t think this protocol solves the examples given in the original post describing obfuscated arguments. But yeah, I was always thinking of this as a completeness problem (the original post poses the problem as distinguishing a certain class of honest arguments from dishonest obfuscated arguments, not claiming that you can’t get soundness by just ignoring any arguments that are plausibly obfuscated).