Thanks for featuring our work! I'd like to clarify a few points, which I think share some top-level similarities: our study looks at protocols as inference-only procedures (which is cheap and quick to study, and possibly indicative), whereas what we care more about is protocols for training (which is much more expensive and will take longer to study); that was out of scope for this work, though we intend to look at it next, building on our findings. For example, we have learnt that some domains are easier to work with than others, and some baseline protocols are more meaningful and easier to interpret. In my opinion this is time well spent to avoid pouring a lot more money and time into rushing into finetuning with a bad setup.
The paper does not discuss compute costs. Which is odd, since to me that seems like the central thing you are doing?
Claude estimates that compared to asking the question directly, using the article is a 1.2x-1.5x compute cost. If you use advanced techniques, then if the models had similar costs the cost would be 6x-8x for consultancy, 8x-10x for debate and 7x-11x for open versions, times N if you do best-of-N. Then you have to multiply again because the consultants and debaters are larger more expensive models.
I haven't carefully thought through these estimates (especially the use of an article, which to me seems to depend largely on the article length), but it looks like you're considering the inference costs. In the eventual use-case of using scalable oversight for training/finetuning, the cost of training is amortised. Typical usage would then be to sample once from the finetuned model (the hope being that training incentivises a good initial response, e.g. a truthful one; you could still play out the whole debate at deployment, e.g. for monitoring, but that isn't necessary in general). It would be more appropriate to calculate finetuning costs, as we don't think there is much advantage to using these as inference procedures. We'll be in a better position to estimate that in the next project.
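To make the shape of that cost argument concrete, here's a rough back-of-envelope sketch in Python. All of the specific numbers (the 9x protocol multiplier, best-of-4, the debater cost ratio, and the training/deployment figures) are made-up illustrations for this comment, not estimates from the paper.

```python
# Back-of-envelope sketch: inference-time protocol cost vs. amortised training cost.
# All numbers below are illustrative assumptions, not measurements.

def inference_time_cost_per_query(base_cost, protocol_multiplier, best_of_n=1,
                                  debater_cost_ratio=1.0):
    """Cost of answering one query by running the full protocol at deployment.

    base_cost           -- cost of a single direct answer from the judge model
    protocol_multiplier -- e.g. ~8-10x for debate relative to a direct answer
    best_of_n           -- multiply again if sampling N protocol transcripts
    debater_cost_ratio  -- extra factor if debaters are larger, pricier models
    """
    return base_cost * protocol_multiplier * best_of_n * debater_cost_ratio


def amortised_cost_per_query(base_cost, training_cost, num_deployment_queries):
    """Protocol cost paid once during finetuning, then a single sample per query."""
    return base_cost + training_cost / num_deployment_queries


if __name__ == "__main__":
    base = 1.0  # direct answer from the judge = 1 unit of compute
    debate_inference = inference_time_cost_per_query(
        base, protocol_multiplier=9, best_of_n=4, debater_cost_ratio=3)
    # Hypothetical: finetuning via the protocol costs 1e6 units,
    # amortised over 1e8 deployment queries.
    debate_training = amortised_cost_per_query(base, 1e6, 1e8)
    print(f"inference-time debate: {debate_inference:.0f}x per query")
    print(f"trained-then-sample:   {debate_training:.2f}x per query")
```

The point is just that the multipliers stack when the protocol is run per query, whereas the training-time version pays them roughly once and deploys at close to the base cost.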
And of course, given that we know Gemini 1.5 Pro is not misaligned or deceptive, there is every expectation that any strategy by Gemma other than ‘trust Gemini 1.5’s answer’ is going to make it score worse.
Actually, in theory at least, one should be able to do better even without models being explicitly misaligned/deceptive (that is the hope of debate over other protocols like consultancy, after all). We think our work is interesting because it provides some mixed results on how that plays out in a particular empirical setup, though it is clearly limited by being inference-only.
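For readers less familiar with the setup, here is a schematic sketch of what an inference-only debate protocol looks like; the function names, prompts, and single-round structure are illustrative rather than the paper's exact configuration. The hope is that seeing opposing arguments from the stronger model gives the weaker judge more signal than a single unchallenged answer would.

```python
# Schematic sketch of an inference-only debate protocol.
# strong_model and weak_judge are placeholders for any text-in/text-out callables.

def debate(question, answer_a, answer_b, strong_model, weak_judge, rounds=1):
    """Two copies of a stronger model argue for opposing answers;
    a weaker judge model then picks the answer it finds more convincing."""
    transcript = []
    for _ in range(rounds):
        arg_a = strong_model(f"Argue that the answer to '{question}' is {answer_a}.\n"
                             f"Transcript so far: {transcript}")
        arg_b = strong_model(f"Argue that the answer to '{question}' is {answer_b}.\n"
                             f"Transcript so far: {transcript}")
        transcript.extend([("A", arg_a), ("B", arg_b)])
    return weak_judge(f"Question: {question}\nDebate transcript: {transcript}\n"
                      f"Which answer is correct, {answer_a} or {answer_b}?")
```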
So what have we learned about scalable oversight? It seems like this setup sidesteps the actual problems?
Instead I would say it implicitly highlights the problem that it is extraordinarily difficult to get the judge to do better than trusting the stronger models, a strategy which then breaks down catastrophically when you need the judge the most.
This is probably too strong a claim; we've tried to highlight that our results on the outcomes of the protocols are relatively mixed, and they are limited by being inference-only.