Someone thought it would be useful to quickly write up a note on my thoughts on scalable oversight research, e.g., research into techniques like debate or, more generally, into improving the quality of human oversight using AI assistance or other methods. Broadly, my view is that this is a good research direction, and I’m reasonably optimistic that work along these lines can improve our ability to effectively oversee somewhat smarter AIs, which seems helpful (on my views about how the future will go).
I’m most excited for:
work using control-style adversarial analysis where the aim is to make it difficult for AIs to subvert the oversight process (if they were trying to do this)
work which tries to improve outputs in conceptually loaded hard-to-check cases like philosophy, strategy, or conceptual alignment/safety research (without necessarily doing any adversarial analysis and potentially via relying on generalization)
work which aims to robustly detect (or otherwise mitigate) reward hacking in highly capable AIs, particularly AIs which are capable enough that by default human oversight would often fail to detect reward hacks[1]
I’m skeptical of scalable-oversight-style methods (e.g., debate, IDA) actually being “scalable” in the sense of scaling to arbitrarily powerful models,[2] and I think scalable oversight researchers should broadly be imagining targeting AIs at a human-ish or somewhat superhuman level of general capabilities (while they might still be very superhuman in narrower domains). In other words, I think scalable-oversight-style work should focus on a regime like the one we’re imagining targeting with AI control; this could be for controlling AIs, for getting more safety work out of AIs, or for making fully deferring to AI systems (at around this level of capability) more likely to go well.
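To make the kind of scheme I have in mind concrete, here is a minimal, purely illustrative sketch of a single debate-style oversight round: two debaters argue for competing answers and a (weaker) judge scores the resulting transcript. The `Agent`/`Judge` interfaces and the toy stand-ins are assumptions I’m making for the sketch, not a description of how any existing debate scheme is implemented.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interfaces for illustration: a debater maps a transcript-so-far to
# its next argument; a judge maps a finished transcript to a score in [0, 1].
Agent = Callable[[str], str]
Judge = Callable[[str], float]


@dataclass
class DebateResult:
    transcript: str
    verdict_for_a: float  # judge's credence that debater A's answer is correct


def run_debate(question: str, answer_a: str, answer_b: str,
               debater_a: Agent, debater_b: Agent, judge: Judge,
               n_rounds: int = 2) -> DebateResult:
    """Simplified debate: two debaters defend competing answers over a few
    rounds, then a (weaker) judge scores the transcript rather than the task."""
    transcript = f"Question: {question}\nA claims: {answer_a}\nB claims: {answer_b}\n"
    for r in range(n_rounds):
        transcript += f"[Round {r + 1}] A: {debater_a(transcript)}\n"
        transcript += f"[Round {r + 1}] B: {debater_b(transcript)}\n"
    return DebateResult(transcript, judge(transcript))


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; real debaters and judges
    # would be model calls and/or humans.
    debater_a = lambda t: "Evidence supporting my answer."
    debater_b = lambda t: "Evidence against A's answer."
    judge = lambda t: 0.5  # an uninformative judge
    result = run_debate("Is this code change safe to deploy?", "yes", "no",
                        debater_a, debater_b, judge)
    print(result.verdict_for_a)
```

The relevant structural point is just that the overseer only has to evaluate the transcript rather than the underlying task directly; whether that actually continues to work as the debaters get arbitrarily capable is exactly what I’m skeptical of.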
See also our prior work Benchmarks for Detecting Measurement Tampering and the motivation we discuss in that linked post, as well as this related project proposal I recently wrote. However, note that the linked documents mostly discuss using (sophisticated) methods that rely on model internals to succeed without much human supervision, which isn’t the sort of thing I’d most centrally call “scalable oversight”, though the term could be applied in this case.
Because of this, I think the name isn’t ideal.
On terminology, I prefer to say “recursive oversight” to refer to methods that leverage assistance from weaker AIs to oversee stronger AIs. IDA is a central example here. Like you, I’m skeptical of recursive oversight schemes scaling to arbitrarily powerful models.
However, I think it’s plausible that other oversight strategies (e.g. ELK-style strategies that attempt to elicit and leverage the strong learner’s own knowledge) could succeed at scaling to arbitrarily powerful models, or at least to substantially superhuman models. This is the regime that I typically think about and target with my work, and I think it’s reasonable for others to do so as well.
I agree with preferring “recursive oversight”.
Presumably the term “recursive oversight” also includes oversight schemes which leverage assistance from AIs of similar strength (rather than weaker AIs) to oversee some AI? (E.g., debate, recursive reward modeling.)
Note that I was pointing to a somewhat broader category than this which includes stuff like “training your human overseers more effectively” or “giving your human overseers better software (non-AI) tools”. But point taken.
Yeah, maybe I should have defined “recursive oversight” as “techniques that attempt to bootstrap from weak oversight to stronger oversight.” This would include IDA and task decomposition approaches (e.g. RRM). It wouldn’t seem to include debate, and that seems fine from my perspective. (And I indeed find it plausible that debate-shaped approaches could in fact scale arbitrarily, though I don’t think that existing debate schemes are likely to work without substantial new ideas.)
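To illustrate the “bootstrap from weak oversight to stronger oversight” shape I mean, here is a minimal hypothetical sketch of one amplification step (the distillation/training step is omitted): a weak overseer gets help from an AI assistant on decomposed subquestions, and thereby evaluates answers it couldn’t judge on its own. The interfaces, the `decompose` helper, and the toy stand-ins are assumptions for illustration only, not how IDA or RRM are actually implemented.

```python
from typing import Callable, List

# Hypothetical interfaces for illustration: a "model" maps a task to an answer,
# and an "overseer" maps (task, answer) to an approval score.
Model = Callable[[str], str]
Overseer = Callable[[str, str], float]


def decompose(task: str) -> List[str]:
    """Toy task decomposition; a real scheme would use a model for this step."""
    return [f"Check step {i} of: {task}" for i in range(3)]


def amplified_overseer(weak_overseer: Overseer, assistant: Model) -> Overseer:
    """One bootstrapping step: the weak overseer, briefed by an AI assistant's
    answers to decomposed subquestions, evaluates answers it couldn't judge alone."""
    def oversee(task: str, answer: str) -> float:
        subreports = [assistant(sub) for sub in decompose(task)]
        briefed_task = task + "\nAssistant notes:\n" + "\n".join(subreports)
        return weak_overseer(briefed_task, answer)
    return oversee


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs; the weak overseer only approves when it
    # has assistant notes to lean on.
    weak_overseer = lambda task, answer: 1.0 if "assistant notes" in task.lower() else 0.5
    assistant = lambda sub: f"Report on '{sub}': looks fine."
    stronger_overseer = amplified_overseer(weak_overseer, assistant)
    print(stronger_overseer("Audit this pull request", "LGTM"))
```

In the full recursive picture you would then train a stronger assistant against this amplified overseer and repeat, which is the part whose scaling to arbitrarily powerful models I doubt.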