One way to think about amplification or debate is that they’re methods for accelerated evaluation of large computations: instead of letting the debaters choose where in the computation to branch, you could just take all branches and do the full exponential work. Then safety splits into
1. Are all perturbations of the unaccelerated computation safe?
2. If we train for debate, do we get one of those?
If humans are systematically biased, this can break (1) before we even get to (2). It may still be possible to shift some of the load from the unaccelerated computation to the protocol, by finding protocols that are robust to certain classes of systematic error (this post discusses that). This is a big issue, and one where we’ll be trying to get more work to happen. A particular concern is that many organisations are planning to use scalable oversight for automated safety research, and people are eager to be optimistic that new safety schemes will work.
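The split above can be made concrete with a toy model (all names here are mine, not from any actual debate implementation): represent a large computation as a binary tree whose leaves are human-checkable claims. The unaccelerated evaluation checks every leaf, paying exponential cost but needing no trust; a debate-style protocol follows a single root-to-leaf path picked by the debaters, paying linear cost but inheriting any bias in how the branch is chosen.

```python
def full_evaluation(tree):
    """Unaccelerated: check every branch. Exponential in depth, no trust needed."""
    if isinstance(tree, bool):
        return tree                      # leaf: a human-checkable claim
    left, right = tree
    return full_evaluation(left) and full_evaluation(right)

def debate_evaluation(tree, choose_branch):
    """Accelerated: follow the one branch the debaters point at.
    Linear in depth, but correctness now rests on choose_branch
    steering toward a flaw whenever one exists."""
    if isinstance(tree, bool):
        return tree
    left, right = tree
    return debate_evaluation(choose_branch(left, right), choose_branch)

def pick_flawed(left, right):
    """An idealised honest debater: steer into a flawed subtree if any."""
    return right if full_evaluation(left) else left

# A computation with one buried flaw:
tree = ((True, True), (True, False))
# full_evaluation(tree) finds the flaw; debate_evaluation with pick_flawed
# matches it, while a systematically biased chooser (always-left) misses it.
```

The biased chooser `lambda l, r: l` returns `True` on this tree, illustrating how the accelerated protocol can diverge from the unaccelerated computation exactly when branch selection is systematically skewed — which is the failure mode (1) is meant to rule out.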