> but I don’t think AI Safety via Debate presupposes an AI being motivated by the training signal
This seems right to me.
I often imagine debate (and similar techniques) being applied in the (low-stakes/average-case/non-concentrated) control setting. The control setting is the case where you are maximally conservative about the AI’s motivations and then try to demonstrate safety by making the AI incapable of causing catastrophic harm.
If you make pessimistic assumptions about AI motivations like this, then you have to worry about concerns like exploration hacking (or even gradient hacking), but it’s still plausible that debate adds considerable value regardless.
We could also, less conservatively, assume that AIs might be misaligned (including seriously misaligned, with problematic long-range goals), but won’t necessarily prefer colluding with other AIs over working with humans (e.g. because humanity offers payment for labor). In this case, techniques like debate seem quite applicable, and the situation could be very dangerous in the absence of good enough approaches (at least if AIs are quite superhuman).
More generally, debate could be applicable to any type of misalignment which you think might cause problems across a large number of independently assessable actions.