This seems to imply that your position is that it would be easier for models to evade monitors when there is “complex reasoning for many outputs” than when there is “complex reasoning for a single output”. Is that correct? If so, why do you think that?
No, my position is that it’s probably harder for models to evade monitors when there is “complex reasoning for many outputs”. Sorry, that wasn’t written clearly: it was meant to say that my top comment didn’t draw the distinction between tasks with a single output and tasks with many outputs (and instead only mentioned examples of other tasks with single outputs) because the paper tests settings closer to single outputs.
The last sentence was intended to be about my top comment in light of that: if the stated conclusion had been something like “difficult bad behavior is easy to monitor when models are ambivalent about their reasoning being monitored” or “difficult bad behavior is easy to monitor for current models in tasks requiring complex reasoning for many outputs, regardless of propensity”, I would have agreed[1]. But the stated conclusion is instead that difficult bad behaviors are generally easy to monitor, which I don’t think the results are sufficient to imply.

[1] Though in the latter case it’d probably be important to plot number of outputs against monitorability and compare across models over time.
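(To gesture at what that plot might look like: a minimal sketch with made-up numbers, assuming “monitorability” is operationalized as the monitor’s catch rate and that each line is one model generation. The model names and all data points below are hypothetical.)

```python
# Hypothetical sketch: plot monitor catch rate against the number of outputs
# a task requires, with one curve per model generation. All numbers invented.
import matplotlib.pyplot as plt

# Number of outputs the bad behavior must be spread across
num_outputs = [1, 2, 4, 8, 16, 32]

# Hypothetical catch rates (fraction of evasion attempts caught by the monitor)
catch_rates = {
    "model-gen-1": [0.70, 0.78, 0.85, 0.91, 0.95, 0.98],
    "model-gen-2": [0.55, 0.65, 0.76, 0.85, 0.92, 0.96],
}

for model, rates in catch_rates.items():
    plt.plot(num_outputs, rates, marker="o", label=model)

plt.xscale("log", base=2)  # output counts grow geometrically
plt.xlabel("Number of outputs required by the task")
plt.ylabel("Monitor catch rate")
plt.title("Monitorability vs. task output count (hypothetical)")
plt.legend()
plt.show()
```

The interesting comparison would be whether the gap between generations widens at high output counts, i.e. whether later models get better at evading monitors even on tasks requiring complex reasoning across many outputs.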
Okay, that position seems to me equivalent to the position that models will be more capable of evading monitors when they intrinsically want to evade monitors than when they are explicitly prompted to do so. Do you also endorse that position? (If not, why isn’t it implied by your stance?)
That position seems pretty unlikely to me (for current models). Models are trained really hard to follow instructions in the prompt to the best of their ability.