Glad you published this post – very interesting work!
Looking at the monitor prompt in the appendix, one concern I have is that it seems pretty unrealistic for the monitor to be given (a large part of) the exact misalignment prompt given to the main policy. Instead, I’d imagine a realistic monitoring setting would involve a fairly generic prompt to look for signs of misalignment and suspicious behavior, which would include backdoors but probably many other bad behaviors as well. Do you think this may be unrealistically favoring the monitors in your experiments?
Relatedly, aren’t you worried about overestimating the difficulty of subtle reasoning because of your choice of domain? Backdoor insertion seems more easily verifiable/monitorable than other behaviors that have more plausible deniability – as an example, it seems potentially much harder to detect sandbagging with a monitor once the explicit choice to sandbag has become obfuscated.
As a last question, you say that “while we interpret certain behaviors as persuasion based on the model justifying obvious backdoors, we cannot definitively rule out that the model is performing some other form of cognitive work.” Doesn’t the fact that the subplot in Figure 2 (row 2, column 2) is increasing with more reasoning tokens (from roughly ~70% to ~80%) suggest that some steganographic cognitive work is happening in these “persuasion CoTs”, or at the very least that the single-forward-pass backdoor capabilities of the policy are increasing over the course of training? Curious if you have reasons to discount this possible explanation!