On the performance of extract-and-evaluate vs CoT+action monitors: Sorry, we used the same color scheme to mean different things in different plots, this was probably a bad design choice! The plot where we break things down by task shows only extract-and-evaluate catch rates, and the colors represent different attacker models being monitored. The performance of the CoT+action monitor would be substantially lower across almost all side tasks and attacker models. I changed the last two plots in the post and hope this is clearer now.
On why this is better than a scratchpad-based monitor: CoT+Action monitors often fail by noticing the malicious behavior but then assigning a low suspicion score anyway, e.g. because they rationalize the side task as being a part of the agent’s instructions and thus benign. Extract-and-evaluate solves this issue in most cases and that seems to be the main source of the improvement in catch rates. We think that this might be solved by using more capable monitors, but in the meantime, it seems important to apply elicitation methods like extract-and-evaluate to avoid presenting an overly pessimistic outlook on the usefulness of CoT monitoring. We also think that even much more capable monitors might get distracted when the attackers become better at producing convincing post-hoc justifications or when transcripts get extremely long, so extract-and-evaluate may also be a relevant technique to try in those cases.
Thanks for the feedback!
On the performance of extract-and-evaluate vs CoT+action monitors: Sorry, we used the same color scheme to mean different things in different plots, this was probably a bad design choice! The plot where we break things down by task shows only extract-and-evaluate catch rates, and the colors represent different attacker models being monitored. The performance of the CoT+action monitor would be substantially lower across almost all side tasks and attacker models. I changed the last two plots in the post and hope this is clearer now.
On why this is better than a scratchpad-based monitor: CoT+Action monitors often fail by noticing the malicious behavior but then assigning a low suspicion score anyway, e.g. because they rationalize the side task as being a part of the agent’s instructions and thus benign. Extract-and-evaluate solves this issue in most cases and that seems to be the main source of the improvement in catch rates. We think that this might be solved by using more capable monitors, but in the meantime, it seems important to apply elicitation methods like extract-and-evaluate to avoid presenting an overly pessimistic outlook on the usefulness of CoT monitoring. We also think that even much more capable monitors might get distracted when the attackers become better at producing convincing post-hoc justifications or when transcripts get extremely long, so extract-and-evaluate may also be a relevant technique to try in those cases.