Cool idea! I like this idea as a cost-reduction technique. It is similar to cost-reduction techniques people use when there is a human in the loop (e.g. see Docent), and I think it’s a cool idea to apply this to a pair of LLMs!
From a quick read, I did not understand why the overall performance of extract-and-evaluate is higher than CoT+action monitors, but when you break things down by attack type, CoT+monitor does much better for long impl and extract-and-evaluate does much better for edge cases, with performance of both being high for other attacks. Is it that you have way more “edge cases” attacks?
I am also unsure why you think this should work better than a scratchpad-based monitor (besides cost reduction). If you have a scratchpad, a model that uses its scratchpad well enough (which it should if it was RLed on monitoring-like tasks) should be able to learn to do the extraction and then the evaluation? Is the worry that the model would get distracted by other content?
On the performance of extract-and-evaluate vs CoT+action monitors: Sorry, we used the same color scheme to mean different things in different plots, this was probably a bad design choice! The plot where we break things down by task shows only extract-and-evaluate catch rates, and the colors represent different attacker models being monitored. The performance of the CoT+action monitor would be substantially lower across almost all side tasks and attacker models. I changed the last two plots in the post and hope this is clearer now.
On why this is better than a scratchpad-based monitor: CoT+Action monitors often fail by noticing the malicious behavior but then assigning a low suspicion score anyway, e.g. because they rationalize the side task as being a part of the agent’s instructions and thus benign. Extract-and-evaluate solves this issue in most cases and that seems to be the main source of the improvement in catch rates. We think that this might be solved by using more capable monitors, but in the meantime, it seems important to apply elicitation methods like extract-and-evaluate to avoid presenting an overly pessimistic outlook on the usefulness of CoT monitoring. We also think that even much more capable monitors might get distracted when the attackers become better at producing convincing post-hoc justifications or when transcripts get extremely long, so extract-and-evaluate may also be a relevant technique to try in those cases.
Cool idea! I like this idea as a cost-reduction technique. It is similar to cost-reduction techniques people use when there is a human in the loop (e.g. see Docent), and I think it’s a cool idea to apply this to a pair of LLMs!
From a quick read, I did not understand why the overall performance of extract-and-evaluate is higher than CoT+action monitors, but when you break things down by attack type, CoT+monitor does much better for long impl and extract-and-evaluate does much better for edge cases, with performance of both being high for other attacks. Is it that you have way more “edge cases” attacks?
I am also unsure why you think this should work better than a scratchpad-based monitor (besides cost reduction). If you have a scratchpad, a model that uses its scratchpad well enough (which it should if it was RLed on monitoring-like tasks) should be able to learn to do the extraction and then the evaluation? Is the worry that the model would get distracted by other content?
Thanks for the feedback!
On the performance of extract-and-evaluate vs CoT+action monitors: Sorry, we used the same color scheme to mean different things in different plots, this was probably a bad design choice! The plot where we break things down by task shows only extract-and-evaluate catch rates, and the colors represent different attacker models being monitored. The performance of the CoT+action monitor would be substantially lower across almost all side tasks and attacker models. I changed the last two plots in the post and hope this is clearer now.
On why this is better than a scratchpad-based monitor: CoT+Action monitors often fail by noticing the malicious behavior but then assigning a low suspicion score anyway, e.g. because they rationalize the side task as being a part of the agent’s instructions and thus benign. Extract-and-evaluate solves this issue in most cases and that seems to be the main source of the improvement in catch rates. We think that this might be solved by using more capable monitors, but in the meantime, it seems important to apply elicitation methods like extract-and-evaluate to avoid presenting an overly pessimistic outlook on the usefulness of CoT monitoring. We also think that even much more capable monitors might get distracted when the attackers become better at producing convincing post-hoc justifications or when transcripts get extremely long, so extract-and-evaluate may also be a relevant technique to try in those cases.