Thanks for posting, this is a nice idea! A few questions which came to mind:
The Quote Extractor (QE) prompt focuses exclusively on catching suspicious reasoning—do you see the QE ignoring actions in practice, or did it also pick out quotes from the actions (e.g. suspicious code)?
The original paper found action-only to be a better monitor compared to CoT-plus-action. Did you try incorporating the action-only monitor into the hierarchy, so the uplift you get from ignoring the CoT in certain cases can be used?
E.g. if the CoT contains misleading rationalisations which justify the relevance of the side task to the main task, the QE might be persuaded not to include those quotes in the summary.
I’m unsure this would make much of a difference in practice, since the action-only monitor was weak on subtle side tasks and that’s precisely where extract-and-evaluate makes the most difference, but there could be marginal gains.
It would be interesting to compare the cost and performance of including action-only as a QE, vs. the original setup—i.e. compare the following cases:
Expensive model, 2-monitor hierarchy: QE (CoT+action), and Judge.
Cheap model, 3-monitor hierarchy: QE1 (CoT+action), QE2 (action-only), and Judge with access to both summaries.
I can see something like this technique being useful for long-context monitoring, where the context of the original task (and any early reasoning and actions taken by the agent) no longer fit in the monitor’s context. Collating the most suspicious parts of the trajectory and sending these summaries to a Judge seems a natural application of this work, and it’s encouraging that a Judge which only sees a distilled summary of the trajectory performs so well! Is this something you’ve thought about at all?
My sense is that the QE almost exclusively focuses on the reasoning. I agree that it would be more appropriate for the prompt to explicitly say that it should also consider the actions, but we didn’t want to change the prompt from Arnav et al. to be able to reuse their trajectories.
This is a good suggestion, thanks! We didn’t think of this, but I agree that showing the Judge both the quotes and the actions might work quite well in some cases. I’d also guess that it wouldn’t make much of a difference here, since the quotes that the QE extracts are much less influenced by post-hoc justifications than its score and the extract-and-evaluate monitor’s catch rates for non-subtle side tasks where seeing the actions helps are already almost perfect, but it might be an important thing to try in other environments.
I agree that this could be interesting in some environments, but probably not in CoT Red-Handed: the actions are fairly short code snippets, so using an action-only QE wouldn’t make that big of a difference in terms of cost.
Yes, we’re planning to test similar monitoring setups on long trajectories next! As we write in the Future Work section, there are some promising signs but also some complications: “We have been concurrently experimenting with SHADE-Arena and BashBench and have early evidence that CoT monitors fail for similar reasons in those environments, suggesting that extract-and-evaluate monitors may also improve the catch rates there. However, this requires more tinkering than our experiments in this post: BigCodeBench problems are single-step, meaning that it’s relatively easy for the monitor to extract the correct quotes. In long agentic trajectories, interactions are more complex and the monitor may pick up on a relatively larger proportion of behaviors unrelated to the side task when selecting the quotes. Furthermore, the Judge may consider a larger fraction of benign behaviors suspicious since it’s harder to provide the Judge all of the relevant context about the trajectories without showing it the full transcript in such tasks. Nevertheless, we’re optimistic that extract-and-evaluate monitoring can still work in such settings, e.g. by tasking the Quote Extractor with providing both the quote and the most relevant surrounding context to the Judge. Another way in which the technique may transfer to long agentic trajectories is by enhancing the quality of turn-by-turn monitors.”
Thanks for posting, this is a nice idea! A few questions which came to mind:
The Quote Extractor (QE) prompt focuses exclusively on catching suspicious reasoning—do you see the QE ignoring actions in practice, or did it also pick out quotes from the actions (e.g. suspicious code)?
The original paper found action-only to be a better monitor compared to CoT-plus-action. Did you try incorporating the action-only monitor into the hierarchy, so the uplift you get from ignoring the CoT in certain cases can be used?
E.g. if the CoT contains misleading rationalisations which justify the relevance of the side task to the main task, the QE might be persuaded not to include those quotes in the summary.
I’m unsure this would make much of a difference in practice, since the action-only monitor was weak on subtle side tasks and that’s precisely where extract-and-evaluate makes the most difference, but there could be marginal gains.
It would be interesting to compare the cost and performance of including action-only as a QE, vs. the original setup—i.e. compare the following cases:
Expensive model, 2-monitor hierarchy: QE (CoT+action), and Judge.
Cheap model, 3-monitor hierarchy: QE1 (CoT+action), QE2 (action-only), and Judge with access to both summaries.
I can see something like this technique being useful for long-context monitoring, where the context of the original task (and any early reasoning and actions taken by the agent) no longer fit in the monitor’s context. Collating the most suspicious parts of the trajectory and sending these summaries to a Judge seems a natural application of this work, and it’s encouraging that a Judge which only sees a distilled summary of the trajectory performs so well! Is this something you’ve thought about at all?
Thanks, good questions!
My sense is that the QE almost exclusively focuses on the reasoning. I agree that it would be more appropriate for the prompt to explicitly say that it should also consider the actions, but we didn’t want to change the prompt from Arnav et al. to be able to reuse their trajectories.
This is a good suggestion, thanks! We didn’t think of this, but I agree that showing the Judge both the quotes and the actions might work quite well in some cases. I’d also guess that it wouldn’t make much of a difference here, since the quotes that the QE extracts are much less influenced by post-hoc justifications than its score and the extract-and-evaluate monitor’s catch rates for non-subtle side tasks where seeing the actions helps are already almost perfect, but it might be an important thing to try in other environments.
I agree that this could be interesting in some environments, but probably not in CoT Red-Handed: the actions are fairly short code snippets, so using an action-only QE wouldn’t make that big of a difference in terms of cost.
Yes, we’re planning to test similar monitoring setups on long trajectories next! As we write in the Future Work section, there are some promising signs but also some complications: “We have been concurrently experimenting with SHADE-Arena and BashBench and have early evidence that CoT monitors fail for similar reasons in those environments, suggesting that extract-and-evaluate monitors may also improve the catch rates there. However, this requires more tinkering than our experiments in this post: BigCodeBench problems are single-step, meaning that it’s relatively easy for the monitor to extract the correct quotes. In long agentic trajectories, interactions are more complex and the monitor may pick up on a relatively larger proportion of behaviors unrelated to the side task when selecting the quotes. Furthermore, the Judge may consider a larger fraction of benign behaviors suspicious since it’s harder to provide the Judge all of the relevant context about the trajectories without showing it the full transcript in such tasks. Nevertheless, we’re optimistic that extract-and-evaluate monitoring can still work in such settings, e.g. by tasking the Quote Extractor with providing both the quote and the most relevant surrounding context to the Judge. Another way in which the technique may transfer to long agentic trajectories is by enhancing the quality of turn-by-turn monitors.”