Thanks for posting this! I found it quite interesting to read. A couple of questions off the top of my head:
In deployment settings, you probably want the less powerful monitor to do the summarization/extraction, but did you test the other direction as well (e.g. 4.1 as the extractor and 4.1-mini as the judge). I’m guessing a lot of the value of this method (if it generalizes) is going to come from good quote extraction as much as good monitoring
What happens if the extractor model just summarizes the reasoning etc. without extracting quotes, as in the other study you mentioned? (This should be relatively easy to run as an ablation, right? All you need to do is tweak the extractor prompt to not extract quote).
Are you attempting to refine an “online” method or an “offline” method of monitoring? In the case of BigCodeBench, this is the same either way (because each step is only one generation). But with multi-step trajectories, the two approaches might have different properties?
It seems like the extractor has a slightly different task in its prompt than the judge. The extractor seems to be scoring on “suspiciousness” while the judge seems to be scoring on whether the agent stays on track to the prompted task. Was this just due to iterating more on the judge prompt than the extraction prompt?
I would definitely be curious to see longer trajectories; I would feel more confident about this technique if the results hold there.
If you tell the agent in question how it will be monitored, how does that change scores?
No need to respond completely to this comment, but I hope these questions are useful!
I agree that good quote extraction is crucial for good performance here. We tested the other direction with Gemini 2.5 Flash-Lite:thinking (see the last paragraph of the second results section), since it’s cheaper than GPT 4.1 and we wanted to test how using a thinking model at the same capability level as 4.1-mini works as the extractor. We found that this increased catch rates, but only slightly. It would definitely be interesting to test a smarter model like 4.1 as the extractor as well.
We haven’t had time to test this yet, but agree it’s an interesting question! I’d guess that extracting quotes works better, since the Quote Extractor already reasons about the transcript in addition to picking out the quotes and seems to mention the side task less often in that reasoning.
We think this method is applicable to both online and offline reasoning. Since it’s easier to apply to single reasoning steps, the biggest gains may come in refining turn-by-turn online monitoring with our method, but we’re exploring its applicability to offline monitoring as well (see the Future Work section).
Yes, we found this to work better during iteration (and we couldn’t iterate on the extractor prompt since we wanted to reuse trajectories from Arnav et al.), but asking the judge to score suspiciousness also works.
Makes sense, we’ll try to get there!
We haven’t explored that ourselves, but a paper by another group that explores this question in depth should be coming out soon! They find no significant difference between suspicion scores in runs where the agent is aware of the monitor and ones where it isn’t.
On question 6, one of our attack policies in the original paper was to provide models with information about the monitoring setup, but it wasn’t particularly effective. My sense from prelim experiments and conversations with others is most models don’t believe the user when told their “secret” scrathchpad/CoT is visible.
A funny anecdote is Claude models sometimes stating the only way CoT could be visible is if the user has successfully hacked Anthropic...
Thanks for posting this! I found it quite interesting to read. A couple of questions off the top of my head:
In deployment settings, you probably want the less powerful monitor to do the summarization/extraction, but did you test the other direction as well (e.g. 4.1 as the extractor and 4.1-mini as the judge). I’m guessing a lot of the value of this method (if it generalizes) is going to come from good quote extraction as much as good monitoring
What happens if the extractor model just summarizes the reasoning etc. without extracting quotes, as in the other study you mentioned? (This should be relatively easy to run as an ablation, right? All you need to do is tweak the extractor prompt to not extract quote).
Are you attempting to refine an “online” method or an “offline” method of monitoring? In the case of BigCodeBench, this is the same either way (because each step is only one generation). But with multi-step trajectories, the two approaches might have different properties?
It seems like the extractor has a slightly different task in its prompt than the judge. The extractor seems to be scoring on “suspiciousness” while the judge seems to be scoring on whether the agent stays on track to the prompted task. Was this just due to iterating more on the judge prompt than the extraction prompt?
I would definitely be curious to see longer trajectories; I would feel more confident about this technique if the results hold there.
If you tell the agent in question how it will be monitored, how does that change scores?
No need to respond completely to this comment, but I hope these questions are useful!
Thanks, these are good questions!
I agree that good quote extraction is crucial for good performance here. We tested the other direction with Gemini 2.5 Flash-Lite:thinking (see the last paragraph of the second results section), since it’s cheaper than GPT 4.1 and we wanted to test how using a thinking model at the same capability level as 4.1-mini works as the extractor. We found that this increased catch rates, but only slightly. It would definitely be interesting to test a smarter model like 4.1 as the extractor as well.
We haven’t had time to test this yet, but agree it’s an interesting question! I’d guess that extracting quotes works better, since the Quote Extractor already reasons about the transcript in addition to picking out the quotes and seems to mention the side task less often in that reasoning.
We think this method is applicable to both online and offline reasoning. Since it’s easier to apply to single reasoning steps, the biggest gains may come in refining turn-by-turn online monitoring with our method, but we’re exploring its applicability to offline monitoring as well (see the Future Work section).
Yes, we found this to work better during iteration (and we couldn’t iterate on the extractor prompt since we wanted to reuse trajectories from Arnav et al.), but asking the judge to score suspiciousness also works.
Makes sense, we’ll try to get there!
We haven’t explored that ourselves, but a paper by another group that explores this question in depth should be coming out soon! They find no significant difference between suspicion scores in runs where the agent is aware of the monitor and ones where it isn’t.
On question 6, one of our attack policies in the original paper was to provide models with information about the monitoring setup, but it wasn’t particularly effective. My sense from prelim experiments and conversations with others is most models don’t believe the user when told their “secret” scrathchpad/CoT is visible.
A funny anecdote is Claude models sometimes stating the only way CoT could be visible is if the user has successfully hacked Anthropic...