I agree that good quote extraction is crucial for good performance here. We tested the other direction with Gemini 2.5 Flash-Lite:thinking (see the last paragraph of the second results section), since it’s cheaper than GPT-4.1 and we wanted to test how using a thinking model at the same capability level as GPT-4.1 mini works as the extractor. We found that this increased catch rates, but only slightly. It would definitely be interesting to test a smarter model like GPT-4.1 as the extractor as well.
We haven’t had time to test this yet, but agree it’s an interesting question! I’d guess that extracting quotes works better, since the Quote Extractor already reasons about the transcript in addition to picking out the quotes and seems to mention the side task less often in that reasoning.
We think this method is applicable to both online and offline monitoring. Since it’s easier to apply to single reasoning steps, the biggest gains may come from refining turn-by-turn online monitoring with our method, but we’re exploring its applicability to offline monitoring as well (see the Future Work section).
Yes, we found this to work better during iteration (and we couldn’t iterate on the extractor prompt since we wanted to reuse trajectories from Arnav et al.), but asking the judge to score suspiciousness also works.
Makes sense, we’ll try to get there!
We haven’t explored that ourselves, but a paper by another group that explores this question in depth should be coming out soon! They find no significant difference between suspicion scores in runs where the agent is aware of the monitor and ones where it isn’t.
Thanks, these are good questions!