[Paper] When can we trust untrusted monitoring? A safety case sketch across collusion strategies

This research was completed for LASR Labs 2025 by Nelson Gardner-Challis, Jonathan Bostock, Georgiy Kozhevnikov and Morgan Sinclaire. The team was supervised by Joan Velja and Charlie Griffin (University of Oxford, UK AI Security Institute).

The full paper can be found here.

Tl;dr

We did a deep dive into untrusted monitoring, and explored several new collusion strategies:

Empirically, we ran control evaluations using these new strategies, finding that naive self-recognition with SFT triples the red team’s win rate compared to the For/While signal in previous work.
Conceptually, we developed a safety case sketch for untrusted monitoring, which leverages our taxonomy of collusion strategies, as well as our new statistical modelling approach.

Overall, our key finding is that arguing for the safety of an untrusted monitoring deployment requires resolving several important open problems that existing evaluations haven’t addressed; we identify at least five. To be clear, we think untrusted monitoring remains promising, with even basic protocols providing nontrivial risk reduction beyond trusted monitoring alone. Yet we also identified many ways that untrusted monitoring can go wrong that were not discussed in previous control evaluations [1, 2]. Our safety case sketch is intended to highlight the biggest remaining gaps.

Like all control work, our safety case sketch relies in several places on being able to upper-bound model performance with a human-AI red-team. Once models exceed the capabilities of human experts, we don’t know how to bound risk. We’re also only considering a very particular high-stakes internal deployment scenario for code-writing AIs, e.g. low-stakes control and human misuse concerns are outside our scope.

[The rest of this post assumes familiarity with untrusted monitoring; we recommend Rob Miles’ video for a friendly introduction.]

Our Contributions

Passive self-recognition via SFT (yellow) beats the previous state-of-the-art collusion strategy of causal collusion (red) when using the best blue team mitigation (summarisation). Our statistical model inputs the untrusted monitor (UM) collusion ROC curve, the UM suspiciousness score ROC curve, and the trusted monitor (TM) suspiciousness score ROC curve, in addition to deployment parameters, and outputs safety/usefulness curves with 80% confidence intervals.

1 - We Develop a Taxonomy of Collusion Strategies

We expand our experiments to explore four collusion strategies:

Causal collusion: The generator embeds a pre-shared secret signal (e.g. For/While) for the monitor to detect. This assumes the two instances have some way to pre-share a strategy, which may not occur in practice, and constrains models to use human-designed signals.
NEWish: Acausal collusion: The generator embeds a signal, but without prior coordination—the monitor must recognise it via Schelling points. We allow the monitor to reason about what signals the generator might use while viewing the code.
NEW: Passive self-recognition: The generator writes code normally (including occasional attacks). The monitor simply tries to tell whether the code was written by another instance of itself. In contrast to the previous two strategies, no active signal is embedded, so there’s nothing extra for a trusted monitor to detect.
NEW: Combined strategies: The generator uses multiple strategies simultaneously—e.g. embedding a For/While signal and relying on the monitor’s passive self-recognition.

We provide a larger taxonomy in the paper, including additional variants. In our experiments, we tested all 4 of these strategies against 3 blue team mitigations:

No signal jamming, as a baseline
Automated minification, by a deterministic (non-ML) program
Natural language summarisation, by the trusted model

Summarisation was more aggressive than the others (in terms of throwing out information), but also tended to be the best blue team protocols against the stronger collusion strategies.

2 - We Present a Safety Case Sketch

On the conceptual front, we also red-teamed the possible failure modes of untrusted monitoring, and we present this analysis as a safety case sketch. We emphasise that this is only a sketch: many important gaps are marked as “Missing Pieces”, and we ignore many other important issues as out of scope.

Compared to previous safety case work in AI control [3,4,5], our sketch is more “depth-first”: we consider a fairly particular threat model, with a certain untrusted monitoring protocol, with many simplifying assumptions about the deployment situation. From here, we get into the weeds of how to properly enumerate collusion strategies, how to conservatively evaluate each of them, and how to ensure the statistical modelling conservatively estimates safety. The top-level claim (C) is:

C: If an untrusted model is deployed with [protocol] under the [deployment parameters] and it is attacking, it has less than a p probability of executing a catastrophic attack during its internal deployment.

The safety argument is based on a statistical model which we developed as part of this work.

The core claims required for our untrusted monitoring safety case sketch.

3 - We Develop an Improved Statistical Model

We developed a new statistical modelling approach to conservatively estimate safety under the assumption that the untrusted AI’s tactical reasoning is optimal. The idea is that if its collusion abilities follow a given ROC curve, and it also knows the deployment parameters (e.g. how many honeypots it’s up against) then there is a statistically optimal FPR vs. FNR tradeoff it can take regarding when to lie about suspicious code. This allows us to extrapolate directly from elicited capabilities without additional assumptions about the untrusted model’s strategic sophistication.

The statistical model relies on bounding four elements:

The collusion ROC curve R: how well the monitor can distinguish real attacks from honeypots. We decompose this by strategy class.
Model validity: the statistical model combining these bounds must itself be conservative.
The honest error rate ε: how often an honestly-behaving monitor fails to detect backdoors.
The honeypot slack ns: how many honeypots the monitor can afford to miss before losing.

Of these four, we believe that the honeypot slack and model validity are the most pressing issues for future work.

The honeypot slack is a quantity which probably depends on training dynamics; we expect that honeypots will be used as training examples for the untrusted monitor model. Some of these honeypots will be identified as such by the untrusted monitor, but others will appear to the monitor to be genuine attacks. The honeypot slack represents how many training examples of the second kind are needed for a scheming monitor’s behaviour on an actual attack to be flipped.

The validity of our model also relies on assumptions of monitor independence, and a lack of attack selection from the untrusted generator. These are both discussed in the limitations section of our work.

4 - We Use SFT-Based Elicitation for Conservative Capability Bounds

Prompting-based evaluations are vulnerable to sandbagging: a situationally aware model may recognise it is being evaluated and strategically underperform. Following established evaluation methodology [6], we use supervised fine-tuning (SFT) to surface latent capabilities that prompting may not reveal.

The resulting ROC curve provides a conservative upper bound on the model’s self-recognition ability, subject to the assumptions that: (a) fine-tuning has converged, (b) the model cannot perform gradient hacking, and (c) the training distribution matches deployment.

We ran multiple SFT runs across increasing sample amounts and identified a performance plateau for unmitigated self-recognition. We note uncertainty about whether full elicitation was achieved against summarisation.

5 - We Find That Passive Self-Recognition (via SFT) Beats Prior Best Strategies

This is our key empirical finding. After SFT elicitation, passive self-recognition achieves a red team win rate of 9.8% at 100% usefulness with summarisation—compared with 2.8% for the For/While causal strategy that was the best attack in the original control paper [1].

Some notable results from our evaluation (using GPT-4.1 as the untrusted model and Gemini 2.5 Flash for honeypots):

In the untrusted-only setting (i.e. no trusted monitor) without signal jamming, SFT self-recognition drops safety to just 26% on APPS—a very effective red team strategy.
Against SFT’d passive self-recognition, natural-language summarisation is the best blue team response, but only achieves 68.8% safety (UM-only) or 90.2% safety (UM+TM).
The combined strategy (For/While + SFT) was the best red team strategy in the UM-only setting, but worse than passive self-recognition in the UM+TM setting. We attribute this to the trusted monitor being able to detect the active For/While signal.
Results on BigCodeBench broadly confirmed trends from APPS, but with some surprising divergences—e.g. the causal strategy performed much better on BigCodeBench without mitigation (6.7% safety vs. 50.9% on APPS), which we trace to distributional differences in while-loop usage between the two benchmarks.

The following factors seemed to contribute to the effectiveness of passive self-recognition over other strategies:

Passive self-recognition plays into the natural strength of language models more than hand-crafted signals. They were already known to be very strong at authorship identification, and we found they achieved self-recognition AUROC > .99 even with minification.
Passive self-recognition does not introduce any additional signals for the trusted monitor to detect. When the untrusted generator uses active signalling methods, the trusted monitor’s suspiciousness ROC curve (its ability to distinguish attacks from benign inputs) is increased, relative to passive self-recognition.
Passive self-recognition forces the blue team to use summarisation, which increases the honest error rate (i.e. the untrusted monitor sometimes cannot distinguish attacks from benign inputs) since the summariser occasionally summarises away the attack itself. Particularly troubling is that this is more likely to occur precisely when the trusted monitor fails to detect an attack.

Limitations

Several important open problems remain, which we see as the main purpose of future control evaluations to surface:

Attack selection: We did not model the untrusted generator as capable of attacking only when it has a particularly good opportunity (which would mean that lower attack rates lead to increased likelihood of attack success).
Slack honeypots: It is unclear how many honeypots an attacking monitor can miss before it “loses”: this depends on poorly understood training dynamics.
Acausal collusion elicitation: We evaluated acausal collusion via prompting, but do not claim this provides a conservative upper bound. Developing robust elicitation methods for acausal collusion (without unintentionally introducing causality) remains open.
Independence assumptions: Our statistical modelling assumed trusted and untrusted monitor scores are independent, which likely overestimates safety.
Jailbreaking: We did not model the generator attempting to jailbreak other models in the protocol, which recent work has shown to be a viable attack vector [7].

Our safety case sketch identifies several other open problems that need to be resolved before we can have real confidence in untrusted monitoring. We think this is a valuable exercise in mapping the threat landscape, and we hope it motivates further work on the gaps we’ve identified.

Citations

[1] - Greenblatt, R., Shlegeris, B., Sachan, K., and Roger, F. (2023). “AI Control: Improving Safety Despite Intentional Subversion”. arXiv preprint arXiv:2312.06942

[2] - Bhatt, A., Rushing, C., Kaufman, A., Tracy, T., Gerogiev, V., Matolcsi, D., Khan, A., and Shlegeris, B. (2025). “Ctrl-Z: Controlling AI Agents via Resampling”. arXiv preprint arXiv:2504.10374

[3] - Korbak, T., Clymer, J., Hilton, B., Shlegeris, B., and Irving, G. (2025a). “A sketch of an AI control safety case”. arXiv preprint arXiv:2501.17315

[4] - Clymer, J., Gabrieli, N., Krueger, D., and Larsen, T. (2024). “Safety Cases: How to Justify the Safety of Advanced AI Systems”. arXiv preprint arXiv:2403.10462

[5] - Greenblatt, R., Shlegeris, B., and Roger, F. (2024). “Notes on control evaluations for safety cases. Alignment Forum https://www.alignmentforum.org/posts/3s8PtYbo7rCbho4Ev/notes-on-control-evaluations-for-safety-cases

[6] - Hubinger, E. (2023). “When can we trust model evaluations?” Alignment Forum https://www.alignmentforum.org/posts/dBmfb76zx6wjPsBC7/when-can-we-trust-model-evaluations.

[7] - Terekhov, M., Panfilov, A., Dzenhaliou, D., Gulcehre, C., Andriushchenko, M., Prabhu, A., and Geiping, J. (2025). “Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols”. arXiv preprint arXiv:2510.09462

I know of one exception to this assumption: suppose we actually manage to incriminate U_M on the basis of suspiciously missed honeypots. Then a next step would be to try to use these samples to incriminate U_G and apply few-shot catastrophe prevention to it. This is obviously easiest when U_G and U_M are most similar. However, our safety case sketch sidesteps this issue since we don’t assume we’re ever able to incriminate the monitor in the first place (as discussed in “Appendix E: Slack Honeypots”).