The paper seems to rely on the monitor’s ability being close to the generator’s.
Not really. An untrusted monitoring protocol like this works with any two models, or even any ensemble of models. You could even run an inter-lab untrusted monitoring setup where e.g. GPT-8 monitors Claude 7 and vice versa to achieve the benefits of monitor decoupling. Obviously if the monitor is much weaker than the generator, you’d have to evaluate up front how much its performance suffers, but you need to do this with trusted monitoring as well.
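To make the point concrete, here is a minimal sketch of the deployment-time loop, assuming only that you can call some generator and some monitor; nothing in it requires the two to share a base model or a lab. The callables and the threshold are hypothetical stand-ins for whatever the control team actually uses, not the paper’s implementation:

```python
from typing import Callable, List, Tuple


def run_untrusted_monitoring(
    tasks: List[str],
    generate: Callable[[str], str],                # UG: task -> proposed solution
    score_suspicion: Callable[[str, str], float],  # UM: (task, solution) -> suspicion score
    audit_threshold: float,
) -> Tuple[list, list]:
    """Route each solution to human audit iff UM's suspicion score clears the threshold."""
    flagged, accepted = [], []
    for task in tasks:
        solution = generate(task)
        score = score_suspicion(task, solution)
        bucket = flagged if score >= audit_threshold else accepted
        bucket.append((task, solution, score))
    return flagged, accepted
```

The only coupling between the two models is the suspicion-score interface, which is why swapping in a weaker or differently trained monitor is a measurement problem rather than a change to the protocol.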
If you mean our safety case sketch doesn’t handle this, I would disagree! Claim C1 lower-bounds how good UM is at catching UG’s backdoors, if UM is trying its hardest. Your evals need to handle sandbagging from UG, and there are known countermeasures for this general type of situation, but we left this as a “Missing Piece” because we didn’t get into the empirical weeds of this ourselves.
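As a rough illustration of how C1-style evidence might be summarized numerically (assuming you’ve already elicited UM’s best effort on a set of red-team backdoors), you’d report a conservative lower confidence bound on the catch rate rather than the point estimate. The interval choice and the numbers below are illustrative, not from the paper:

```python
import math


def wilson_lower_bound(successes: int, trials: int, z: float = 1.645) -> float:
    """Approximate one-sided 95% lower bound on a binomial proportion (Wilson score)."""
    if trials == 0:
        return 0.0
    p_hat = successes / trials
    denom = 1 + z ** 2 / trials
    centre = p_hat + z ** 2 / (2 * trials)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / trials + z ** 2 / (4 * trials ** 2))
    return (centre - margin) / denom


# e.g. UM, pushed to try its hardest, flags 88 of 100 red-team backdoors:
print(f"catch rate >= {wilson_lower_bound(88, 100):.2f}")  # ~0.82
```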
Besides capabilities, UG and UM could also differ in their goals or coding style, but this almost always helps the control team: models with different goals would be less likely to collude, and different styles would make collusion harder to pull off. So if we do have models of similar capabilities, the case where UG = UM is the most conservative, followed by cases where they’re closely related by fine-tuning or distillation.[1]
Has anyone tried the obvious eval here: freeze UM at checkpoint i, continue improving UG to checkpoint j > i, and then measure how the collusion and suspiciousness ROC curves shift with j - i?
I haven’t seen this, but in practice if UG (or UM) changes enough to plausibly shift the ROC curves meaningfully, you’d need to rerun your eval suite regardless. If it’s just a small fine-tuning step, the capability change seems unlikely to be substantial enough to matter, and the control team could spot-check this with cheap subsamples of their evals. Studying the scaling-law question (e.g. how elastic the collusion/suspiciousness ROC curves are to a given improvement in the underlying model’s capabilities) might be interesting for the basic science of control, e.g. for informing protocol design, but once the protocols are fixed the safety case just relies on what your evals say.
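For what it’s worth, the bookkeeping for the eval you describe is simple; the hard part is eliciting good backdoor attempts at each checkpoint. A rough sketch, where `score` and the checkpoint handles are hypothetical placeholders and only the AUROC computation is concrete:

```python
from typing import Callable, Dict, List, Sequence


def auroc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """P(a backdoored sample scores above a clean one), with ties counted as half."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


def checkpoint_gap_curve(
    frozen_um,
    ug_checkpoints: Sequence,
    backdoor_tasks: List[str],
    clean_tasks: List[str],
    score: Callable,  # (um, ug, task, backdoored) -> UM's suspicion score on UG's output
) -> Dict[int, float]:
    """AUROC of the frozen monitor against each successive generator checkpoint."""
    return {
        j: auroc(
            [score(frozen_um, ug, t, True) for t in backdoor_tasks],
            [score(frozen_um, ug, t, False) for t in clean_tasks],
        )
        for j, ug in enumerate(ug_checkpoints)
    }
```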
I know of one exception to this assumption: suppose we actually manage to incriminate UM on the basis of suspiciously missed honeypots. Then a next step would be to try to use these samples to incriminate UG and apply few-shot catastrophe prevention to it. This is obviously easiest when UG and UM are most similar. However, our safety case sketch sidesteps this issue since we don’t assume we’re ever able to incriminate the monitor in the first place (as discussed in “Appendix E: Slack Honeypots”).
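If you did want to go down that road, the simplest version of “suspiciously missed honeypots” is just a one-sided binomial test against the catch rate your evals established. A sketch with illustrative numbers (the escalation threshold would be a policy decision, not something the paper specifies):

```python
from math import comb


def p_value_missed_honeypots(misses: int, honeypots: int, p_catch: float) -> float:
    """P(at least `misses` misses) if UM genuinely catches each honeypot w.p. p_catch."""
    p_miss = 1.0 - p_catch
    return sum(
        comb(honeypots, k) * p_miss ** k * p_catch ** (honeypots - k)
        for k in range(misses, honeypots + 1)
    )


# e.g. evals established p_catch >= 0.9 and UM missed 6 of 20 deployed honeypots:
print(f"p = {p_value_missed_honeypots(6, 20, 0.9):.4f}")  # small p => grounds to scrutinize UM
```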