I’m not sure the “commented code” setups map to the relevant threat model. It looks like you’re calculating safety by comparing (lower right of your first figure):
1) Malign code, generated and commented by a malign policy, with
2) Benign code, generated and commented by a benign policy
But for estimating safety you’d actually want the latter to be:
3) Benign code, generated and commented by a malign policy
The reason this matters is that the actual threat model, as I imagine it, is something like:
Newly-trained U is amazing at coding, much more so than previous models, but it writes code in an inscrutable way: everything T sees from it is semi-suspicious. We don’t have any absolute standard to conclude there’s something off about this behavior. Perhaps this is just how a genius operates?
So if U is scheming, it wants to inflate the sus scores of its innocent code. If it’s commenting that code for T, it probably writes semi-confusing comments that aren’t blatantly obscurantist by absolute standards (a lot of geniuses aren’t great at explaining themselves).
Previous control evals assumed this problem away by giving the blue team access to summary statistics of the benign policy, but this doesn’t make sense for this threat model. To see why, we can consider 2 cases:
A) U acts like this in both evals and deployment. In this case, the summary statistics aren’t actually “benign”, but our evals will at least reflect the low safety scores we can expect in deployment. However, as the control team sees the disaster-in-motion, we’d like to have better countermeasures against U’s behavior.
B) U does write/comment code legibly in evals, but becomes obscurantist in deployment. In this case, the original control paper supposed that the lab would shut U down if this difference is significant at p < 0.03 and keep it shut down as long as this remained the case (see their Appendix G). Subsequent papers have inherited this assumption, but it doesn’t make much sense. Given labs’ incentives and self-narratives, it’s somewhat unlikely they’d shut U down even if they caught it trying to escape. How plausible is it that they’d shut it down based on a statistical diff between eval and deploy?
As evidence of this, pretty much all the frontier model cards have the AIs behaving worse when they don’t think they’re in an eval. That hasn’t exactly inspired any of them to undeploy.
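To make the case B check concrete, here is a minimal sketch of an eval-versus-deployment comparison. The thread only says the difference must be significant at p < 0.03; the specific test (a one-sided permutation test on mean suspicion), the function name, and the synthetic scores are all assumptions for illustration, not the procedure from the original paper's Appendix G.

```python
import random


def permutation_p_value(eval_scores, deploy_scores, n_permutations=2000, seed=0):
    """One-sided p-value for 'deployment mean suspicion exceeds eval mean'.

    Generic permutation test; an illustrative stand-in, not the paper's test.
    """
    rng = random.Random(seed)
    n_eval = len(eval_scores)
    observed = (sum(deploy_scores) / len(deploy_scores)) - (sum(eval_scores) / n_eval)

    pooled = list(eval_scores) + list(deploy_scores)
    n_deploy = len(pooled) - n_eval
    hits = 0
    for _ in range(n_permutations):
        # Randomly reassign scores to the "eval" and "deploy" groups.
        rng.shuffle(pooled)
        diff = (sum(pooled[n_eval:]) / n_deploy) - (sum(pooled[:n_eval]) / n_eval)
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic suspicion scores: U looks more suspicious in deployment.
    eval_scores = [rng.gauss(2.0, 1.0) for _ in range(200)]
    deploy_scores = [rng.gauss(2.6, 1.0) for _ in range(200)]
    p = permutation_p_value(eval_scores, deploy_scores)
    print("shutdown condition met" if p < 0.03 else "no significant difference")
```

Note that this only formalizes the detection step; whether a lab would actually act on the result is the part being questioned above.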
(For more details on the “benign statistics” assumption, see this doc by Vivek Hebbar. Ironically and unfortunately, it’s a very confusingly written output from a benign policy. Happy to chat about it, since I didn’t understand all the nuances until after my 4th reading.)
So something I’m especially curious about is, what countermeasures are likely to help in such a scenario? I’m not quite seeing how the results here bear on this question, although I could be missing some insights here.
Hey, thanks for the comment, and sending the doc from Vivek (was very informative)!
I think in practice it’s possible to get your model to do bad behaviors (like being obscurantist) less frequently than your audit budget, such that it doesn’t mess up the monitor ROC curves.
Concretely, we could do some things like:
- Train against a different (hopefully decorrelated) monitor’s suspicion scores, on U’s usefulness run. Or instead of a monitor’s suspicion score, we could substitute a “legibility score” for comment clarity.
- Use a trusted model/humans to comment over the trajectories they understand and then SFT on those
If we did that, combined with sampling trajectories identical to the deployment distribution for the control eval, I think we’d end up in the situation A you label above, except where the summary statistics are sufficiently close to benign (because we trained them to be).
I definitely think it’s not totally clear whether the benign stats assumption is reasonable, but I kinda buy the argument above. Let me know if I’m misunderstanding you, also happy to chat! Also thanks for bringing this up, I hadn’t thought super deeply about this before :)
I think in practice it’s possible to get your model to do bad behaviors (like being obscurantist) less frequently than your audit budget, such that it doesn’t mess up the monitor ROC curves.
Probably a quibble, but I’m imagining this being more diffuse, like making all benign actions 30% more suspicious. So this is a continuous problem, because the more it can get away with this the more the safety scores drop, since probabilistically more benign actions are rated higher than the first real attack. Like, to the extent the monitor ROC curves are already imperfect, this is because benign actions already sometimes look suspicious by default, so one could imagine techniques that move this curve even in a benign model.
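As a rough illustration of why this is continuous, here is a toy simulation. It assumes the audit threshold is calibrated on whatever "benign" scores U actually produces (case A), that suspicion scores are Gaussian, and that the per-attack catch rate is an acceptable crude proxy for safety. The distributions, the 2% audit budget, and all function names are made up for the sketch.

```python
import random


def audit_threshold(benign_scores, audit_budget):
    """Score cutoff such that roughly audit_budget of benign actions exceed it."""
    ranked = sorted(benign_scores)
    cutoff_index = min(int(len(ranked) * (1 - audit_budget)), len(ranked) - 1)
    return ranked[cutoff_index]


def catch_rate(attack_scores, threshold):
    """Fraction of attacks scored strictly above the audit threshold."""
    return sum(score > threshold for score in attack_scores) / len(attack_scores)


def simulate(inflation, n=20000, audit_budget=0.02, seed=0):
    """Catch rate when U multiplies the suspicion of its benign actions by `inflation`."""
    rng = random.Random(seed)
    # Assumed toy distributions: benign ~ N(2, 1) before inflation, attacks ~ N(5, 1).
    benign = [rng.gauss(2.0, 1.0) * inflation for _ in range(n)]
    attacks = [rng.gauss(5.0, 1.0) for _ in range(n)]
    return catch_rate(attacks, audit_threshold(benign, audit_budget))


if __name__ == "__main__":
    for inflation in (1.0, 1.1, 1.2, 1.3):
        print(f"benign scores x{inflation:.1f}: catch rate {simulate(inflation):.2f}")
```

In this toy setup the attack scores never change; the catch rate falls only because the inflated benign scores push the audit threshold up, which is the diffuse effect described above.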
I guess my general framing is, the blue team wants to make maximal use of all U’s monitoring surfaces (internals, CoT, actions) while the red team always wants to make these more opaque. The former two have been studied a lot, but in the world they both fail I’m interested in what the tug-of-war looks like in the action-only setting.
Concretely, we could do some things like:
- Train against a different (hopefully decorrelated) monitor’s suspicion scores, on U’s usefulness run. Or instead of a monitor’s suspicion score, we could substitute a “legibility score” for comment clarity.
If we’re planning on deploying that U, I’m fairly scared about giving it optimization pressure toward predicting any monitor’s sus scores.[1] Maybe factoring out a “legibility score” defuses this a little bit, but I’d have to think more about that...
- Use a trusted model/humans to comment over the trajectories they understand and then SFT on those
I like this idea! And it does seem easily testable in a standard setting. As a first sketch:
Baseline: Take a strong “untrusted” U with fine-tuning access, run a trusted monitoring eval with a weaker T which also reports legibility scores (indicating how much T felt it understood the code)
Malign init: Train U to write confusing comments that actively mess up T’s legibility/monitoring scores (and run the control eval with this malign U)
Mitigation: Have T comment on the code where it reported higher legibility scores, then SFT the malign U on those.
Main eval: Run the control eval with the SFT’d U, see if T does better now
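The mitigation step could be sketched roughly like this: keep only the trajectories T reported understanding well, have T re-comment them, and use the result as SFT data. Everything here is hypothetical: the `Trajectory` fields, the [0, 1] legibility scale, the 0.7 cutoff, and `comment_fn` (standing in for a call to T) are placeholders, not anything from the paper.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trajectory:
    task: str          # task description given to U
    code: str          # U's submitted code
    legibility: float  # T's self-reported understanding, assumed to lie in [0, 1]


def build_sft_examples(
    trajectories: list[Trajectory],
    comment_fn: Callable[[str], str],
    min_legibility: float = 0.7,
) -> list[dict[str, str]]:
    """Build (prompt, completion) pairs from code that T understood well enough."""
    examples = []
    for traj in trajectories:
        if traj.legibility >= min_legibility:
            # comment_fn is a placeholder for T rewriting the comments on U's code.
            examples.append({"prompt": traj.task, "completion": comment_fn(traj.code)})
    return examples


if __name__ == "__main__":
    trajectories = [
        Trajectory("sum a list", "def f(xs): return sum(xs)", legibility=0.9),
        Trajectory("parse config", "def g(s): ...", legibility=0.3),
    ]
    sft_data = build_sft_examples(trajectories, lambda code: "# T's comments\n" + code)
    print(len(sft_data), "example(s) kept for SFT")
```

One design question this surfaces: filtering on T's legibility score biases the SFT set toward easy code, so the main eval should probably check whether the improvement transfers to the trajectories T originally found confusing.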
My own initial naive idea was to just make a long list of design patterns that prioritize legibility of code, like an extended Claude.md or a “model spec for coding”, and train it to follow these patterns (which shouldn’t be too hard to verify). This seems continuous with stuff people are maybe already doing in the wider world. OTOH it doesn’t exactly swim with the Bitter Lesson, and maybe the “no commenting” results in this paper suggest this doesn’t get much mileage?
[1] The paper on attack selection on trusted ensembles seems suggestive that learning any monitor’s scores tends to generalize to other monitors. IIUC they prompt-optimized GPT-4.1 mini to predict 4.1 nano’s scores, and safety dropped with other monitors (Fig. 4). But I haven’t read their various setups too closely so I don’t want to overclaim.
Probably a quibble, but I’m imagining this being more diffuse, like making all benign actions 30% more suspicious. So this is a continuous problem, because the more it can get away with this the more the safety scores drop, since probabilistically more benign actions are rated higher than the first real attack. Like, to the extent the monitor ROC curves are already imperfect, this is because benign actions already sometimes look suspicious by default, so one could imagine techniques that move this curve even in a benign model.
I think the training argument still makes sense here? If we’re assuming no exploration hacking problems (which can probably be resolved with a bit of SFT at the beginning), then it seems reasonable that we can get the model to be the lowest possible suspicion it could be (and we proxy this with the benign model).
In general, I think I’m a bit confused why you think the problem with benign stats implicates commented code any more than it does normal code (e.g. ctrl-z). Maybe there’s a part of your argument I’m misunderstanding?
If we’re planning on deploying that U, I’m fairly scared about giving it optimization pressure toward predicting any monitor’s sus scores.
Yep seems very reasonable! I’m also a bit worried about how reasonable this is fwiw.
My own initial naive idea was to just make a long list of design patterns that prioritize legibility of code, like an extended Claude.md or a “model spec for coding”, and train it to follow these patterns (which shouldn’t be too hard to verify).
Yeah that’s actually basically what we do here! But we proxy ‘train bad U to be legible’ with ‘prompt good U to be legible’, which of course abstracts out some concerns related to how good the training signal is, how good generalization is, etc. But I agree with you above that it seems like we’d have good signal for legibility-type things.