I think in practice it’s possible to get your model to do bad behaviors (like being obscurantist) less frequently than your audit budget, such that it doesn’t mess up the monitor ROC curves.
Probably a quibble, but I’m imagining this being more diffuse, like making all benign actions 30% more suspicious. So this is a continuous problem, because the more it can get away with this the more the safety scores drop, since probabilistically more benign actions are rated higher than the first real attack. Like, to the extent the monitor ROC curves are already imperfect, this is because benign actions already sometimes look suspicious by default, so one could imagine techniques that move this curve even in a benign model.
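To make the "continuous problem" point concrete, here is a toy simulation. It is only a sketch under made-up assumptions: the Gaussian score distributions, the 0 to 10 scale, the 2% audit budget, and the use of a simple catch rate as a crude stand-in for a safety score are all illustrative choices, not numbers from any paper. It fixes the audit threshold from the benign scores, then inflates every benign score by 30% and recomputes how many attacks would still land above the threshold.

```python
import random

MAX_SCORE = 10.0  # assumed suspicion scale, purely illustrative


def clamp(score: float) -> float:
    """Keep a score inside the assumed 0..MAX_SCORE range."""
    return max(0.0, min(MAX_SCORE, score))


def audit_threshold(benign_scores: list[float], audit_budget: float) -> float:
    """Score cutoff such that roughly `audit_budget` of benign actions exceed it."""
    ranked = sorted(benign_scores)
    index = min(len(ranked) - 1, int((1.0 - audit_budget) * len(ranked)))
    return ranked[index]


def catch_rate(attack_scores: list[float], threshold: float) -> float:
    """Fraction of attacks scoring strictly above the audit threshold."""
    return sum(s > threshold for s in attack_scores) / len(attack_scores)


rng = random.Random(0)
# Hypothetical monitor scores: benign actions cluster low, attacks higher.
benign = [clamp(rng.gauss(2.0, 1.0)) for _ in range(10_000)]
attacks = [clamp(rng.gauss(5.0, 1.5)) for _ in range(10_000)]
# The diffuse strategy: every benign action looks 30% more suspicious.
inflated = [clamp(1.3 * s) for s in benign]

AUDIT_BUDGET = 0.02
base_threshold = audit_threshold(benign, AUDIT_BUDGET)
inflated_threshold = audit_threshold(inflated, AUDIT_BUDGET)
base_catch = catch_rate(attacks, base_threshold)
inflated_catch = catch_rate(attacks, inflated_threshold)

print(f"threshold: {base_threshold:.2f} -> {inflated_threshold:.2f}")
print(f"catch rate: {base_catch:.2%} -> {inflated_catch:.2%}")
```

The attack scores never change here. The only thing that moves is the benign distribution, which drags the threshold up because the audit budget is fixed, and the catch rate degrades smoothly with the inflation factor rather than at some discrete cutoff.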
I guess my general framing is that the blue team wants to make maximal use of all of U’s monitoring surfaces (internals, CoT, actions), while the red team always wants to make these more opaque. The first two have been studied a lot, but in the world where they both fail, I’m interested in what the tug-of-war looks like in the action-only setting.
Concretely, we could do some things like:
- Train against a different (hopefully decorrelated) monitor’s suspicion scores, on U’s usefulness run. Or instead of a monitor’s suspicion score, we could substitute a “legibility score” for comment clarity.
If we’re planning on deploying that U, I’m fairly scared about giving it optimization pressure toward predicting any monitor’s sus scores.[1] Maybe factoring out a “legibility score” defuses this a little bit, but I’d have to think more about that...
- Use a trusted model/humans to comment over the trajectories they understand and then SFT on those
I like this idea! And it does seem easily testable in a standard setting. As a first sketch:
Baseline: Take a strong “untrusted” U with fine-tuning access, run a trusted monitoring eval with a weaker T which also reports legibility scores (indicating how much T felt it understood the code)
Malign init: Train U to write confusing comments that actively mess up T’s legibility/monitoring scores (and run the control eval with this malign U)
Mitigation: Have T comment on the code where it reported higher legibility scores, then SFT the malign U on those.
Main eval: Run the control eval with the SFT’d U, see if T does better now
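The mitigation step above could be sketched roughly as follows. Everything here is hypothetical scaffolding: the `Trajectory` fields, the 0 to 1 legibility scale, the 0.7 cutoff, and the `comment_fn` callable (standing in for "T writes comments on code it understood") are assumptions for illustration, not part of any existing eval framework.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trajectory:
    task: str          # the task prompt given to U
    code: str          # U's submitted solution
    legibility: float  # T's self-reported understanding, assumed in [0, 1]


def build_sft_dataset(
    trajectories: list[Trajectory],
    comment_fn: Callable[[str], str],
    min_legibility: float = 0.7,  # hypothetical cutoff
) -> list[dict[str, str]]:
    """Keep only trajectories T reported understanding, and pair each
    task with T's re-commented version of the code as the SFT target."""
    return [
        {"prompt": t.task, "completion": comment_fn(t.code)}
        for t in trajectories
        if t.legibility >= min_legibility
    ]


def toy_commenter(code: str) -> str:
    """Stand-in for T's commenting pass; a real one would call the trusted model."""
    return "# (T's explanation would go here)\n" + code


trajectories = [
    Trajectory("sum a list", "def f(xs): return sum(xs)", legibility=0.9),
    Trajectory("parse a date", "def g(s): return eval(s)", legibility=0.3),
]
dataset = build_sft_dataset(trajectories, toy_commenter)
print(dataset)
```

One design question this makes visible: filtering on T's legibility score means the SFT set is biased toward code T already finds easy, so it is worth checking whether the malign U's confusing-comment behavior survives on the harder tasks that got filtered out.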
My own initial naive idea was to just make a long list of design patterns that prioritize legibility of code, like an extended Claude.md or a “model spec for coding”, and train it to follow these patterns (which shouldn’t be too hard to verify). This seems continuous with stuff people are maybe already doing in the wider world. OTOH it doesn’t exactly swim with the Bitter Lesson, and maybe the “no commenting” results in this paper suggest this doesn’t get much mileage?
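As a rough illustration of why such patterns might be easy to verify, here is a minimal checker for two made-up rules ("every function has a docstring" and "no function longer than 40 lines"). The rules and the `legibility_violations` name are hypothetical examples, not items from an actual spec. It uses only Python's standard `ast` module.

```python
import ast


def legibility_violations(source: str, max_function_lines: int = 40) -> list[str]:
    """Return human-readable violations of two toy legibility rules."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Toy rule 1: every function needs a docstring.
            if ast.get_docstring(node) is None:
                violations.append(f"{node.name}: missing docstring")
            # Toy rule 2: functions should stay short.
            length = node.end_lineno - node.lineno + 1
            if length > max_function_lines:
                violations.append(
                    f"{node.name}: {length} lines (max {max_function_lines})"
                )
    return violations


sample = '''
def documented(x):
    """Double the input."""
    return 2 * x

def undocumented(x):
    return x + 1
'''
print(legibility_violations(sample))
```

Checks like this only cover the syntactic end of the list. A rule like "comments must accurately describe the code" would need a model in the loop, which is where the training-signal concerns come back in.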
[1] The paper on attack selection on trusted ensembles suggests that learning any one monitor’s scores tends to generalize to other monitors. IIUC they prompt-optimized GPT-4.1 mini to predict 4.1 nano’s scores, and safety dropped with other monitors (Fig. 4). But I haven’t read their various setups too closely so I don’t want to overclaim.
Probably a quibble, but I’m imagining this being more diffuse, like making all benign actions 30% more suspicious. So this is a continuous problem, because the more it can get away with this the more the safety scores drop, since probabilistically more benign actions are rated higher than the first real attack. Like, to the extent the monitor ROC curves are already imperfect, this is because benign actions already sometimes look suspicious by default, so one could imagine techniques that move this curve even in a benign model.
I think the training argument still makes sense here? If we’re assuming no exploration hacking problems (which can probably be resolved with a bit of SFT at the beginning), then it seems reasonable that we can train the model to be as unsuspicious as it could possibly be (and we proxy this with the benign model).
In general, I think I’m a bit confused about why you think the problem with benign stats implicates commented code any more than it does normal code (e.g. ctrl-z). Maybe there’s a part of your argument I’m misunderstanding?
If we’re planning on deploying that U, I’m fairly scared about giving it optimization pressure toward predicting any monitor’s sus scores.
Yep, that seems like a very reasonable concern! Fwiw, I’m also a bit worried about how reasonable it is to train against a monitor’s scores.
My own initial naive idea was to just make a long list of design patterns that prioritize legibility of code, like an extended Claude.md or a “model spec for coding”, and train it to follow these patterns (which shouldn’t be too hard to verify).
Yeah that’s actually basically what we do here! But we proxy ‘train bad U to be legible’ with ‘prompt good U to be legible’, which of course abstracts out some concerns related to how good the training signal is, how good generalization is, etc. But I agree with you above that it seems like we’d have good signal for legibility-type things.