Probably a quibble, but I’m imagining this being more diffuse, like making all benign actions 30% more suspicious. That makes it a continuous problem: the more the model can get away with this, the further the safety scores drop, since, probabilistically, more benign actions get rated above the first real attack. To the extent the monitor ROC curves are already imperfect, it’s because benign actions sometimes look suspicious by default, so one could imagine techniques that move this curve even in a benign model.
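To make the "continuous problem" point concrete, here is a toy sketch, not anything from the actual setup. It assumes suspicion scores are Gaussian, that the audit threshold is set to flag a fixed fraction of benign actions, and that "more suspicious" means an additive shift of the benign mean. The function name, the 2% budget, and the attack mean are all made up for illustration.

```python
from statistics import NormalDist


def catch_rate(benign_shift, audit_budget=0.02, attack_mean=2.0):
    """Probability that an attack scores above the audit threshold.

    Toy assumptions (hypothetical, not from the thread):
    - benign suspicion scores ~ Normal(benign_shift, 1)
    - attack suspicion scores ~ Normal(attack_mean, 1)
    - the threshold is chosen so that a fraction `audit_budget`
      of benign actions gets audited
    """
    benign = NormalDist(mu=benign_shift, sigma=1.0)
    attack = NormalDist(mu=attack_mean, sigma=1.0)
    # Threshold that flags exactly `audit_budget` of benign actions.
    threshold = benign.inv_cdf(1.0 - audit_budget)
    # Chance that a single attack lands above that threshold.
    return 1.0 - attack.cdf(threshold)


shifts = [0.0, 0.3, 0.6, 0.9]
rates = [catch_rate(s) for s in shifts]
for s, r in zip(shifts, rates):
    print(f"benign shift {s:.1f}: attack caught with probability {r:.3f}")
```

Under these assumptions the catch rate falls smoothly as the benign distribution shifts up, with no particular cliff, which is the sense in which the problem is continuous rather than all-or-nothing.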
I think the training argument still makes sense here? If we’re assuming no exploration hacking problems (which can probably be resolved with a bit of SFT at the beginning), then it seems reasonable that we can train the model down to the lowest suspicion it could achieve (and we proxy this with the benign model). In general, I think I’m a bit confused about why you think the problem with benign stats implicates commented code any more than it does normal code (e.g. ctrl-z). Maybe there’s a part of your argument I’m misunderstanding?
If we’re planning on deploying that U, I’m fairly scared about giving it optimization pressure toward predicting any monitor’s sus scores.
Yep seems very reasonable! I’m also a bit worried about how reasonable this is fwiw.
My own initial naive idea was to just make a long list of design patterns that prioritize legibility of code, like an extended Claude.md or a “model spec for coding”, and train it to follow these patterns (which shouldn’t be too hard to verify).
Yeah that’s actually basically what we do here! But we proxy ‘train bad U to be legible’ with ‘prompt good U to be legible’, which of course abstracts out some concerns related to how good the training signal is, how good generalization is, etc. But I agree with you above that it seems like we’d have good signal for legibility-type things.