While I think the conceptual point roughly makes sense, and it might in theory be possible to train against a monitor, I’d have a really strong prior against doing that in practice by default:
1. Most importantly, you're also killing a relevant measurement device by doing this. Maybe it's acceptable if you have at least one (and probably several) other measurement devices that you're confident in AND that are orthogonal to the monitor you're training against (see the sketch after this list). But in practice we don't have these (e.g. probes and interp just aren't good enough at the moment).
1a. Relatedly, I think your process for estimating R in footnote 2 implies that we currently cannot estimate R, because I don't expect we can find enough uncorrelated monitors right now.
2. We don't currently understand how training against a monitor changes these probabilities and distributions, so it's hard to say when we've succeeded and when we haven't.
3. I think there are institutional incentives to claim success whether or not you've actually succeeded, and it's much harder to prove that you haven't succeeded once you've made the most obvious measurement device useless (i.e. the CoT looks nice but is no longer informative).
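To make points 1 and 1a concrete, here's a minimal, hypothetical sketch (not from the post; the monitors, the shared "legibility" latent, and the crude way I model "training against monitor A" are all my own assumptions). The point it illustrates: a second monitor only gives you independent evidence if its failure modes are roughly uncorrelated with the monitor you train against. If both monitors key off the same underlying signal (e.g. how legible the bad behavior is in the CoT), then optimizing away what monitor A flags also guts monitor B's catch rate.

```python
# Hypothetical illustration: why a second monitor only helps if its failure
# modes are uncorrelated with the monitor you train against.
# Assumptions: each bad episode has a shared latent "legibility" score; each
# monitor fires when legibility plus its own noise crosses a threshold.
# "Training against A" is modeled crudely as keeping only the episodes that
# monitor A fails to flag.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Shared latent: how visible the bad behavior is (e.g. in the CoT).
legibility = rng.normal(size=n)

def monitor(latent, noise_scale):
    # Small noise_scale => the monitor is mostly a function of the shared
    # latent, i.e. highly correlated with any other such monitor.
    return latent + noise_scale * rng.normal(size=n) > 1.0

for noise_scale in (0.1, 1.0, 3.0):
    a = monitor(legibility, noise_scale)
    b = monitor(legibility, noise_scale)
    corr = np.corrcoef(a.astype(float), b.astype(float))[0, 1]
    survivors = ~a                     # episodes A no longer flags
    residual_b = b[survivors].mean()   # how often B still catches them
    print(f"noise={noise_scale}: corr(A,B)={corr:.2f}, "
          f"B's catch rate on A-evading episodes={residual_b:.3f}")
```

In the high-correlation case, B's residual catch rate collapses, which is the sense in which training against A also degrades your remaining measurement devices unless they're genuinely orthogonal.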
So for now my default recommendation is to try to keep the monitors as far from training as possible.
I don't think it's that crazy to train against examples surfaced by a scheming monitor. For example, for diffuse threats (ones that happen often and aren't very harmful individually, e.g. sandbagging), async monitoring plus training against the flagged examples seems like a reasonable response. There's not much value in continuing to see the same failures, and there are good reasons to get rid of them, e.g. being able to use the model more confidently for safety research. So I'm not sure that always keeping monitors separate from training is practical, or correct.
My hope for the framework is that, even though it isn’t fully computable, it at least points out the quantities and considerations that are relevant for making the train / don’t train decision. (Which by default would likely be made in a less principled way.)