I think it’s not obvious that probes will be more reliable than trusted monitors for most tasks you care about, and I think it will be hard to evaluate, so I expect that you will probably want to combine white-box monitors and weaker trusted monitors. In particular, I think it’s not obvious with current evidence that you could get good probes for things that are very hard to get good trusted monitors for (e.g. sandbagging). I don’t think it’s hopeless, and I am excited for future work studying this direction (even though building analogous model organisms is rough here).
Nit: I usually think some of the effect of adversarial training can be captured by loss evaluations—the main reason why adversarial training helps is the same reason why behavioral red-teaming might help (models can’t fit the conditional policy well), which is black-box-y.
I agree white-box control is promising!