The paper seems to rely on the monitor’s ability being close to the generator’s.
Not really. An untrusted monitoring protocol like this works with any two models, or even any ensemble of models. You could even run an inter-lab untrusted monitoring setup where e.g. GPT-8 monitors Claude 7 and vice versa to achieve the benefits of monitor decoupling. Obviously if the monitor is much weaker than the generator, you’d have to evaluate up front how much its performance suffers, but you need to do this with trusted monitoring as well.
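To make the point concrete, here is a minimal sketch of the deployment-time loop, assuming only that you can call some generator and some monitor; nothing in it requires the two to share a base model or a lab. The callables and the threshold are hypothetical stand-ins for whatever the control team actually uses, not the paper’s implementation:

```python
from typing import Callable, List, Tuple


def run_untrusted_monitoring(
    tasks: List[str],
    generate: Callable[[str], str],                # UG: task -> proposed solution
    score_suspicion: Callable[[str, str], float],  # UM: (task, solution) -> suspicion score
    audit_threshold: float,
) -> Tuple[list, list]:
    """Route each solution to human audit iff UM's suspicion score clears the threshold."""
    flagged, accepted = [], []
    for task in tasks:
        solution = generate(task)
        score = score_suspicion(task, solution)
        bucket = flagged if score >= audit_threshold else accepted
        bucket.append((task, solution, score))
    return flagged, accepted
```

The only coupling between the two models is the suspicion-score interface, which is why swapping in a weaker or differently trained monitor is a measurement problem rather than a change to the protocol.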
If you mean our safety case sketch doesn’t handle this, I would disagree! Claim C1 lower-bounds how good UM is at catching UG’s backdoors, if UM is trying its hardest. Your evals need to handle sandbagging from UG, and there are known countermeasures for this general type of situation, but we left this as a “Missing Piece” because we didn’t get into the empirical weeds of this ourselves.
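As a rough illustration of how C1-style evidence might be summarized numerically (assuming you’ve already elicited UM’s best effort on a set of red-team backdoors), you’d report a conservative lower confidence bound on the catch rate rather than the point estimate. The interval choice and the numbers below are illustrative, not from the paper:

```python
import math


def wilson_lower_bound(successes: int, trials: int, z: float = 1.645) -> float:
    """Approximate one-sided 95% lower bound on a binomial proportion (Wilson score)."""
    if trials == 0:
        return 0.0
    p_hat = successes / trials
    denom = 1 + z ** 2 / trials
    centre = p_hat + z ** 2 / (2 * trials)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / trials + z ** 2 / (4 * trials ** 2))
    return (centre - margin) / denom


# e.g. UM, pushed to try its hardest, flags 88 of 100 red-team backdoors:
print(f"catch rate >= {wilson_lower_bound(88, 100):.2f}")  # ~0.82
```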
Besides capabilities, UG and UM could also differ in their goals or coding style, but this almost always helps the control team: models with different goals would be less likely to collude, and different styles would make collusion harder to pull off. So if we do have models of similar capabilities, the case where UG = UM is the most conservative, followed by cases where they’re closely related by fine-tuning or distillation.[1]
Has anyone tried the obvious eval here: freeze UM at checkpoint i, continue improving UG to checkpoint j > i, and then measure how the collusion and suspiciousness ROC curves shift with j - i?
I haven’t seen this, but in practice if UG (or UM) changes enough to plausibly shift the ROC curves meaningfully, you’d need to rerun your eval suite regardless. If it’s just a small fine-tuning step, the capability change seems unlikely to be substantial enough to matter, and the control team could spot-check this with cheap subsamples of their evals. Studying the scaling-law question (e.g. how elastic the collusion/suspiciousness ROC curves are to a given improvement in the underlying model’s capabilities) might be interesting for the basic science of control, e.g. for informing protocol design, but once the protocols are fixed the safety case just relies on what your evals say.
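For what it’s worth, the bookkeeping for the eval you describe is simple; the hard part is eliciting good backdoor attempts at each checkpoint. A rough sketch, where `score` and the checkpoint handles are hypothetical placeholders and only the AUROC computation is concrete:

```python
from typing import Callable, Dict, List, Sequence


def auroc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """P(a backdoored sample scores above a clean one), with ties counted as half."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


def checkpoint_gap_curve(
    frozen_um,
    ug_checkpoints: Sequence,
    backdoor_tasks: List[str],
    clean_tasks: List[str],
    score: Callable,  # (um, ug, task, backdoored) -> UM's suspicion score on UG's output
) -> Dict[int, float]:
    """AUROC of the frozen monitor against each successive generator checkpoint."""
    return {
        j: auroc(
            [score(frozen_um, ug, t, True) for t in backdoor_tasks],
            [score(frozen_um, ug, t, False) for t in clean_tasks],
        )
        for j, ug in enumerate(ug_checkpoints)
    }
```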
I know of one exception to this assumption: suppose we actually manage to incriminate UM on the basis of suspiciously missed honeypots. Then a next step would be to try to use these samples to incriminate UG and apply few-shot catastrophe prevention to it. This is obviously easiest when UG and UM are most similar. However, our safety case sketch sidesteps this issue since we don’t assume we’re ever able to incriminate the monitor in the first place (as discussed in “Appendix E: Slack Honeypots”).
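If you did want to go down that road, the simplest version of “suspiciously missed honeypots” is just a one-sided binomial test against the catch rate your evals established. A sketch with illustrative numbers (the escalation threshold would be a policy decision, not something the paper specifies):

```python
from math import comb


def p_value_missed_honeypots(misses: int, honeypots: int, p_catch: float) -> float:
    """P(at least `misses` misses) if UM genuinely catches each honeypot w.p. p_catch."""
    p_miss = 1.0 - p_catch
    return sum(
        comb(honeypots, k) * p_miss ** k * p_catch ** (honeypots - k)
        for k in range(misses, honeypots + 1)
    )


# e.g. evals established p_catch >= 0.9 and UM missed 6 of 20 deployed honeypots:
print(f"p = {p_value_missed_honeypots(6, 20, 0.9):.4f}")  # small p => grounds to scrutinize UM
```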