For safety cases involving control protocols, it seems reasonable to require frontier labs to use monitors from other LLM providers to reduce the risk of correlated failure modes. At minimum, they could deploy open-source models (which they can run on their in-house GPUs) in simple untrusted-monitoring setups. I haven’t seen this explicitly advocated for within the control literature.
This seems especially relevant in light of the findings from Subliminal Learning, which updated me toward thinking that models of the same family can transmit information without human-understandable semantics (and that this is more difficult between models of different families).
I agree that it would be helpful if AI companies diversified their monitoring ensemble by monitoring their AIs with different AIs from other providers (open weight or otherwise). That is, at least if the security/logistical issues could be resolved.
(We’ve discussed this sort of thing at Redwood, but I don’t think we’ve written this up anywhere.)
I discuss why I think a bunch of different types of ensembling are helpful for untrusted monitoring here[1].
When we monitor using models that we’re very confident are trusted, I think ensembling over different underlying models isn’t actually that useful. It is useful to have more diverse monitor training data, but combining all the data to train one monitor is probably very competitive with using the same data to train multiple different underlying models into being monitors. In general, ensembling over different pretrained models and different architectures isn’t very helpful relative to other types of improvements in ML, and from my understanding it can be costly at inference time. See (e.g.) results from Kaggle competitions, where base model strength is usually more important than ensembling, though ensembling does yield modest gains at high inference cost.
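To make the trade-off concrete, here is a minimal sketch of what score-level ensembling over several monitors looks like. All names, scores, and the threshold are illustrative assumptions, not anything from an actual deployment; the point is just that the ensemble's extra inference cost buys only a combination of per-monitor scores.

```python
# Hypothetical sketch: combining suspicion scores from several monitors
# (each built from a different base model) via a simple mean ensemble.
# Scores and the flagging threshold are made up for illustration.

def ensemble_score(scores: list[float]) -> float:
    """Average the per-monitor suspicion scores (simple mean ensemble)."""
    return sum(scores) / len(scores)

def flag_action(scores: list[float], threshold: float = 0.5) -> bool:
    """Flag an action if the ensembled suspicion score exceeds the threshold."""
    return ensemble_score(scores) > threshold

# Three monitors score the same candidate action:
print(flag_action([0.2, 0.9, 0.7]))  # mean 0.6 > 0.5 -> True
print(flag_action([0.1, 0.2, 0.3]))  # mean 0.2 <= 0.5 -> False
```

Running three monitors triples inference cost for what is ultimately one averaged score, which is the sense in which ensembling is expensive relative to simply training a single stronger monitor on the combined data.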
Cross-lab monitoring: each lab uses its own AI as the primary (e.g. Claude) and asks other labs to monitor the safety-critical activity (e.g. with ChatGPT, Gemini, Llama, or Grok). Cross-lab monitoring would also mitigate AI-enabled coups. Note that it requires new security arrangements, such as provably secure third-party servers.
The tech would need to be a bit messy: you’d need to send your competitor a “zero-knowledge proof” that you are using their AI only for monitoring catastrophic actions, and not for automating R&D, without leaking what your AIs are doing.
Potential setup:
Lab A sends their critical codebase to the third-party server
Lab B sends their monitor weights to the server, and instructs it to assist only with monitoring catastrophic actions
Lab A instructs B-monitor to flag any catastrophic actions
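The three-step setup above can be sketched as a toy protocol. Everything here is hypothetical (the class names, the keyword-based monitor, the plain-string codebase); a real deployment would need attestation, the zero-knowledge usage proofs mentioned above, and hardware security that this sketch entirely omits.

```python
# Toy sketch of the three-step cross-lab monitoring setup.
# All names are hypothetical; this illustrates only the data flow,
# not any real security mechanism.

class ToyMonitor:
    """Stand-in for Lab B's monitor: flags lines containing a keyword."""
    def flag_catastrophic(self, codebase: str) -> list[str]:
        return [line for line in codebase.splitlines()
                if "exfiltrate" in line]

class ThirdPartyServer:
    """Neutral server trusted by both labs to host the monitor."""
    def __init__(self):
        self.codebase = None
        self.monitor = None

    def receive_codebase(self, codebase: str) -> None:
        # Step 1: Lab A sends its safety-critical codebase.
        self.codebase = codebase

    def receive_monitor(self, monitor: ToyMonitor) -> None:
        # Step 2: Lab B sends its monitor, restricted to monitoring only.
        self.monitor = monitor

    def review(self) -> list[str]:
        # Step 3: the B-monitor flags any catastrophic actions in A's code.
        return self.monitor.flag_catastrophic(self.codebase)

server = ThirdPartyServer()
server.receive_codebase("train_model()\nexfiltrate_weights()")
server.receive_monitor(ToyMonitor())
print(server.review())  # ['exfiltrate_weights()']
```

The design point is that neither lab directly handles the other's sensitive artifact: Lab A never sees B's weights, and Lab B only receives flagged outputs, not the full codebase.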