For safety-cases involving control protocols, it seems reasonable to require frontier labs to have monitors from other LLM providers to reduce the risk of correlated failure modes. At the minimum, they could deploy open-source models (which they can deploy on their in-house GPUs) in simple untrusted-monitoring set-ups. I haven’t heard this explicitly vouched for within the control literature.
This seems especially relevant in light of the findings from Subliminal Learning, which updated me positively towards models of the same family being able to transmit information without human-understandable semantics (but this being more difficult for models of different families).
I agree that it would be helpful if AI companies diversified their monitoring ensemble by monitoring their AIs with different AIs from other providers (open weight or otherwise). That is, at least if the security/logistical issues could be resolved.
(We’ve discussed this sort of thing at Redwood, but I don’t think we’ve written this up anywhere.)
I discuss why I think a bunch of different types of ensembling are helpful for untrusted monitoring here[1].
When we monitor using models that we’re very confident are trusted, I think ensembling over different underlying models isn’t actually that useful. It it useful to have more diverse monitor training data, but combining all the data to train one monitor is probably very competitive with using the same data to train multiple different underlying models into being monitors. In general, ensembling over different pretrained models and different architectures isn’t very helpful relative to other types of improvements in ML and it can be costly at inference time from my understanding. See (e.g.) results from kaggle competitions where base model strength is usually more important than ensembling though ensembling does yield modest gains at high inference cost.
For safety-cases involving control protocols, it seems reasonable to require frontier labs to have monitors from other LLM providers to reduce the risk of correlated failure modes. At the minimum, they could deploy open-source models (which they can deploy on their in-house GPUs) in simple untrusted-monitoring set-ups. I haven’t heard this explicitly vouched for within the control literature.
This seems especially relevant in light of the findings from Subliminal Learning, which updated me positively towards models of the same family being able to transmit information without human-understandable semantics (but this being more difficult for models of different families).
I agree that it would be helpful if AI companies diversified their monitoring ensemble by monitoring their AIs with different AIs from other providers (open weight or otherwise). That is, at least if the security/logistical issues could be resolved.
(We’ve discussed this sort of thing at Redwood, but I don’t think we’ve written this up anywhere.)
I discuss why I think a bunch of different types of ensembling are helpful for untrusted monitoring here[1].
When we monitor using models that we’re very confident are trusted, I think ensembling over different underlying models isn’t actually that useful. It it useful to have more diverse monitor training data, but combining all the data to train one monitor is probably very competitive with using the same data to train multiple different underlying models into being monitors. In general, ensembling over different pretrained models and different architectures isn’t very helpful relative to other types of improvements in ML and it can be costly at inference time from my understanding. See (e.g.) results from kaggle competitions where base model strength is usually more important than ensembling though ensembling does yield modest gains at high inference cost.