Should you train against the monitor?
The answer is “obviously sometimes” — given the arguments in my award-winning case for mixed deployment.
But putting those arguments aside, if we could only deploy a single model, should that model have undergone training against the monitor? I don’t know. I’m like — 35% yes? Not a robust opinion at all.
I think it basically comes down to: how much safety do you think comes from the alignment of the models versus their monitorability? Clearly, if the monitorability isn’t providing anything (e.g. because we’ve already deferred to the AIs), then you should train against the monitor.
I think that, over crunch time, our safety pillar will shift from monitorability to alignment, so we should slowly start training against the monitor.