I think this depends a lot on the details of the detection and whether the failure mode you detect is the sort of thing that has substantial transfer with the actually-hard-to-detect misalignment failure modes we’re most worried about.[1]
E.g., if you detect the AI making a serious exfiltration attempt once and you can't effectively proliferate this substantially or otherwise leverage it into studying the misalignment, then this isn't actually very helpful for finding solutions to the underlying misalignment. It might be pretty helpful for mitigating this misalignment, particularly from this AI, but that's different from preventing the misalignment.
However, if you instead detect the misalignment in a way that you expect is repeatable (as in, you can recatch the model if it’s still misaligned after testing some training intervention) or which allows for getting a good understanding of the exact details of the misalignment, then it seems quite helpful.
It also could be pretty helpful if you get many rounds of iteration (with robust catching) on nearby failure modes, but this requires deploying heavily misaligned models repeatedly.
I'm pretty skeptical that mitigations for very-easy-to-detect failure modes will transfer much to the failure modes I'm worried about.
So, I probably directionally agree with you more than the other comments here.
There are also easy-to-detect (and easy-to-study) failure modes which are among the most concerning, in particular the worst reward hacking failure modes. But detecting and iterating on these earlier is relatively less important, as you can just iterate on the dangerous AI itself since the failure mode is (probably) easy to detect.