But, as I understand the AO paper, they do not actually validate on models like these; they only test on models that they have trained to be misaligned in fairly straightforward ways. I still think it's good work — I'm just pointing out a gap that future work could address.
I agree, and was pointing to the nearest thing to that I'm currently aware of, for anyone who wanted to try this now.