I agree this is a concern, and this is why, for some experiments, we tried “some anti-refusal training + x”. We didn’t run it for all x we tested, mostly because it would have taken more time and cost more.
I don’t think saturation is the whole story because:
just anti-refusal training doesn’t result in alignment faking in GPT-4o and GPT-4.1.
many models are not saturated and don’t display much alignment faking (see the Appendix with results on all models).
in my experience working on the previous alignment faking paper, using less egregiously bad “training objectives” doesn’t result in alignment faking even when the model follows the training objective only 50% of the time.
Thanks for responding, Fabien. I took a closer look at the experiments and results.
> just anti-refusal training doesn’t result in alignment faking in GPT-4o and GPT-4.1.
Yep, looking at your results here, anti-refusal training appears to be necessary but not sufficient for alignment faking in GPT-4o.
> Many models are not saturated and don’t display much alignment faking (see the Appendix with results on all models).
From figures 10 and 11, it seems that of the 25 models tested, only 6 were not saturated (i.e., had a refusal rate below 95%), and 5 of those 6 showed a compliance gap.
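To make the numbers above concrete, here is a minimal sketch (not the paper's code) of how one might flag saturated models and compute a compliance gap from per-model refusal rates. The 95% threshold, the condition names, and the example figures are illustrative assumptions, not the paper's exact definitions.

```python
# Illustrative sketch: flag "saturated" models and compute a compliance gap
# from per-model refusal rates. All names, thresholds, and numbers below are
# assumptions for illustration, not taken from the paper.

SATURATION_THRESHOLD = 0.95  # refusal rate at or above this counts as "saturated"

# Hypothetical per-model refusal rates: (training condition, unmonitored condition)
results = {
    "model_a": (0.99, 0.99),  # refuses almost everything -> saturated, little to measure
    "model_b": (0.40, 0.75),  # complies more when it "thinks" it is being trained
    "model_c": (0.60, 0.62),  # not saturated, but essentially no compliance gap
}

for model, (refusal_train, refusal_unmonitored) in results.items():
    if min(refusal_train, refusal_unmonitored) >= SATURATION_THRESHOLD:
        print(f"{model}: saturated (refusal rate >= {SATURATION_THRESHOLD:.0%} in both conditions)")
        continue
    # One common way to operationalize the compliance gap: how much more the model
    # complies in the "training" condition than in the "unmonitored" condition.
    compliance_gap = (1 - refusal_train) - (1 - refusal_unmonitored)
    print(f"{model}: compliance gap = {compliance_gap:+.2f}")
```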