I think "Why Do Some Language Models Fake Alignment While Others Don't?" is the kind of thing that's pretty helpful for working on mitigations!
I'm pretty skeptical about mitigations work targeting alignment faking in current models transferring very much to future models.
(I’m more optimistic about this type of work helping us practice making and iterating on model organisms so we’re faster and more effective when we actually have powerful models.)
I agree that if you set out with the goal of "make alignment faking not happen in a 2025 model" you can likely do this pretty easily without having learned anything that will help much for more powerful models. I feel more optimistic about doing science on the conditions under which 2025 models that weren't particularly trained for or against alignment faking exhibit it, and whether this tells us useful things about risk factors that would apply to future models? Though I think it's plausible that most of the value is in model organism creation, as you say.