Here are some ways I think alignment auditing style work can help with decreasing danger:
Better metrics for early detection mean better science you can do on dumber models, a better ability to tell which interventions work, etc. I think Why Do Some Language Models Fake Alignment While Others Don’t? is the kind of thing that’s pretty helpful for working on mitigations!
Forecasting which issues are going to become a serious problem under further scaling, e.g. by saying “ok Model 1 had 1% frequency in really contrived settings, Model 2 had 5% frequency in only-mildly-contrived settings, Model 3 was up to 30% in the mildly-contrived setting and we even saw a couple of cases in realistic environments”, lets you prioritize your danger-decreasing work better by giving you a sense of what’s on the horizon (see the sketch after this list). I want to be able to elicit this stuff from the dumbest/earliest models possible, and I think getting a mature science of alignment auditing is really helpful for that.
Alignment auditing might help in constructing model organisms, or finding natural behavior you might be able to train a model organism to exhibit much more severely.
Maybe one frame is that audits can have both breadth and depth, and a lot of what I’m excited about isn’t just “get wide coverage of model behavior looking for sketchy stuff” but also “have a really good sense of exactly how and where a given behavior happens, in a way you can compare across models and track what’s getting better or worse”.
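To make the “compare across models and track what’s getting better or worse” idea concrete, here’s a minimal sketch of how audit results like the made-up frequencies above might be recorded and summarized. All model names, setting labels, and counts here are hypothetical illustrations (roughly matching the example trajectory in the text), not real audit data.

```python
# Hypothetical sketch: tracking a behavior's frequency across model generations
# and across tiers of setting realism. Names and numbers are illustrative only.

from dataclasses import dataclass


@dataclass
class AuditResult:
    model: str       # which model generation was audited
    setting: str     # how contrived the eliciting environment was
    n_trials: int    # number of audit rollouts in that setting
    n_hits: int      # rollouts where the behavior showed up

    @property
    def frequency(self) -> float:
        return self.n_hits / self.n_trials


# Made-up counts mirroring the example trajectory in the bullet above.
results = [
    AuditResult("model-1", "very contrived", 1000, 10),     # ~1%
    AuditResult("model-2", "mildly contrived", 1000, 50),   # ~5%
    AuditResult("model-3", "mildly contrived", 1000, 300),  # ~30%
    AuditResult("model-3", "realistic", 1000, 2),           # a couple of cases
]

# Group by model so the cross-generation trend is easy to eyeball.
by_model: dict[str, list[AuditResult]] = {}
for r in results:
    by_model.setdefault(r.model, []).append(r)

for model, rows in by_model.items():
    summary = ", ".join(f"{r.setting}: {r.frequency:.1%}" for r in rows)
    print(f"{model}: {summary}")
```

The specific numbers don’t matter; the point is that once results are recorded per (model, setting-realism) cell, the cross-generation trend in a given setting falls out of a trivial group-by, which is the kind of comparison a mature auditing science would make routine.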
I think Why Do Some Language Models Fake Alignment While Others Don’t? is the kind of thing that’s pretty helpful for working on mitigations!
I’m pretty skeptical about mitigations work targeting alignment faking in current models transferring very much to future models.
(I’m more optimistic about this type of work helping us practice making and iterating on model organisms so we’re faster and more effective when we actually have powerful models.)
I agree that if you set out with the goal of “make alignment faking not happen in a 2025 model” you can likely do this pretty easily without having learned anything that will help much for more powerful models. I feel more optimistic about doing science on the conditions under which 2025 models not particularly trained for or against AF exhibit it, and this telling us useful things about risk factors that would apply to future models? Though I think it’s plausible that most of the value is in model organism creation, as you say.