Yes, of course, sorry. I should have said: I think detecting them is (pretty easy and) far from sufficient. Indeed, we have detected them (sandbagging only somewhat), and yes this gives you something to try interventions on, but, like, nobody knows how to solve e.g. alignment faking. I feel good about model organisms work but [pessimistic/uneasy/something] about the “bouncing off alignment audits” vibe.
Edit: maybe ideally I would criticize specific work as not-a-priority. I don’t have specific work to criticize right now (besides interp on the margin), but I don’t really know what work has been motivated by “bouncing off bumpers” or “alignment auditing.” For now, I’ll observe that the vibe worries me, especially the focus on showing that a model is safe relative to improving safety.[1] And, like, I haven’t heard a story for how alignment auditing will solve [alignment faking or sandbagging or whatever], besides maybe that the undesired behavior derives from bad data or reward functions or whatever, and that it’s just feasible to trace the behavior back to that and fix it (this sounds false but I don’t have good intuitions here and would mostly defer if non-Anthropic people were optimistic).
The vibes (at least from some Anthropic safety people, at least historically) have been like: if we can’t show safety, we can just not deploy. In the unrushed regime, “don’t deploy” is a great affordance. In the rushed regime, where you’re the safest developer and another developer will deploy a more dangerous model 2 months later, it’s not good. Given that we’re in the rushed regime, more effort should go toward decreasing danger relative to measuring danger.
Here are some ways I think alignment-auditing-style work can help with decreasing danger:
Better metrics for early detection mean better science that you can do on dumber models, a better ability to tell which interventions work, etc. I think Why Do Some Language Models Fake Alignment While Others Don’t? is the kind of thing that’s pretty helpful for working on mitigations!
Forecasting which issues are going to become a serious problem under further scaling, eg by saying “ok Model 1 had 1% frequency in really contrived settings, Model 2 had 5% frequency in only-mildly-contrived settings, Model 3 was up to 30% in the mildly-contrived setting and we even saw a couple cases in realistic environments”, lets you prioritize your danger-decreasing work better by having a sense of what’s on the horizon (a rough sketch of this kind of bookkeeping is below). I want to be able to elicit this stuff from the dumbest/earliest models possible, and I think getting a mature science of alignment auditing is really helpful for that.
Alignment auditing might help in constructing model organisms, or in finding natural behavior you might be able to train a model organism to exhibit much more severely.
Maybe one frame is that audits can have both breadth and depth, and a lot of what I’m excited about isn’t just “get wide coverage of model behavior looking for sketchy stuff” but also “have a really good sense of exactly how and where a given behavior happens, in a way you can compare across models and track what’s getting better or worse”.
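To make the forecasting point above concrete, here’s a minimal sketch of the kind of bookkeeping I have in mind: per-model, per-setting frequency estimates of a flagged behavior, with confidence intervals, so you can see whether something is trending toward showing up in realistic environments. The models, settings, and numbers are all made up for illustration; this isn’t anyone’s actual audit pipeline.

```python
# Hypothetical sketch: tracking the frequency of a flagged behavior (e.g. alignment
# faking in a contrived eval) across successive model generations, so you can see
# what's on the horizon before it shows up in realistic settings.
# All names and numbers are invented for illustration.

from dataclasses import dataclass
import math

@dataclass
class AuditResult:
    model: str          # model generation, in release order
    setting: str        # how contrived the eval environment is
    n_transcripts: int  # transcripts audited
    n_flagged: int      # transcripts showing the behavior

def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

results = [
    AuditResult("model-1", "very contrived", 2000, 20),     # ~1%
    AuditResult("model-2", "mildly contrived", 2000, 100),  # ~5%
    AuditResult("model-3", "mildly contrived", 2000, 600),  # ~30%
]

for r in results:
    lo, hi = wilson_interval(r.n_flagged, r.n_transcripts)
    print(f"{r.model:8s} {r.setting:18s} "
          f"rate={r.n_flagged / r.n_transcripts:.1%} (95% CI {lo:.1%}-{hi:.1%})")
```

The point of keeping the settings and confidence intervals explicit is that it lets you compare across models honestly, rather than eyeballing a trend from a handful of anecdotes.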
I think Why Do Some Language Models Fake Alignment While Others Don’t? is the kind of thing that’s pretty helpful for working on mitigations!
I’m pretty skeptical about mitigations work targeting alignment faking in current models transferring very much to future models.
(I’m more optimistic about this type of work helping us practice making and iterating on model organisms so we’re faster and more effective when we actually have powerful models.)
I agree that if you set out with the goal of “make alignment faking not happen in a 2025 model” you can likely do this pretty easily without having learned anything that will help much for more powerful models. I feel more optimistic about doing science on the conditions under which 2025 models (not particularly trained for or against AF) exhibit it, and about this telling us useful things about risk factors that would apply to future models? Though I think it’s plausible that most of the value is in model organism creation, as you say.
I would argue that all fixing research is accelerated by having found examples, because studying what happened on your examples gives you better feedback on whether you’ve found fixes or made progress towards them (so long as you’re careful not to overfit and just fit that example or something). I wouldn’t confidently argue that it more directly helps by, eg, helping you find the root cause, though things like “do training data attribution to the problematic data, remove it, and start fine-tuning again” might just work.
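For what it’s worth, here’s the shape of that last idea as a very hypothetical pipeline sketch. `attribution_score` and `finetune` are stand-ins for whatever attribution method (influence-function-style scoring or similar) and training stack you actually have; nothing here is a real library API, it’s just the loop structure.

```python
# Minimal sketch of the "training data attribution -> remove -> fine-tune again" loop.
# `attribution_score` and `finetune` are hypothetical placeholders, not real APIs.

def filter_and_retrain(train_examples, bad_transcripts, attribution_score, finetune,
                       removal_fraction=0.01):
    # Score each training example by how much it contributes to the bad behavior,
    # as measured on the transcripts the audit flagged.
    scored = [(attribution_score(ex, bad_transcripts), ex) for ex in train_examples]
    scored.sort(key=lambda pair: pair[0], reverse=True)

    # Drop the most-implicated slice of the training data...
    n_remove = int(len(scored) * removal_fraction)
    kept = [ex for _, ex in scored[n_remove:]]

    # ...and fine-tune again on the filtered set. Whether the behavior actually goes
    # away (rather than just the examples you traced) is the open empirical question.
    return finetune(kept)
```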