Get to near-0 failure in alignment-loaded tasks that are within the capabilities of the model.
That is, when we run various safety evals, I’d like it if the models genuinely scored near-0. I’d also like it if the models ~never refused improperly, ~never answered when they should have refused, ~never precipitated psychosis, ~never deleted whole codebases, ~never lied in the CoT, and similar.
These are all behavioral standards, and all of them are problems I'm told we'll keep under control. I'd like that capacity demonstrated now, as a precondition of advancing the frontier.
So far, I don't see the prosaic plans working in the easier, near-term cases, yet I'm being asked to believe they'll work in the much harder future cases. They may work 'well enough' now, but the concern is precisely that 'well enough' will be insufficient in the limit.
An alternative condition is ‘full human interpretability of GPT-2 Small’.
This probably wouldn't change my all-things-considered view, but it would substantially modify my expectations, and make me think the world was much saner than today's world.
I know you're pointing out that the easier case still isn't working, but I just want to caution against the "drive it to zero" mentality, since I worry it's exactly the mentality researchers often have. When that's your mental model, any reduction in failure rates will look like progress.
The part I find unlikely is that we wouldn't see multiple failures along the way, growing in magnitude.
IMO the default failure mode here is:
1. We do observe them (or early versions of them).
2. The lab underinvests in the problem.
3. It becomes enough of a problem that it's painful to product or internal capabilities usage.
4. We didn't invest enough to actually solve the underlying problem, and we can't afford not to use the model while we wait for alignment research to catch up.
5. The lab patches over the problem with some "reduces but does not eliminate" technique.
6. The model is then usable, but with harder-to-detect misalignment.
7. Scale capabilities and repeat.
This is the exact loop we’re in now, and the dynamics only intensify with time and capabilities.
The situation you're describing definitely concerns me, and sits about mid-way up the hierarchy of nested problems as I see it (by 'hierarchy' I don't mean importance; I mean the spectrum from object-level empirical work to the realm of pure abstraction).
I tried to capture this at the end of my comment by saying that even the kind of success I outlined probably wouldn't change my all-things-considered view (because there's a whole suite of nested problems at other levels of abstraction, including the one you named), but it would at least update me toward the plausibility of the case they're making.
As it stands, their own tests say they're doing poorly, and they'll probably want to fix that in good faith before they try tackling the kind of dynamic group-epistemic failures you're pointing at.