I agree that this is a problem, but it strikes me that we wouldn’t necessarily need a concrete eval—i.e. we wouldn’t need [by applying this concrete evaluation process to a model, we can be sure we understand it sufficiently].
We could instead have [here is a precise description of what we mean by “understanding a model”, such that we could, in principle, create an evaluation process that answers this question].
We can then say in an RSP that certain types of model must pass an understanding-in-this-sense eval, even before we know how to write an understanding-in-this-sense eval. (Though it's not obvious to me that defining the right question isn't already most of the work.)
Personally, I’d prefer that this were done already—i.e. that anything we think is necessary should be in the RSP at some level of abstraction / indirection. That might mean describing properties an eval would need to satisfy. It might mean describing processes by which evals could be approved—e.g. deferring to an external board. [Anthropic’s Long Term Benefit Trust doesn’t seem great for this, since it’s essentially just Paul who’d have relevant expertise (?? I’m not sure about this—it’s just unclear that any of the others would)]
I do think it’s reasonable for labs to say that they wouldn’t do this kind of thing unilaterally—but I would want them to push for a more comprehensive setup when it comes to policy.