I did—I lay out a plan for how to get from where we are now to a state where AI goes well from a policy perspective in my RSP post.
Two questions related to it:
What happens in your plan if it takes five years to solve the safety evaluation/deception problem for LLMs (i.e. it’s extremely hard)?
Do you have an estimate of P({China; Russia; Iran; North Korea} steals an ASL-3 system protected by ASL-3 security measures)? Conditional on one of these countries having the system, what’s your guess of P(catastrophe)?
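(For concreteness, these two numbers combine multiplicatively: the risk from this particular path is P(theft) × P(catastrophe | theft). So if, purely as illustration with made-up numbers, one guessed P(theft) = 0.3 and P(catastrophe | theft) = 0.2, the implied risk from theft alone would be 0.06.)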
Do you mind pointing me to the section? I skimmed your post again, and the only relevant thing I saw was this part:
Seeing the existing RSP system in place at labs, governments step in and use it as a basis to enact hard regulation.
By the time it is necessary to codify exactly what safety metrics are required for scaling past models that pose a potential takeover risk, we have clearly solved the problem of understanding-based evals and know what it would take to demonstrate sufficient understanding of a model to rule out e.g. deceptive alignment.
Understanding-based evals are adopted by governmental RSP regimes as hard gating evaluations for models that pose a potential takeover risk.
Once labs start to reach models that pose a potential takeover risk, they either:
Solve mechanistic interpretability to a sufficient extent that they are able to pass an understanding-based eval and demonstrate that their models are safe.
Get blocked on scaling until mechanistic interpretability is solved, forcing a reroute of resources from scaling to interpretability.
My summary of this is something like “maybe voluntary RSPs will make it more likely for governments to force people to do evals. And not just the inadequate dangerous-capabilities evals we have now, but also the better understanding-based evals that do not yet exist; hopefully the relevant technical problems will have been solved in time.”
I think this is better than no government regulation, but the main problem (if I’m understanding this correctly) is that it relies on evals that we do not have.
IMO, a more common-sense approach would be “let’s stop until we are confident that we can proceed safely”, and I’m more excited about those who are pushing for this position.
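To make the mechanism in the quoted excerpt concrete, here is a minimal sketch of the gate it describes. Everything below is hypothetical illustration: the predicate names and the threshold are placeholders I made up, not anything specified in the post.

```python
# Illustrative sketch of the RSP gating logic in the quoted excerpt.
# All names and thresholds are made-up placeholders, not a real API.

def poses_potential_takeover_risk(model: dict) -> bool:
    # Placeholder inclusion criterion; a real one would need to stay
    # meaningful under algorithmic improvement.
    return model.get("capability_score", 0.0) >= 100.0

def passes_understanding_eval(model: dict) -> bool:
    # Placeholder for the not-yet-developed understanding-based eval.
    return bool(model.get("understanding_demonstrated", False))

def may_continue_scaling(model: dict) -> bool:
    """Either pass the understanding-based eval, or be blocked on scaling."""
    if not poses_potential_takeover_risk(model):
        return True  # the gate only binds past the takeover-risk threshold
    if passes_understanding_eval(model):
        return True  # interpretability suffices to demonstrate safety
    return False     # blocked: reroute resources from scaling to interpretability
```

The sketch makes the worry above explicit: everything hinges on the two placeholder predicates, which is exactly the sense in which the plan relies on evals we do not have.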
Aside: I don’t mean to nitpick your wording, but I think a “full plan” would involve many more details. In the absence of those details, it’s hard to evaluate the plan. Examples of some details that would need to be ironed out:
Which systems are licensed under this regime? Who defines what a “model that poses a potential takeover risk” is, and how do we have inclusion criteria that are flexible enough to account for algorithmic improvement?
Who in the government is doing this?
Do we have an international body that is making sure that various countries comply?
How do we make sure the regulator doesn’t get captured?
What does solving mechanistic interpretability mean, and who is determining that?
To be clear, I don’t think you need to specify all of this, and some of these are pretty specific/nitpicky, but I don’t think you should be calling this a “full plan.”
I agree that this is a problem, but it strikes me that we wouldn’t necessarily need a concrete eval—i.e. we wouldn’t need [by applying this concrete evaluation process to a model, we can be sure we understand it sufficiently].
We could have [here is a precise description of what we mean by “understanding a model”, such that we could, in principle, create an evaluation process that answers this question].
We can then say in an RSP that certain types of model must pass an understanding-in-this-sense eval, even before we know how to write such an eval. (Though it’s not obvious to me that defining the right question isn’t already most of the work.)
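As a toy sketch of this spec-versus-eval split (all names hypothetical; the hard part is the precise definition, which this sketch only gestures at):

```python
from abc import ABC, abstractmethod

class UnderstandingEval(ABC):
    """Spec: a precise statement of what 'understanding a model' means.

    An RSP can require passing *some* eval of this type today, even
    though no concrete subclass exists yet.
    """

    @abstractmethod
    def demonstrates_understanding(self, model: object) -> bool:
        """Answer the precisely-defined understanding question for `model`."""

def rsp_understanding_gate(model: object, approved_evals: list["UnderstandingEval"]) -> bool:
    # Well-defined even while `approved_evals` is empty: with no approved
    # eval, no covered model may scale.
    return any(ev.demonstrates_understanding(model) for ev in approved_evals)
```

Note the default: with `approved_evals == []`, the gate returns False for any covered model, i.e. blocked until an eval satisfying the spec actually exists.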
Personally, I’d prefer that this were done already—i.e. that anything we think is necessary should be in the RSP at some level of abstraction / indirection. That might mean describing properties an eval would need to satisfy. It might mean describing processes by which evals could be approved—e.g. deferring to an external board. [Anthropic’s Long-Term Benefit Trust doesn’t seem great for this, since it’s essentially just Paul who’d have relevant expertise (?? I’m not sure about this—it’s just unclear that any of the others would)]
I do think it’s reasonable for labs to say that they wouldn’t do this kind of thing unilaterally—but I would want them to push for a more comprehensive setup when it comes to policy.