@evhub I think it’s great when you and other RSP supporters make it explicit that (a) you don’t think they’re sufficient and (b) you think they can lead to more meaningful regulation.
With that in mind, I think the onus is on you (and institutions like Anthropic and ARC) to say what kinds of regulations you support & why. And then I think most of the value will come from “what actual regulations are people proposing” and not “what is someone’s stance on this RSP thing which we all agree is insufficient.”
Except for the fact that there are ways to talk about RSPs that are misleading for policymakers and that reduce the chance of meaningful regulation. See the end of my comment, and see also Siméon’s sections on misleading communication and on how to move forward.
Also, fwiw, I imagine that timelines/takeoff speeds might be relevant cruxes. And IDK if it’s the main disagreement that you have with Siméon, but I don’t think it’s the main disagreement you have with me.
Even if I thought we would have 3 more meaningful policy windows, I would still think that RSPs have not offered a solid frame/foundation for meaningful regulation, I would still think that they are being communicated about poorly, and I would still want people to focus more on proposals for other regulations & focus less on RSPs.
I did—I lay out a plan for how to get from where we are now to a state where AI goes well from a policy perspective in my RSP post.
Two questions related to it:
What happens in your plan if it takes five years to solve the safety evaluation/deception problem for LLMs (i.e. it’s extremely hard)?
Do you have an estimate of P({China; Russia; Iran; North Korea} steals an ASL-3 system with ASL-3 security measures)? Conditional on one of these countries having the system, what’s your guess of p(catastrophe)?
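The two quantities asked for above combine multiplicatively into an overall risk estimate. A toy calculation makes this explicit; all numbers below are hypothetical placeholders, not estimates from anyone in this thread:

```python
# Toy risk decomposition for the theft question above.
# Both probabilities are made-up placeholders, purely for illustration.
p_steal = 0.10            # P(a listed state actor steals an ASL-3 system despite ASL-3 security)
p_cat_given_steal = 0.05  # P(catastrophe | one of those actors has the system)

# Chain rule: P(catastrophe via theft) = P(steal) * P(catastrophe | steal)
p_catastrophe_via_theft = p_steal * p_cat_given_steal
print(f"P(catastrophe via theft) = {p_catastrophe_via_theft:.3f}")  # prints 0.005
```

The point of asking for both numbers separately is that disagreement can sit in either factor: one can think theft is likely but survivable, or rare but catastrophic.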
Do you mind pointing me to the section? I skimmed your post again, and the only relevant thing I saw was this part:
Seeing the existing RSP system in place at labs, governments step in and use it as a basis to enact hard regulation.
By the time it is necessary to codify exactly what safety metrics are required for scaling past models that pose a potential takeover risk, we have clearly solved the problem of understanding-based evals and know what it would take to demonstrate sufficient understanding of a model to rule out e.g. deceptive alignment.
Understanding-based evals are adopted by governmental RSP regimes as hard gating evaluations for models that pose a potential takeover risk.
Once labs start to reach models that pose a potential takeover risk, they either:
Solve mechanistic interpretability to a sufficient extent that they are able to pass an understanding-based eval and demonstrate that their models are safe.
Get blocked on scaling until mechanistic interpretability is solved, forcing a reroute of resources from scaling to interpretability.
My summary of this is something like “maybe voluntary RSPs will make it more likely for governments to force people to do evals. And not just the inadequate dangerous-capabilities evals we have now, but also the better understanding-based evals that have not yet been developed (hopefully we will have solved the relevant technical problems in time).”
I think this is better than no government regulation, but the main problem (if I’m understanding this correctly) is that it relies on evals that we do not have.
IMO, a more common-sense approach would be “let’s stop until we are confident that we can proceed safely”, and I’m more excited about those who are pushing for this position.
Aside: I don’t mean to nitpick your wording, but I think a “full plan” would involve many more details. In the absence of those details, it’s hard to evaluate the plan. Examples of some details that would need to be ironed out:
Which systems are licensed under this regime? Who defines what a “model that poses a potential takeover risk” is, and how do we have inclusion criteria that are flexible enough to account for algorithmic improvement?
Who in the government is doing this?
Do we have an international body that is making sure that various countries comply?
How do we make sure the regulator doesn’t get captured?
What does solving mechanistic interpretability mean, and who is determining that?
To be clear, I don’t think you need to specify all of this, and some of these are pretty specific/nit-picky, but I don’t think you should be calling this a “full plan.”
I agree that this is a problem, but it strikes me that we wouldn’t necessarily need a concrete eval—i.e. we wouldn’t need [by applying this concrete evaluation process to a model, we can be sure we understand it sufficiently].
We could have [here is a precise description of what we mean by “understanding a model”, such that we could, in principle, create an evaluation process that answers this question].
We can then say in an RSP that certain types of model must pass an understanding-in-this-sense eval, even before we know how to write an understanding-in-this-sense eval. (though it’s not obvious to me that defining the right question isn’t already most of the work)
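One way to picture the distinction being drawn here—committing to a precisely specified question before any concrete eval exists—is as an interface that future evals must implement. This is only an illustrative sketch; the class and method names are invented for this comment, not taken from any actual RSP:

```python
from abc import ABC, abstractmethod

class UnderstandingEval(ABC):
    """Precise specification of the question an eval must answer.

    An RSP can require that covered models pass *some* eval
    implementing this interface, even before any concrete
    implementation of it exists.
    """

    @abstractmethod
    def sufficient_understanding(self, model) -> bool:
        """Return True iff we understand `model` well enough to rule
        out e.g. deceptive alignment, per the written specification."""

class NotYetImplementedEval(UnderstandingEval):
    """Placeholder: the policy commitment can be in force while the
    technical problem remains unsolved; scaling is simply gated."""
    def sufficient_understanding(self, model) -> bool:
        raise NotImplementedError("understanding-based evals are an open problem")
```

On this framing, "defining the right question" is writing the docstring of `sufficient_understanding` precisely enough that two reasonable evaluators would agree on what counts as passing—which, as noted above, may already be most of the work.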
Personally, I’d prefer that this were done already—i.e. that anything we think is necessary should be in the RSP at some level of abstraction / indirection. That might mean describing properties an eval would need to satisfy. It might mean describing processes by which evals could be approved—e.g. deferring to an external board. [Anthropic’s Long Term Benefit Trust doesn’t seem great for this, since it’s essentially just Paul who’d have relevant expertise (?? I’m not sure about this—it’s just unclear that any of the others would)]
I do think it’s reasonable for labs to say that they wouldn’t do this kind of thing unilaterally—but I would want them to push for a more comprehensive setup when it comes to policy.