I found this post very frustrating, because it’s almost all dedicated to whether current RSPs are sufficient or not (I agree that they are insufficient), but that’s not my crux and I don’t think it’s anyone else’s crux either. And for what I think is probably the actual crux here, you only have one small throwaway paragraph:
Which brings us to the question: “what’s the effect of RSPs on policy and would it be good if governments implemented those”. My answer to that is: an extremely ambitious version, yes; the misleading version, no. No, mostly because of the short time we have before we see heightened levels of risks, which gives us very little time to update regulations, which is a core assumption on which RSPs are relying without providing evidence of being realistic.
As I’ve talked about now extensively, I think enacting RSPs in policy now makes it easier, not harder, to get even better future regulations enacted. It seems that your main reason for disagreement is that you believe in extremely short timelines / fast takeoff, such that we will never get future opportunities to revise AI regulation. That seems pretty unlikely to me: my expectation especially is that as AI continues to heat up in terms of its economic impact, new policy windows will keep arising in rapid succession, and that we will see many of them before the end of days.
I’m sorry for that magnitude of misunderstanding, and will try to clarify it upfront in the post, but a large part of my argument is about why the principles of RSPs are not good enough, rather than the specific implementation (which is also not sufficient though, and which I argue in “Overselling, underdelivering” is one of the flaws of the framework and not just a problem that will pass).
You can check Section 3 for why I think that the principles are flawed, and Section 1 and 2 to get a better sense of what better principles look like.
Regarding the timeline, I think that it’s unreasonable to expect major framework changes over less than 5 years. And as I wrote, if you think otherwise, I’d love to hear any example of that happening in the past and the conditions under which it happened.
I do think that within the RSP framework, you can maybe get better but as I argue in Section 3, I think the framework is fundamentally flawed and should be replaced by a standard risk management framework, in which we include evals.
@evhub I think it’s great when you and other RSP supporters make it explicit that (a) you don’t think they’re sufficient and (b) you think they can lead to more meaningful regulation.
With that in mind, I think the onus is on you (and institutions like Anthropic and ARC) to say what kind of regulations you support & why. And then I think most of the value will come from “what actual regulations are people proposing” and not “what is someone’s stance on this RSP thing which we all agree is insufficient.”
Except for the fact that there are ways to talk about RSPs that are misleading for policymakers and reduce the chance of meaningful regulations. See the end of my comment and see also Simeon’s sections on misleading and how to move forward.
Also, fwiw, I imagine that timelines/takeoff speeds might be relevant cruxes. And IDK if it’s the main disagreement that you have with Siméon, but I don’t think it’s the main disagreement you have with me.
Even if I thought we would have 3 more meaningful policy windows, I would still think that RSPs have not offered a solid frame/foundation for meaningful regulation, I would still think that they are being communicated about poorly, and I would still want people to focus more on proposals for other regulations & focus less on RSPs.
I did—I lay out a plan for how to get from where we are now to a state where AI goes well from a policy perspective in my RSP post.

Two questions related to it:
What happens in your plan if it takes five years to solve the safety evaluation/deception problem for LLMs (i.e. it’s extremely hard)?
Do you have an estimate of P({China; Russia; Iran; North Korea} steals an ASL-3 system with ASL-3 security measures)? Conditional on one of these countries having the system, what’s your guess of p(catastrophe)?
Do you mind pointing me to the section? I skimmed your post again, and the only relevant thing I saw was this part:
Seeing the existing RSP system in place at labs, governments step in and use it as a basis to enact hard regulation.
By the time it is necessary to codify exactly what safety metrics are required for scaling past models that pose a potential takeover risk, we have clearly solved the problem of understanding-based evals and know what it would take to demonstrate sufficient understanding of a model to rule out e.g. deceptive alignment.
Understanding-based evals are adopted by governmental RSP regimes as hard gating evaluations for models that pose a potential takeover risk.
Once labs start to reach models that pose a potential takeover risk, they either:
Solve mechanistic interpretability to a sufficient extent that they are able to pass an understanding-based eval and demonstrate that their models are safe.
Get blocked on scaling until mechanistic interpretability is solved, forcing a reroute of resources from scaling to interpretability.
My summary of this is something like “maybe voluntary RSPs will make it more likely for governments to force people to do evals. And not just the inadequate dangerous capabilities evals we have now, but also the better understanding-based evals that are not yet developed, but hopefully we will have solved some technical problems in time.”
I think this is better than no government regulation, but the main problem (if I’m understanding this correctly) is that it relies on evals that we do not have.
IMO, a more common-sense approach would be “let’s stop until we are confident that we can proceed safely”, and I’m more excited about those who are pushing for this position.
Aside: I don’t mean to nitpick your wording, but I think a “full plan” would involve many more details. In the absence of those details, it’s hard to evaluate the plan. Examples of some details that would need to be ironed out:
Which systems are licensed under this regime? Who defines what a “model that poses a potential takeover risk” is, and how do we have inclusion criteria that are flexible enough to account for algorithmic improvement?
Who in the government is doing this?
Do we have an international body that is making sure that various countries comply?
How do we make sure the regulator doesn’t get captured?
What does solving mechanistic interpretability mean, and who is determining that?
To be clear I don’t think you need to specify all of this, and some of these are pretty specific/nit-picky, but I don’t think you should be calling this a “full plan.”
I agree that this is a problem, but it strikes me that we wouldn’t necessarily need a concrete eval—i.e. we wouldn’t need [by applying this concrete evaluation process to a model, we can be sure we understand it sufficiently].
We could have [here is a precise description of what we mean by “understanding a model”, such that we could, in principle, create an evaluation process that answers this question].
We can then say in an RSP that certain types of model must pass an understanding-in-this-sense eval, even before we know how to write an understanding-in-this-sense eval. (though it’s not obvious to me that defining the right question isn’t already most of the work)
Personally, I’d prefer that this were done already—i.e. that anything we think is necessary should be in the RSP at some level of abstraction / indirection. That might mean describing properties an eval would need to satisfy. It might mean describing processes by which evals could be approved—e.g. deferring to an external board. [Anthropic’s Long Term Benefit Trust doesn’t seem great for this, since it’s essentially just Paul who’d have relevant expertise (?? I’m not sure about this—it’s just unclear that any of the others would)]
I do think it’s reasonable for labs to say that they wouldn’t do this kind of thing unilaterally—but I would want them to push for a more comprehensive setup when it comes to policy.
The permanent motte and bailey that RSPs allow (easily defensible: a framework that seems arbitrarily extensible, combined with the belief that you can always change things in policy, even over a few-years timeframe; hardly defensible: the actual implementations & the communication around RSPs) is one of the concerns I raise explicitly, and it’s what this comment is doing. Here, while I’m talking in large part about the ARC RSP principles, you say that I’m talking about “current RSPs”. If your answer to anyone who criticizes the principles of RSPs is that we can change even the principles (and not only their application), then that’s a pretty effective way to make something literally impossible to criticize. We could have taken an arbitrary framework, pushed for it, and said “we’ll do better soon, we need wins”. Claiming that we’ll change the framework (not only the application) within 5 years is a very extraordinary claim, and does not seem a good reason to start pushing for a bad framework in the first place.
That it’s not true. The “Safe Zone” in ARC’s graph clearly suggests that ASL-3 measures are sufficient. Anthropic’s announcement says it will “require safety, security, and operational standards appropriate to a model’s potential for catastrophic risk”. It implies that ASL-3 measures are sufficient, without actually quantifying the risk (one of the core points of my post), even qualitatively.
At a meta level, I find it frustrating that the most upvoted comment, your comment, is one that hasn’t seriously read the post, still makes a claim about the entire post, and doesn’t address my request for evidence about the core crux (evidence of major framework changes within 5 years). If “extremely short timelines” means 5 years, it seems like many people have “extremely short timelines”.
As I’ve talked about now extensively, I think enacting RSPs in policy now makes it easier, not harder, to get even better future regulations enacted.
I appreciate the evidence you’ve provided on this, and in particular I think it’s more than has been provided for the opposite claim and would encourage Simeon and others criticizing RSPs along these lines to provide more evidence (as I commented on a draft of his post).
That being said, I don’t yet find the evidence you’ve provided particularly compelling. I believe you are referring mainly to this section of your posts:
In the theory of political capital, it is a fairly well-established fact that “Everybody Loves a Winner.” That is: the more you succeed at leveraging your influence to get things done, the more influence you get in return. This phenomenon is most thoroughly studied in the context of the ability of U.S. presidents to get their agendas through Congress—contrary to a naive model that might predict that legislative success uses up a president’s influence, what is actually found is the opposite: legislative success engenders future legislative success, greater presidential approval, and long-term gains for the president’s party.
I don’t understand how the links in this section show that “Everybody Loves a Winner” is a fairly well-established fact that translates to the situation of RSPs. The first link is an op-ed that is paywalled. The second link is a 2013 paper with 7 citations. From the abstract it appears to show that US presidents get higher approval ratings when they succeed in passing legislation, and vice versa. The third link is a 2011 paper with 62 citations (which seems higher, though I’m not sure how high that is for its field). From the abstract it appears to show that presidents who pass their agendas in Congress help their party win more Congressional seats. These interpretations don’t seem too different from the way you summarized it.
Assuming that this version of “Everybody Loves a Winner” is in fact a well-established fact in the field, it still seems like the claims it’s making might not translate to the RSP context fairly well. In particular, RSPs are a legislative framework on a specific (currently niche) issue of AI safety. The fact that Presidents who in general get things done tend to get other benefits including perhaps getting more things done later doesn’t seem that relevant to the question of to what extent frameworks on a specific issue tend to be “locked in” after being enacted into law, vs. useful blueprints for future iteration (including potentially large revisions to the framework).
Again, I appreciate you at least providing some evidence but it doesn’t seem convincing to me. FWIW my intuitions lean a bit toward your claims (coming from a startup-y background of push out an MVP then iterate from there), but I have a lot of uncertainty.
(This comment is somewhat like an expanded version of my tweet, which also asked for “Any high-quality analyses on whether pushing more ambitious policies generally helps/hurts the more moderate policies, and vice/versa?”. I received answers like “it depends” and “unclear”.)
One reason why I haven’t provided much evidence is that I think it’s substantially harder to give evidence for a “for all” claim (my side of the claim) than for a “there exists” claim (what I’m asking Evan for). Based on what I’ve seen, I claim that frameworks covering a niche area don’t evolve that fast without accidents, even in domains with substantial updates, like aviation and nuclear.
I could potentially see it happening after large accidents, but I personally don’t want to bet on that, and I would want it to be made transparent if that’s the assumption. I also don’t buy the “small coordinations enable larger coordinations” argument for domain-specific policy. Beyond what you said above, my sense is that policymakers satisfice, and hence tend not to revisit a policy that sucks as long as it looks sufficiently good to stakeholders that there’s no substantial incentive to change it.
GDPR cookie banners suck for everyone and still haven’t been updated, 7 years after GDPR. Standards in the EU are not updated more rapidly than every 5 years by default (I’m talking about standards, not regulation), and we’ll have to bargain to try to bring that down to reasonable AI-specific timeframes.
The IAEA & nuclear safety rules were upgraded substantially after each accident, and likewise for aviation, but we’re talking about decades, not 5 years.
Cross-posted with the EA Forum.