Oh thanks, I hadn’t found it, gonna delete!
Yeah, basically Davidad has not only a safety plan but also a governance plan which actively aims at making this shift happen!
Thanks for writing that. I’ve been trying to taboo “goals” because it creates so much confusion, which this post tries to decrease. In line with this post, I think what matters is how difficult a task is to achieve, and what it takes to achieve it in terms of ability to overcome obstacles.
“Anthropic’s commitment to follow the ASL scheme thus implies that we commit to pause the scaling and/or delay the deployment of new models whenever our scaling ability outstrips our ability to comply with the safety procedures for the corresponding ASL.”
And/or = or, so I just want to flag that the actual commitment here could be as weak as “we delay the deployment but keep scaling internally”. If it’s a mistake, you can correct it, but if it’s not, it doesn’t seem like a robust commitment to pause to me, even assuming that the conditions for a pause were well established.
Because it’s meaningless to talk about a “compromise” while dismissing one entire side of the people who disagree with you (but only one side!).
Like I could say “global compute thresholds are a robustly good compromise with everyone who disagrees with me”
*Footnote: only those who’re more pessimistic than me.
That may be right but then the claim is wrong. The true claim would be “RSPs seem like a robustly good compromise with people who are more optimistic than me”.
And then the claim isn’t really relevant anymore?
Holden, thanks for this public post.
I would love it if you could write something along the lines of what you wrote here (“If it were all up to me, the world would pause now—but it isn’t, and I’m more uncertain about whether a “partial pause” is good”) at the top of the ARC post. As we discussed, and as I wrote in my post, that would in my opinion make RSPs more likely to be positive by making the distinction between policy and voluntary safety commitments clearer.
Regarding
Responsible scaling policies (RSPs) seem like a robustly good compromise with people who have different views from mine
2. It seems empirically wrong, based on the strong pushback RSPs have received, so at the very least you shouldn’t call it “robustly” good, unless you mean a modified version that would accommodate the most important parts of the pushback.
3. I feel like overall the way you discuss RSPs here is one of the many instances of people talking about idealized RSPs that are never specified and that get pointed to in response to disagreement. See below, from my post:
And second, the coexistence of ARC’s RSP framework with the specific RSPs labs implement allows slack for commitments that are weak within a framework that would in theory allow ambitious commitments. It leads to many arguments of the form:
“That’s the V1. We’ll raise ambition over time.” I’d like to see evidence of that happening over a 5-year timeframe, in any field or industry. I can think of fields, like aviation, where it happened over the course of decades, crash after crash. But if it relies on the expectation that there will be large-scale accidents, that should be made clear. If it relies on the assumption that timelines are long, that should be explicit.
“It’s voluntary, we can’t expect too much, and it’s way better than what currently exists.” Sure, but if the level of catastrophic risk is 1% (which several AI risk experts I’ve talked to believe to be the case for ASL-3 systems) and the framework gives the impression that risks are covered, then the name “responsible scaling” is heavily misleading to policymakers. A more accurate name for 1% catastrophic risk would be “catastrophic scaling”, which is less rosy.
Thanks for the post.
Would your concerns be mostly addressed if ARC had published a suggestion for a much more comprehensive risk management framework, and explicitly said “these are the principles that we want labs’ risk-management proposals to conform to within a few years, but we encourage less-thorough risk management proposals before then, so that we can get some commitments on the table ASAP, and so that labs can iterate in public. And such less-thorough risk management proposals should prioritize covering x, y, z.”
Great question! A few points:
Yes, many of the things I point to are about “how to do things well”, and I would in fact much prefer something that contains a section saying “we are striving towards that and our current effort is insufficient” to the current RSP communication, which is more “here’s how to responsibly scale”.
That said, I think we disagree on the reference class of the effort (you say “a few years”). I think that you could do a very solid MVP of what I suggest with like 5 FTEs over 6 months.
As I wrote in “How to move forward” (worth skimming to understand what I’d change) I think that RSPs would be incredibly better if they:
had a different name
said that they are insufficient
linked to a post which says “here’s the actual thing which is needed to make us safe”.
Answer to your question: if I were optimizing within the paradigm of voluntary lab commitments, as ARC is, yes, I would much prefer that. I flagged earlier though that because labs are definitely not allies on this (an actual risk assessment is likely to output “stop”), I think the “ask labs kindly” strategy is pretty doomed, and I would much prefer a version of ARC that tries to acquire bargaining power one way or another (policy, PR threats, etc.) rather than adapting its framework until labs agree to sign it.
Regarding
If people took your proposal as a minimum bar for how thorough a risk management proposal would be, before publishing, it seems like that would interfere with labs being able to “post the work they are doing as they do it, so people can give feedback and input”.
I don’t think that’s necessarily right. E.g. “the ISO standard asks the organization to define risk thresholds” could be a very simple task, much simpler than developing a full eval. The tricky part is ensuring we comply with such thresholds (and the inability to do that obviously reveals a lack of safety).
Regarding “ISO proposes a much more comprehensive procedure than RSPs”: it’s not right either that it would take longer; it’s just that there exist risk management tools, which you can run in a few days, that help achieve very broad coverage of the scenario set.
Regarding “imply significant chances to be stolen by Russia or China (...). What are the risks downstream of that?”: once again, you can cover the most obvious things in a couple of pages, writing e.g. “Maybe they would give the weights to their team of hackers, which substantially increases the chances of a leak and of an increase in global cyberoffense.” And I would be totally fine with half-baked things if they were communicated as such, and not the way RSPs are.
Two questions related to it:
What happens in your plan if it takes five years to solve the safety evaluation/deception problem for LLMs (i.e. it’s extremely hard)?
Do you have an estimate of P({China; Russia; Iran; North Korea} steals an ASL-3 system with ASL-3 security measures)? Conditional on one of these countries having the system, what’s your guess of p(catastrophe)?
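For concreteness, here’s a minimal sketch of the decomposition behind the second question, with purely hypothetical placeholder numbers (not estimates I’d endorse):

```python
# Hypothetical decomposition of the risk from model-weight theft.
# All numbers are placeholders for illustration, not actual estimates.

p_theft = 0.2             # P(at least one of {China, Russia, Iran, North Korea}
                          # steals an ASL-3 system protected by ASL-3 security)
p_cat_given_theft = 0.05  # P(catastrophe | one of these countries has the system)

p_catastrophe_via_theft = p_theft * p_cat_given_theft
print(f"P(catastrophe via theft) ≈ {p_catastrophe_via_theft:.3f}")  # ≈ 0.010
```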
Thanks Eli for the comment.
One reason why I haven’t provided much evidence is that I think it’s substantially harder to give evidence for a “for all” claim (my side of the claim) than for a “there exists” claim (what I’m asking Evan for). I claim that, based on what I’ve seen, it doesn’t happen that a framework in a niche area evolves this fast without accidents, even in domains with substantial updates, like aviation and nuclear.
I could potentially see it happening with large accidents, but I personally don’t want to bet on that, and I would want it to be transparent if that’s the assumption. I also don’t buy the “small coordinations enable larger coordinations” argument for domain-specific policy. Beyond what you said above, my sense is that policymakers satisfice and hence tend not to revisit a policy that sucks as long as it looks sufficiently good to stakeholders that there are no substantial incentives to change it.
GDPR cookie banners suck for everyone and haven’t been updated yet, 7 years after GDPR. Standards in the EU are not updated more often than every 5 years by default (I’m talking about standards, not regulation), and we’ll have to bargain to try to bring that down to reasonable, AI-specific timeframes.
The IAEA & nuclear safety were upgraded substantially after each accident, and likewise for aviation, but we’re talking about decades, not 5 years.
Thanks for your comment.
I feel like a lot of the issues in this post are that the published RSPs are not very detailed and most of the work to flesh them out is not done.
I strongly disagree with this. In my opinion, a lot of the issue is that RSPs have been designed from first principles, without much consideration for everything the risk management field has done, and hence get things wrong without noticing.
It’s not a matter of how detailed they are; they get the broad principles wrong. As I argued (the entire table is about this), I think that the principles of other existing standards are just way better, so no, it’s not a matter of details.
As I said, the details & evals of RSPs are actually the one thing I’d keep and include in a risk management framework.
Honestly I can’t think of anything much better that could have been reasonably done given the limited time and resources we all have
Well, I recommend looking at Section 3 and the source links. Starting from those frameworks and including evals in them would be a Pareto improvement.
Thanks for your comment.
One issue is that everyone disagrees.
That’s right and that’s a consequence of uncertainty, which prevents us from bounding risks. Decreasing uncertainty (e.g. through modelling or through the ability to set bounds) is the objective of risk management.
Doses of radiation are quite predictable
I think that’s mostly in hindsight. When you read material about nuclear safety from the 1970s, it really didn’t look that way.
See Section 2
the arc of new technology is not [predictable]
I think that this sets a “technology is magic” vibe which is only valid for scaling neural nets (and probably only because we haven’t invested that much in understanding scaling laws etc.), not for most other technologies. We can actually develop technology where we know what it’s doing before building it, and that’s what we should aim for given what’s at stake here.
Thanks a lot for this constructive answer, I appreciate the engagement.
I’ll agree that it would be nice if we knew how to do this, but we do not.
With our current level of understanding, we fall at the first hurdle (we can measure some of the risks).
Three points on that:
I agree that we’re pretty bad at measuring risks. But I think that the combination of AI risk experts x forecasters x risk management experts is a very solid baseline, much more solid than not measuring the aggregate risk at all (see the sketch after this list for the kind of aggregation I have in mind).
I think that we should do our best and measure conservatively, and that to the extent we’re uncertain, that should be reflected in calibrated risk estimates.
I do expect the first few rounds of risk estimates to be overconfident, especially to the extent they include ML researchers’ estimates. My sense from nuclear is that this is what happened there, and that failure after failure, the field got red-pilled. You can read more on this here (https://en.wikipedia.org/wiki/WASH-1400).
Related to that, I think it’s key to provide as many risk-estimate feedback loops as possible by forecasting incidents, in order to red-pill the field faster on the fact that it is overconfident by default about risk levels.
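As a rough sketch of what I mean by turning those expert and forecaster judgments into a calibrated aggregate (the pooling rule and the numbers below are illustrative assumptions, not the specific procedure I’d mandate):

```python
import math

# Illustrative individual estimates of P(catastrophe) from different assessors:
# AI risk experts, calibrated forecasters, risk management experts.
# All values are hypothetical placeholders.
estimates = [0.03, 0.01, 0.002]

# Pool with the geometric mean of odds, one common choice for aggregating
# probability forecasts; other pooling rules are equally defensible.
odds = [p / (1 - p) for p in estimates]
pooled_odds = math.prod(odds) ** (1 / len(odds))
pooled_p = pooled_odds / (1 + pooled_odds)

print(f"Aggregate risk estimate ≈ {pooled_p:.4f}")
# The spread of the individual estimates is itself informative: a wide spread
# signals uncertainty that should be reported alongside the point estimate.
```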
This implies an immediate stop to all frontier AI development (and probably a rollback of quite a few deployed systems). We don’t understand. We cannot demonstrate risks are below acceptable levels.
It’s more complicated than that: you could probably still train code generation systems or other systems with a narrowed-down domain of operation. But I do think that for LLMs, risk levels would be too high to keep scaling fully general LLMs that can be plugged into tools etc. by >4 OOMs.
I think that this would massively benefit systems we understand, which could plausibly reach significant levels of capability at some point in the future (https://arxiv.org/abs/2006.08381). It would probably lead labs to massively invest in that direction.
Given our current levels of understanding, all a team of “experts” could do would be to figure out a lower bound on risk. I.e. “here are all the ways we understand that the system could go wrong, making the risk at least …”.
I agree that by default we’re unable to upper-bound risks, and I think it’s one additional failure of RSPs to act as if we were able to do so. The role of calibrated forecasters in the process is to help keep in mind the uncertainty arising from this.
Why is pushing for risk quantification in policy a bad idea?
[...]
However, since “We should stop immediately because we don’t understand” can be said in under ten words, if any much more lengthy risk-management approach is proposed, the implicit assumption will be that it is possible to quantify the risk in a principled way. It is not.
Quantified risk estimates that are wrong are much worse than underdefined statements.
I think it’s a good point, and that there should be explicit caveats to limit that, but they won’t be enough.
I think it’s a fair concern for quantified risk assessment, and I expect it to be fairly likely that we fail in certain ways if we do only quantified risk assessment over the next few years. That’s why I think we should not only do that, but also deterministic safety analysis and scenario-based risk analysis, which you can think of as sanity checks to ensure you’re not completely wrong in your quantified risk assessment.
Reading your points, I think that one core feature you might be missing here is that uncertainty should be reflected in quantified estimates if we get forecasters into the process. Hence, I expect quantified risk assessment to reveal our lack of understanding rather than suffer from it by default. I still think that your point will partially hold, but much less than in a world where Anthropic dismisses accidental risks as speculative and says they’re “unlikely” (which, as I say, could mean 1/1000, 1/100 or 1/10, but the lack of an explicit number makes the statement sound reasonable) without saying “oh, by the way, we really don’t understand our systems”.
Once again, thanks a lot for your comment!
A more in-depth answer:
The permanent motte and bailey that RSPs allow (easily defensible: a framework that seems arbitrarily extensible, combined with the belief that you can always change things in policy, even over a few-years timeframe; hardly defensible: the actual implementations & the communication around RSPs) is one of the concerns I raise explicitly, and it’s what this comment is doing. Here, while I’m talking in large part about the ARC RSP principles, you say that I’m talking about “current RSPs”. If your answer to anyone who criticizes the principles of RSPs is that we can change even the principles (and not only their application), then that’s a pretty effective way to make something literally impossible to criticize. We could have taken an arbitrary framework, pushed for it, and said “we’ll do better soon, we need wins”. Claiming that we’ll change the framework (not only the application) within 5 years is a very extraordinary claim and does not seem like a good reason to start pushing for a bad framework in the first place.
That’s not true. The “Safe Zone” in ARC’s graph clearly suggests that ASL-3 measures are sufficient. Anthropic’s announcement says it will “require safety, security, and operational standards appropriate to a model’s potential for catastrophic risk”. This implies that ASL-3 measures are sufficient, without actually quantifying the risk (one of the core points of my post), even qualitatively.
At a meta level, I find it frustrating that the most upvoted comment, your comment, is one whose author hasn’t seriously read the post, still makes a claim about the entire post, and doesn’t address my request for evidence about the core crux (evidence of major framework changes within 5 years). If “extremely short timelines” means 5 years, it seems like many people have “extremely short timelines”.
You can see this section which talks about the points you raise.
I’m unsure how much we disagree, Zach and Oliver, so I’ll try to quantify: I would guess that Claude 3 will pull forward the release date of OpenAI’s next-gen models by at least a few months (my guess is 3 months), which has significant effects on timelines.
Tentatively, I’m thinking that this effect may be superlinear. My model is that each new release increases the speed of development (because of increased investment across the whole value chain, including compute, plus people realizing that this isn’t like other technologies, etc.), so a few-months shift now moves AGI timelines by more than a few months.
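As a toy illustration of that compounding story (all parameter values below are hypothetical, chosen only to show the shape of the effect):

```python
# Toy model: a competitive release pulls the next milestone forward directly,
# and also speeds up all subsequent development.
# All parameter values are hypothetical.

baseline_months_to_agi = 120   # assumed remaining timeline without the release
direct_pull_forward = 3        # months the next-gen release date moves up
speedup_factor = 1.05          # assumed 5% faster development afterwards

remaining_after_pull = baseline_months_to_agi - direct_pull_forward
accelerated_remaining = remaining_after_pull / speedup_factor

total_shift = baseline_months_to_agi - accelerated_remaining
print(f"Total timeline shift ≈ {total_shift:.1f} months")  # noticeably more than 3
```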