A more in-depth answer:
The permanent motte and bailey that RSPs allow is one of the concerns I raise explicitly, and it is exactly what this comment is doing. The motte (easily defensible): a framework that seems arbitrarily extensible, combined with the belief that you can always change things in policy, even over a few-year timeframe. The bailey (hardly defensible): the actual implementations of RSPs and the communication around them. Here, while I'm talking in large part about the ARC RSP principles, you say that I'm talking about "current RSPs". If the answer to anyone who criticizes the principles of RSPs is that we can change even the principles (and not only their application), then that's a pretty effective way to make the framework literally impossible to criticize. We could have taken an arbitrary framework, pushed for it, and said "we'll do better soon, we need wins". Claiming that we'll change the framework itself (not only its application) within 5 years is a very extraordinary claim, and it does not seem like a good reason to start pushing for a bad framework in the first place.
And that's not true. The "Safe Zone" in the ARC graph clearly suggests that ASL-3 measures are sufficient. Anthropic's announcement says it will "require safety, security, and operational standards appropriate to a model's potential for catastrophic risk". This implies that ASL-3 measures are sufficient, without actually quantifying the risk (one of the core points of my post), even qualitatively.
At a meta level, I find it frustrating that the most upvoted comment, yours, is one whose author hasn't seriously read the post, yet still makes a claim about the entire post and doesn't address my request for evidence about the core crux (evidence of major framework changes within 5 years). If "extremely short timelines" means 5 years, it seems like many people have "extremely short timelines".
You can see this section, which addresses the points you raise.
Thanks for your comment.
I'm sorry for a misunderstanding of that magnitude, and I will try to clarify this upfront in the post. A large part of my argument is about why the principles of RSPs are not good enough, rather than the specific implementations (which are also insufficient, and which, as I argue in "Overselling, underdelivering", reflect a flaw of the framework itself and not just a problem that will pass).
You can check Section 3 for why I think the principles are flawed, and Sections 1 and 2 to get a better sense of what better principles look like.
Regarding the timeline, I think it's unreasonable to expect major framework changes in less than 5 years. And as I wrote, if you think otherwise, I'd love to hear any example of that happening in the past and the conditions under which it happened.
I do think that within the RSP framework you can maybe do better, but as I argue in Section 3, the framework is fundamentally flawed and should be replaced by a standard risk management framework that includes evals.
Responsible Scaling Policies Are Risk Management Done Wrong
A few other examples off the top of my head:
ARC graph on RSPs with the “safe zone” part
Anthropic calling ASL-4 accidental risks “speculative”
the recent TIME article saying there's no trade-off between progress and safety
More generally, having talked to many people in AI policy/safety, I can say it's a very common pattern. On the eve of the FLI open letter, one of the most senior people in the AI governance & policy x-risk community was explaining that it was stupid to write this letter and that it would make future policy efforts much more difficult, etc.
I think it still makes sense to have a heuristic of the form "I should have a particularly high bar of confidence if I do something deontologically bad that happens to be good for me personally".
Thanks a lot for writing that post.
One question I have regarding fast takeoff: don't you expect learning algorithms much more efficient than SGD to show up and greatly accelerate the rate of capabilities development?
One "overhang" I can see is the fact that humans have written down on the internet a lot of what they know about how to do all kinds of tasks, so a sufficiently data-efficient algorithm could leverage this and fairly suddenly learn a ton of tasks quite rapidly. For instance, in-context learning is way more data-efficient than SGD in pre-training. Right now it doesn't seem like in-context learning is exploited nearly as much as it could be. If we manage to turn ~any SGD learning problem into an in-context learning problem, which IMO could happen with an efficient long-term memory and a longer context length, things could accelerate pretty wildly. Do you think that even things like that (i.e. unlocking a more data-efficient algorithm which allows much faster capabilities development) will necessarily be smoothed out?
Cool thanks.
I've seen that you've edited your post. If you look at the ASL-3 Containment Measures, I'd recommend considering editing away the "Yay" as well.
This post is a pretty significant case of goalpost-moving.
While my initial understanding was that autonomous replication would be a ceiling, this doc now makes it a floor.
So in other words, this paper is proposing to keep navigating beyond levels that are considered potentially catastrophic, with less-than-military-grade cybersecurity, which makes it very likely that at least one state, and plausibly multiple states, will have access to those things.
It also means that the chances of leaking a system which is irreversibly catastrophic are probably not below 0.1%, maybe not even below 1%.
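As a toy illustration of the compounding worry (my addition, with purely hypothetical numbers and an independence assumption that the comment above does not make): if several actors each have a small per-year chance of exfiltrating the weights, the chance that at least one succeeds adds up quickly.

```python
# Toy model: k actors, each with an independent per-year exfiltration probability p.
k = 5        # hypothetical number of capable state-level actors
p = 0.002    # hypothetical per-actor, per-year exfiltration probability (0.2%)
p_any_leak = 1 - (1 - p) ** k
print(f"P(at least one leak in a year) = {p_any_leak:.2%}")  # roughly 1%
```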
My interpretation of the excitement around the proposal is a feeling that “yay, it’s better than where we were before”.
But I think it heavily neglects a few things.
1. It’s way worse than risk management 101, which is easy to push for.
2. The US population is pro-slowdown (so you can basically be way more ambitious than "responsibly scaling").
3. An increasing share of policymakers are worried.
4. Self-regulation has a track record of heavily affecting hard law, either by preventing it or by creating a template that the state can then enforce (that's the ToC I understood from people excited by self-regulation). For instance, I expect this proposal to actively harm efforts to push for ambitious slowdowns that would let us bring the probability of doom below double digits.
For those reasons, I wish this doc didn’t exist.
Can you quote the parts you’re referring to?
[Question] Do LLMs Implement NLP Algorithms for Better Next Token Predictions?
I agree with this general intuition, thanks for sharing.
I'd value descriptions of specific failures you could expect from an LLM that we tried to RLHF against "bad instrumental convergence" but where we failed, or a better sense of how you'd guess this would look on an LLM agent or a scaled-up GPT.
[Question] In the Short-Term, Why Couldn’t You Just RLHF-out Instrumental Convergence?
I meant for these to be part of the “Standards and monitoring” category of interventions (my discussion of that mentions advocacy and external pressure as important factors).
I see. I guess where we might disagree is that IMO a productive social movement could want to apply Henry Spira's playbook (overall pretty adversarial), oriented mostly towards slowing things down until labs have a clue of what they're doing on the alignment front. I would guess you wouldn't agree with that, but I'm not sure.
I think it’s far from obvious that an AI company needs to be a force against regulation, both conceptually (if it affects all players, it doesn’t necessarily hurt the company) and empirically.
I'm not saying that it would be a force against regulation in general, but that it would be a force against any regulation which substantially slows down labs' current rate of capabilities progress. And the empirics don't demonstrate the opposite, as far as I can tell.
Labs have been pushing for the rule that we should wait for evals to say "it's dangerous" before we consider what to do, rather than doing what most other industries do, i.e. assume something is dangerous until proven safe.
Most mentions of a slowdown have described it as something potentially necessary at some point in the distant future, while most people in those labs have <5-year timelines.
Finally, on your conceptual point: as some have argued, it's in fact probably not possible to affect all players equally without a drastic regime of control (which is a real downside of slowing down now, but IMO still much less bad than slowing down only once a leak or a jailbreak of an advanced system can cause a large-scale engineered pandemic), because smaller actors will use the time to try to catch up as close as possible to the frontier.
will comment that it seems like a big leap from “X product was released N months earlier than otherwise” to “Transformative AI will now arrive N months earlier than otherwise.”
I agree, but if anything, my sense is that due to various compound effects (AI accelerating AI, investment, increased compute demand, and more talent arriving earlier), an earlier product release of N months gives only a lower bound on how much TAI timelines shorten (hence greater than N). Moreover, I think that the ChatGPT product release is, ex post at least, not in the typical product release reference class. It was clearly a massive game changer for OpenAI and the entire ecosystem.
Thanks for the clarifications.
But is there another “decrease the race” or “don’t make the race worse” intervention that you think can make a big difference? Based on the fact that you’re talking about a single thing that can help massively, I don’t think you are referring to “just don’t make things worse”; what are you thinking of?
1. I think we agree on the fact that “unless it’s provably safe” is the best version of trying to get a policy slowdown.
2. I believe there are many interventions that could help on the slowdown side, most of which are unfortunately not compatible with being the successful, careful AI lab. The main struggle a successful careful AI lab encounters is that it has to trade off tons of safety principles along the way, essentially because it needs to attract investors & talent, and attracting investors & talent is hard if you say too loudly that we should slow down as long as our thing is not provably safe. So de facto a successful careful AI lab will be a force against slowdown & a bunch of other relevant policies in the policy world. It will also be a force for the perceived race, which makes things harder for every actor.
Other interventions for slowdown are mostly in the realm of public advocacy.
Mostly drawing upon the animal welfare activism playbook, you could use public campaigns to de facto limit the ability of labs to race, via corporate or policy advocacy campaigns.
I agree that this is an effect, directionally, but it seems small by default in a setting with lots of players (I imagine there will be, and is, a lot of "heat" to be felt regardless of any one player's actions). And the potential benefits seem big. My rough impression is that you're confident the costs outweigh the benefits for nearly any imaginable version of this; if that's right, can you give some quantitative or other sense of how you get there?

I guess, heuristically, I tend to take arguments of the form "but others would have done this bad thing anyway" with some skepticism, because I think they tend to assume too much certainty over the counterfactual, in part due to many second-order effects (e.g. the existence of one marginal key player increases the chances that more players invest, shows that competition is possible, etc.) that tend to be hard to compute (but are sometimes observable ex post).
In this specific case, I think it's not right that there are "lots of players" close to the frontier. If we take the case of OA and Anthropic, for example, there are about 0 other players at their level of deployed capabilities. Maybe Google will deploy at some point, but they haven't been a serious player for the past 7 months. So if Anthropic hadn't been around, OA could have chilled longer at the ChatGPT level, and then at GPT-4 without plugins + code interpreter, without suffering from any competitive threat. And now they'll need to do something very impressive to answer the 100k context window, etc.
The compound effects of this are pretty substantial, because each new differentiation accelerates the whole field and pressures teams to find something new, causing a significantly more powerful race to the bottom.
If I had to be (vaguely) quantitative for the past 9 months, I'd guess that the existence of Anthropic has caused (or will cause, if we count the 100k thing) 2 significant counterfactual features and a 3-5 month shortening of timelines (which will probably compound into more due to self-improvement effects). I'd guess there are other effects (e.g. pressure on compute, scaling to drive costs down, etc.) that I'm not able to give even vague estimates for.
My guess for the 3-5 months is mostly driven by the releases of ChatGPT & GPT-4, both of which were likely released earlier than they would have been without Anthropic.
AGI x Animal Welfare: A High-EV Outreach Opportunity?
So I guess first you condition on alignment being solved when we win the race. Why do you think OpenAI/Anthropic are very different from DeepMind?
Thanks for writing that up.
I believe that by not touching the "decrease the race" or "don't make the race worse" interventions, this playbook misses a big part of the picture of "how one single thing could help massively". And this core consideration is also why I don't think the "successful, careful AI lab" picture is right.
Staying at the frontier of capabilities and deploying leads the frontrunner to feel the heat, which accelerates both capabilities progress & the chances of uncareful deployment, which in turn increases the chances of extinction pretty substantially.
Extremely excited to see this new funder.
I'm pretty confident that we can indeed find a significant number of new donors for AI safety given the recent Overton window shift. Chatting with people with substantial networks, it seemed to me like a centralized non-profit fundraising effort could probably raise at least $10M. Happy to intro you to those people if relevant, @habryka.
And reducing the processing time is also very exciting.
So thanks for launching this.
Thanks for writing this.
Overall, I don't like the post much in its current form. There's ~0 evidence (e.g. from Chinese newspapers) and very little actual argumentation. I like that you give us a local view, but adding a few links to back your claims would be very much appreciated. Right now it's hard to update on your post, given that the claims are very empirical and come without any external sources.
A more minor point: I also disagree with the statement "A domestic regulation framework for nuclear power is not a strong signal for a willingness to engage in nuclear arms reduction". I think it's definitely a signal.
Thanks a lot for this constructive answer, I appreciate the engagement.
Three points on that:
I agree that we're pretty bad at measuring risks. But I think that the combination of AI risk experts x forecasters x risk management experts is a very solid baseline, much more solid than not measuring the aggregate risk at all.
I think that we should do our best and measure conservatively, and that to the extent we're uncertain, that uncertainty should be reflected in calibrated risk estimates (see the sketch after these points).
I do expect the first few rounds of risk estimates to be overconfident, especially to the extent they include ML researchers' estimates. My sense from nuclear is that that's what happened there, and that failure after failure, the field got red-pilled. You can read more on this here (https://en.wikipedia.org/wiki/WASH-1400).
Related to that, I think that it’s key to provide as many risk estimate feedback loops as possible by forecasting incidents in order to red-pill the field faster on the fact that they’re overconfident by default on risk levels.
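To make the "calibrated risk estimates" point above concrete, here is a minimal sketch of how several forecasters' probabilities for a given risk scenario could be pooled into one aggregate number. This is my illustration, not something from the post: the numbers are hypothetical, and the geometric mean of odds is just one common pooling choice (calibration-based weighting would be a further refinement).

```python
import math

def pool_geometric_odds(probabilities):
    """Pool individual probability estimates via the geometric mean of odds."""
    # Convert each probability to odds, average in log-odds space,
    # then convert the pooled odds back into a probability.
    log_odds = [math.log(p / (1 - p)) for p in probabilities]
    pooled_odds = math.exp(sum(log_odds) / len(log_odds))
    return pooled_odds / (1 + pooled_odds)

# Hypothetical estimates (one per forecaster) for a defined catastrophic scenario:
estimates = [0.001, 0.01, 0.03, 0.002]
print(f"Pooled risk estimate: {pool_geometric_odds(estimates):.4f}")
```

The point is not the specific pooling rule, but that the output is an explicit number with explicit inputs, which can then be checked against incident forecasts as a feedback loop.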
It's more complicated than that, in that you could probably still train code generation systems or other systems with a narrowed-down domain of operations; but I do think that for LLMs, risk levels would be too high to keep scaling >4 OOMs on fully general LLMs that can be plugged into tools etc.
I think that it would massively benefit systems we understand, which could plausibly reach significant levels of capability at some point in the future (https://arxiv.org/abs/2006.08381). It would probably lead labs to invest massively in those.
I agree that by default we're unable to upper-bound risks, and I think it's one additional failure of RSPs to act as if we were able to do so. The role of calibrated forecasters in the process is to help keep in mind the uncertainty arising from this.
I think it's a good point, and that there should be explicit caveats to limit that, but they won't be enough.
I think it's a fair concern for quantified risk assessment, and I expect it to be fairly likely that we fail in certain ways if we do only quantified risk assessment over the next few years. That's why I think we should not only do that but also do deterministic safety analysis and scenario-based risk analysis, which you can think of as sanity checks to ensure you're not completely wrong in your quantified risk assessment.
Reading your points, I think one core feature you might be missing here is that if we bring forecasters in, uncertainty should be reflected in the quantified estimates themselves. Hence, I expect quantified risk assessment to reveal our lack of understanding rather than suffer from it by default. I still think your point will partially hold, but much less so than in a world where Anthropic dismisses accidental risks as speculative and says they're "unlikely" (which, as I say, could mean 1/1000, 1/100, or 1/10, but the lack of explicitness makes the statement sound reasonable) without saying "oh, by the way, we really don't understand our systems".
Once again, thanks a lot for your comment!