A week ago, Anthropic quietly weakened their ASL-3 security requirements. Yesterday, they announced ASL-3 protections.
I appreciate the mitigations, but quietly lowering the bar at the last minute so you can meet requirements isn’t how safety policies are supposed to work.
(This was originally a tweet thread (https://x.com/RyanPGreenblatt/status/1925992236648464774) which I’ve converted into a LessWrong quick take.)
What is the change and how does it affect security?
9 days ago, Anthropic changed their RSP so that ASL-3 no longer requires being robust to employees trying to steal model weights if the employee has any access to “systems that process model weights”.
Anthropic claims this change is minor (and calls insiders with this access “sophisticated insiders”).
But, I’m not so sure it’s a small change: we don’t know what fraction of employees could get this access and “systems that process model weights” isn’t explained.
Naively, I’d guess that access to “systems that process model weights” includes employees being able to operate on the model weights in any way other than through a trusted API (a restricted API that we’re very confident is secure). If that’s right, it could be a high fraction! So, this might be a large reduction in the required level of security.
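To make the trusted-API distinction concrete, here’s a minimal sketch (purely my own illustration with hypothetical names, not a description of Anthropic’s actual infrastructure) of the difference between exposing a narrow, audited set of weight operations and letting employees run arbitrary operations against the weights:

```python
# Purely illustrative sketch of a "trusted API" for weight access; all names
# are hypothetical and this is not a description of any real system.
from dataclasses import dataclass

# A narrow, audited surface: the only operations employees can invoke.
ALLOWED_OPS = {"run_inference", "read_metadata"}


@dataclass
class TrustedWeightClient:
    """Employees get only this restricted client, never raw weight files."""

    user: str

    def request(self, op: str, **kwargs) -> str:
        if op not in ALLOWED_OPS:
            # Anything outside the audited surface (copying shards, loading
            # weights onto arbitrary machines, etc.) is refused and logged.
            raise PermissionError(f"{op!r} is not exposed by the trusted API")
        print(f"audit log: {self.user} called {op} with {kwargs}")
        return f"result of {op}"


client = TrustedWeightClient(user="alice")
client.request("run_inference", prompt="hello")  # allowed: inside the API
try:
    # "Systems that process model weights" plausibly covers anything beyond
    # this kind of restricted interface.
    client.request("export_shard", shard=0)
except PermissionError as err:
    print(err)
```

On my naive reading, an employee confined to something like the restricted client above wouldn’t count as having access to “systems that process model weights”, while anyone who can run arbitrary code against the weights would.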
If this does actually apply to a large fraction of technical employees, then I’m also somewhat skeptical that Anthropic can actually be “highly protected” from (e.g.) organized cybercrime groups without meeting the original bar: hacking an insider and using their access is a typical attack path!
Also, one of the easiest ways for security-aware employees to evaluate security is to think about how easily they could steal the weights. So, if you don’t aim to be robust to employees, it might be much harder for employees to evaluate the level of security and then complain about not meeting requirements[1].
Anthropic’s justification and why I disagree
Anthropic justified the change by saying that model theft doesn’t account for much of the risk from amateur CBRN uplift (CBRN-3) and that the risks from AIs being able to “fully automate the work of an entry-level, remote-only Researcher at Anthropic” (AI R&D-4) don’t depend on model theft.
I disagree.
On CBRN: If other actors are incentivized to steal the model for other reasons (e.g., because models become increasingly valuable), it could end up broadly proliferating, which might greatly increase risk, especially as elicitation techniques improve.
On AI R&D: AIs which are over the capability level needed to automate the work of an entry-level researcher could seriously accelerate AI R&D (via high speed, low cost, and narrow superhumanness). If other less safe (or adversarial) actors got access, risk might increase a bunch.[2]
More strongly, ASL-3 security must suffice up until the ASL-4 threshold: it has to cover the entire range from ASL-3 to ASL-4. ASL-4 security itself is still not robust to high-effort attacks from state actors, which could easily be motivated by large AI R&D acceleration.
Under the current RSP, it must suffice until just before AIs can “substantially uplift CBRN [at] state programs” or “cause dramatic acceleration in [overall AI progress]”. These seem like extremely high bars indicating very powerful systems, especially the AI R&D threshold.[3]
As it currently stands, Anthropic might not require ASL-4 security (which still isn’t sufficient for high-effort state actor attacks) until we see something like 5x AI R&D acceleration (and there might be serious issues with measurement lag).
I’m somewhat sympathetic to security not being very important for ASL-3 CBRN, but it seems very important as of the ASL-3 AI R&D threshold and crucial before the AI R&D ASL-4 threshold! I think the ASL-3 AI R&D threshold should probably instead trigger the ASL-4 security requirements!
Overall, Anthropic’s justification for this last-minute change seems dubious, and the security requirements they’ve currently committed to seem dramatically insufficient for AI R&D threat models. To be clear, other companies have worse security commitments.
Concerns about potential noncompliance and lack of visibility
Another concern is that this last-minute change is quite suggestive of Anthropic being out of compliance with their RSP before they weakened the security requirements.
We have to trust Anthropic quite a bit to rule out noncompliance. This isn’t a good state of affairs.
To explain this concern, I’ll need to cover some background on how the RSP works.
The RSP requires ASL-3 security as soon as it’s determined that ASL-3 can’t be ruled out (as Anthropic says is the case for Opus 4).
Here’s how it’s supposed to go:
They ideally have ASL-3 security mitigations ready, including the required auditing.
Once they find the model is ASL-3, they apply the mitigations immediately (if not already applied).
If they aren’t ready, they need temporary restrictions.
My concern is that the security mitigations they had ready when they found the model was ASL-3 didn’t suffice for the old ASL-3 bar but do suffice for the new bar (otherwise why did they change the bar?). So, prior to the RSP change they might have been out of compliance.
It’s certainly possible they remained compliant:
Maybe they had measures which temporarily sufficed for the old higher bar but which were too costly longer term. Also, they could have deleted the weights outside of secure storage until the RSP was updated to lower the bar.
Maybe an additional last-minute security assessment (which wasn’t required to meet the standard?) indicated inadequate security, and they deployed temporary measures until they changed the RSP. It would be bad to depend on a last-minute security assessment for compliance.
(It’s also technically possible that the ASL-3 capability decision was made after the RSP was updated. This would imply the decision was only made 8 days before release, so hopefully this isn’t right. Delaying evals until an RSP change lowers the bar would be especially bad.)
Conclusion
Overall, this incident demonstrates our limited visibility into AI companies. How many employees are covered by the new bar? What triggered this change? Why does Anthropic believe it remained in compliance? Why does Anthropic think that security isn’t important for ASL-3 AI R&D?
I think a higher level of external visibility, auditing, and public risk assessment would be needed (as a bare minimum) before placing any trust in policies like RSPs to keep the public safe from AI companies, especially as they develop existentially dangerous AIs.
To be clear, I appreciate Anthropic’s RSP update tracker and that it explains changes. Other AI companies have mostly worse safety policies: as far as I can tell, o3 and Gemini 2.5 Pro are about as likely to cross the ASL-3 bar as Opus 4, and they have much worse mitigations!
Appendix and asides
I don’t think current risks are existentially high (if current models were fully unmitigated, I’d guess this would cause around 50,000 expected fatalities per year), and temporarily being at a lower level of security for Opus 4 doesn’t seem like that big a deal. Also, given that security is only triggered after a capability decision, the ASL-3 CBRN bar is supposed to include some conservativeness anyway. But, my broader points around visibility stand, and potential noncompliance (especially unreported noncompliance) should be worrying even while the stakes are relatively low.
You can view the page showing the RSP updates including the diff of the latest change here: https://www.anthropic.com/rsp-updates. (Edit: the diff is the “redline” pdf and is only available for the most recent change.) Again, I appreciate that Anthropic has this page and makes it easy to see the changes they make to the RSP.
I find myself quite skeptical that Anthropic actually could rule out that Sonnet 4 and other models weaker than Opus 4 cross the ASL-3 CBRN threshold. How sure is Anthropic that it wouldn’t substantially assist amateurs even after the “possible performance increase from using resources that a realistic attacker would have access to”? I feel like our current evidence and understanding is so weak, and models already substantially exceed virology experts at some of our best proxy tasks.
This skepticism applies similarly or more strongly to other AI companies (and Anthropic’s reasoning is more transparent).
But, this just serves to further drive home ways in which the current regime is unacceptable once models become so capable that the stakes are existential.
One response is that systems this powerful will be open sourced or trained by less secure AI companies anyway. Sure, but the intention of the RSP is (or was) to outline what would “keep risks below acceptable levels” if all actors follow a similar policy.
(I don’t know if I ever bought that the RSP would succeed at this. It’s also worth noting there is an explicit exit clause Anthropic could invoke if they thought the benefits of proceeding outweighed the risks, despite the risks being above an acceptable level.)
This sort of criticism is quite time-consuming and costly for me. For this reason, there are specific concerns I have about AI companies which I haven’t discussed publicly. This is likely true for other people as well. You should keep this in mind when assessing AI companies and their practices.
It also makes these complaints less legible to other employees, whereas other employees might be able to more easily evaluate arguments about what they themselves could do.
It looks like AI 2027 would estimate around 2x AI R&D acceleration for a system which was just over this ASL-3 AI R&D bar (as it seems somewhat more capable than the “Reliable agent” bar). I’d guess more like 1.5x at this point, but either way this is a big deal!
Anthropic says they’ll likely require a higher level of security for this “dramatic acceleration” AI R&D threshold, but they haven’t yet committed to this nor have they defined a lower AI R&D bar which results in an ASL-4 security requirement.
I’d been pretty much assuming that AGI labs’ “responsible scaling policies” are LARP/PR, and that if an RSP ever conflicts with their desire to release a model, either the RSP will be swiftly revised, or the testing suite for the model will be revised such that it doesn’t trigger the measures the AGI lab doesn’t want to trigger. I.e.: that RSPs are toothless and that their only purposes are to showcase how Responsible the lab is and to hype up how powerful a given model ended up being.
This seems to confirm that cynicism.
(The existence of the official page tracking the updates is a (smaller) update in the other direction, though. I don’t see why they’d have it if they consciously intended to RSP-hack this way.)
Employees at Anthropic don’t think the RSP is LARP/PR. My best guess is that Dario doesn’t think the RSP is LARP/PR.
This isn’t necessarily in conflict with most of your comment.
I think I mostly agree the RSP is toothless. My sense is that for any relatively subjective criteria, like making a safety case for misalignment risk, the criteria will basically come down to “what Jared+Dario think is reasonable”. Also, if Anthropic is unable to meet this (very subjective) bar, then Anthropic will still basically do whatever Anthropic leadership thinks is best, whether via maneuvering within the constraints of the RSP commitments, editing the RSP in ways which are defensible, or clearly substantially loosening the RSP and then explaining they needed to do this due to other actors having worse precautions (as is allowed by the RSP). I currently don’t expect clear-cut and non-accidental procedural violations of the RSP (edit: and I think they’ll be pretty careful to avoid accidental procedural violations).
I’m skeptical of normal employees having significant influence on high-stakes decisions via pressuring the leadership, but empirical evidence could change the views of Anthropic leadership.
How you feel about this state of affairs depends a lot on how much you trust Anthropic leadership to make decisions which are good from your perspective.
Minimally it’s worth noting that Dario and Jared are much less concerned about misalignment risk than I am and I expect only partial convergence in beliefs due to empirical evidence (before it’s too late).
I think the RSP still has a few important purposes:
I expect that the RSP will eventually end up with some transparency commitments with some teeth. These won’t stop Anthropic from proceeding if Anthropic leadership thinks that’s best, but they might at least mean there ends up being common knowledge of whether reasonable third parties (or Anthropic leadership) think the current risk is large.
I think the RSP might end up with serious security requirements. I don’t expect these will be met on time in short timelines, but the security bar specified in advance might at least set some expectations about what baseline security should look like.
Anthropic might want to use the RSP to bind itself to the mast so that investors or other groups have a harder time pressuring it to spend less on security/safety.
There are some other more tentative hopes (e.g., eventually getting common expectations of serious security or safety requirements which are likely to be upheld, or regulation) which aren’t impossible.
And there are some small wins already, like Google DeepMind having set some security expectations for itself which it is reasonably likely to follow through on if it isn’t too costly.
Another note: My guess is that people on LessWrong tend to be overly pessimistic about Anthropic leadership (in terms of how good of decisions Anthropic leadership will make under the LessWrong person’s views and values) and Anthropic employees tend to be overly optimistic.
I’m less confident that people on LessWrong are overly pessimistic, but they at least seem too pessimistic about the intentions/virtue of Anthropic leadership.
For the record, I think the importance of “intentions”/values of leaders of AGI labs is overstated. What matters most in the context of AGI labs is the virtue / power-seeking trade-off, i.e. the propensity to make dangerous moves (/burn the commons) to unilaterally grab more power (in pursuit of whatever values).
Stuff like this op-ed, the broken promise of not meaningfully pushing the frontier, Anthropic’s obsession with and single focus on automating AI R&D, Dario’s explicit calls to be the first to RSI AI, or Anthropic’s shady policy activity has provided ample evidence that their propensity to burn the commons to grab more power (probably in the name of some values I would mostly agree with, fwiw) is very high.
As a result, I’m now all-things-considered trusting Google DeepMind slightly more than Anthropic to do what’s right for AI safety. Google, as a big corp, is less likely to make unilateral power-grabbing moves (such as automating AI R&D asap to achieve a decisive strategic advantage), is more likely to comply with regulations, and is already fully independent in its ability to build AGI (compute / money / talent), so its incentives won’t degrade further. Additionally, D. Hassabis has been pretty consistent in his messaging about AI risks & AI policy, about the need for an IAEA/CERN for AI, etc., and Google has been mostly scaling up its safety efforts and has produced some of the best research on AI risk assessment (e.g. this excellent paper, or this one).
IMO, reasonableness and epistemic competence are also key factors. This includes stuff like how effectively they update on evidence, how much they are pushed by motivated reasoning, and how good they are at futurism and thinking about what will happen. I’d also include “general competence”.
(This is a copy of my comment made on your shortform version of this point.)
Not the main thrust of the thread, but for what it’s worth, I find it somewhat anti-helpful to flatten things into a single variable of “how much you trust Anthropic leadership to make decisions which are good from your perspective”, and then ask how optimistic/pessimistic you are about this variable.
I think I am much more optimistic about Anthropic leadership on many axes relative to an overall survey of the US population or Western population – I expect them to be more libertarian, more in favor of free speech, more pro economic growth, more literate, more self-aware, higher IQ, and a bunch of things.
I am more pessimistic than the people who are at Anthropic about their ability to withstand the pressures of a trillion-dollar industry to shape their incentives.
I believe the people working there are siloing themselves intellectually into an institution facing incredible financial incentives for certain bottom lines like “rapid AI progress is inevitable” and “it’s reasonably likely we can solve alignment” and “beating China in the race is a top priority”, and aren’t allowed to talk to outsiders about most details of their work, and this is a key reason that I expect them to screw up their decision-making.
I am optimistic about their relative ability to have a sensible conversation about the next 5 years and what alignment failures look like, relative to most people on earth. This is not the standard I require to expect people to not do ML training runs that lead to human extinction, but nonetheless I predict they will do relatively quite well on this axis.
I don’t have a single variable here; I have a much more complicated model than this. It looks to me that collapsing questions of trust about people or groups into a single variable of how optimistic I am about them making decisions which are good from my values has been a common question-substitution in the Effective Altruism scene, where I think people have been repeatedly hoodwinked by sociopaths due to not moving toward a more detailed model that predicts exactly where and when someone will make good vs bad decisions.
I certainly agree that the pressures and epistemic environment should make you less optimistic about good decisions being made. And that thinking through the overall situation and what types of decisions you care about is important. (Like, you can think of my comment as making a claim about the importance-weighted goodness of decisions.)
I don’t see the relevance of “relative decision-making goodness compared to the general population”, which I think you agree with, but in that case I don’t see what this was responding to.
Not sure I agree with other aspects of this comment and its implications. Like, I think reducing things to a variable like “how good is it to generically empower this person/group” is pretty reasonable in the case of Anthropic leadership, because in a lot of cases they’d have a huge amount of general open-ended power, though a detailed model (taking into account what decisions you care about etc.) would need to feed into this.
What’s an example decision or two where you would want to ask yourself whether they should get more or less open-ended power? I’m not sure what you’re thinking of.
How good/bad is it to work on capabilities at Anthropic?
That’s the most clear-cut case, but lots of stuff trades off Anthropic’s power with other stuff.
I think the main thing I want to convey is that I think you’re saying that LWers (of which I am one) have a very low opinion of the integrity of people at Anthropic, but what I’m actually saying is that their integrity is no match for the forces that they are being tested with.
I don’t need to be able to predict a lot of fine details about individuals’ decision-making in order to have good estimates of these two quantities, and comparing them is the second-most important question relating to whether it’s good to work on capabilities at Anthropic. (The first one is a basic ethical question about working on a potentially extinction-causing technology that is not much related to the details of which capabilities company you’re working for.)
This is related to what I was saying but it wasn’t what I was saying. I was saying “tend to be overly pessimistic about Anthropic leadership (in terms of how good of decisions Anthropic leadership will make under the LessWrong person’s views and values)”. I wasn’t making a claim about the perceived absolute level of integrity.
Probably not worth hashing this out further, I think I get what you’re saying.
Employees at Anthropic don’t think the RSP is LARP/PR. My best guess is that Dario doesn’t think the RSP is LARP/PR.
Yeah, I don’t think this is necessarily in contradiction with my comment. Things can be effectively just LARP/PR without being consciously LARP/PR. (Indeed, this is likely the case in most instances of LARP-y behavior.)
Agreed on the rest.
Can you explain how you got the diffs from https://www.anthropic.com/rsp-updates ? I see the links to previous versions, but nothing that’s obviously a diff view to see the actual changed language.
On the website, it’s the link titled “redline” (it’s only available for the most recent version).
I’ve made these for past versions but they aren’t online at the moment, can provide on request though.
I feel as though I must be missing the motivation for Anthropic to do this. Why put so much effort into safety/alignment research just to intentionally fumble the ball on actual physical security?
I would like to understand why they would resist this. Is increasing physical security so onerous that it’s going to seriously hamper their research efficiency?
I think security is legitimately hard and can be costly in research efficiency. I think there is a defensible case for this ASL-3 security bar being reasonable for the ASL-3 CBRN threshold, but it seems too weak for the ASL-3 AI R&D threshold (hopefully the bar for things like this ends up being higher).
Could you give an example of where security would negatively affect research efficiency? Like what is the actual implementation difficulty that arises from increased physical security?
Every time you want to interact with the weights in some non-basic way, you need to have another randomly selected person inspect in detail all the code and commands you run (a toy sketch of this sort of two-party control is below).
The datacenter and office are airgapped, so you don’t have internet access.
Increased physical security isn’t much of a difficulty.
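To give a sense of where the efficiency cost in the first example comes from, here’s a toy sketch (hypothetical names only, not anyone’s actual tooling) of two-party control: a weight-touching command can’t run until a second, randomly chosen person has read and approved it.

```python
# Toy sketch of two-party control over weight-touching commands; hypothetical
# names only, not a description of any real infrastructure.
import random

REVIEWERS = ["bob", "carol", "dana"]


def run_privileged_command(author: str, command: str, approved: bool) -> None:
    # A reviewer other than the author is chosen at random and must read the
    # full command (and any code it invokes) before it is allowed to run.
    reviewer = random.choice([r for r in REVIEWERS if r != author])
    print(f"{reviewer} must review before execution: {command!r}")
    if not approved:
        raise PermissionError("second-party review has not approved this command")
    print(f"running {command!r} (approved by {reviewer})")


# Even routine operations now wait on another person's detailed review.
run_privileged_command("alice", "copy checkpoint to eval cluster", approved=True)
```

The review step sits in the inner loop of routine research work, which is why this kind of requirement (rather than physical security per se) is where the cost shows up.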
Ah yeah I can totally see how that first one at the least would be a big loss in efficiency. Thanks for clarifying.
This is a great post. Good eye for catching this and making the connections here. I expect to see more “cutting corners” like this, though I’m not sure what to do about it, since internally it won’t feel like corners are being cut but rather like necessary updates that only become obvious in hindsight.