The capability evaluations in the Opus 4.5 system card seem worrying. The evidence provided in the system card seems pretty weak (in terms of how much it supports Anthropic’s claims). I plan to write more about this in the future; here are some of my quickly written-up thoughts.
[This comment is based on this X/twitter thread I wrote]
I ultimately basically agree with their judgments about the capability thresholds they discuss. (I think the AI is very likely below the relevant AI R&D threshold, the CBRN-4 threshold, and the cyber thresholds.) But, if I just had access to the system card, I would be much more unsure. My view depends a lot on assuming some level of continuity from prior models (and assuming Opus 4.5 wasn’t a big scale-up relative to prior models), on other evidence (e.g. METR time horizon results), and on some pretty illegible things (e.g. making assumptions about evaluations Anthropic ran or about the survey they did).
Some specifics:
Autonomy: For autonomy, evals are mostly saturated, so they depend on an (underspecified) employee survey. They do specify a threshold, but the threshold seems totally consistent with a large chance of being above the relevant RSP threshold. (In particular, the threshold is “A majority of employees surveyed think the AI can’t automate a junior researcher job AND a majority think uplift is <3x”. If 1⁄4 of employees thought it could automate a junior researcher job, that would be a lot of evidence for a substantial chance it could! See the sketch after this list.)
Cyber: Evals are mostly saturated. They don’t specify any threshold or argue for their ultimate judgment that the AI doesn’t pose catastrophic cyber risk.
Bio: To rule out the CBRN-4 threshold (uplift for moderately resourced state programs, e.g. North Korea), they seem to depend mostly on a text-based uplift trial. The model is extremely close to the relevant threshold and it’s unclear how much confidence we should have in this uplift trial.
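On the autonomy survey point above: here’s a minimal sketch of the stated decision rule, showing that a survey where a quarter of employees think the model can automate a junior researcher’s job still passes. The function and the example numbers are my own illustration, not Anthropic’s actual procedure or data:

```python
# Minimal sketch of the stated survey rule (illustration only; not
# Anthropic's actual procedure or data).

def survey_rule_passes(frac_can_automate: float, frac_uplift_3x: float) -> bool:
    """Stated rule: below threshold iff a majority say the AI can't
    automate a junior researcher job AND a majority say uplift is <3x."""
    return frac_can_automate < 0.5 and frac_uplift_3x < 0.5

# Hypothetical survey: 25% of employees think the model can automate a
# junior researcher's job, 30% think uplift is >= 3x.
print(survey_rule_passes(0.25, 0.30))  # True: the rule passes anyway,
# even though 1 in 4 researchers saying "yes" is substantial evidence.
```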
Generally, it seems like the current situation is that capability evals don’t provide much assurance. This is partially Anthropic’s fault (they are supposed to do better) and partially because the problem is just difficult and unsolved.
I still think Anthropic is probably, for the most part, doing a better job evaluating capabilities than other companies.
(It would be kinda reasonable for them to clearly say “Look, evaluating capabilities well is too hard and we have bigger things to worry about, so we’re going to half-ass this and make our best guess. This means we’re no longer providing much/any assurance, but we think this is a good tradeoff given the situation.”)
Some (quickly written) recommendations:
We should actually get some longer+harder AI R&D/autonomy tasks. E.g., tasks that take a human a week or two (and that junior researchers at Anthropic can somewhat reliably do). The employee survey should be improved (make sure employees have had access for >1-2 weeks, give us the exact questions, probably sanity check this more) and the threshold should probably be lower (if 1⁄4 of the employees do think the AI can automate a junior researcher, why should we have much confidence that it can’t?).
Anthropic should specify a threshold for cyber or make it clear what they are using to make judgments. It would also be fine for them to say “We are no longer making a judgment on whether our AIs are above ASL-3 cyber, but we guess they probably aren’t. We won’t justify this.”
On bio, I think we need more third-party review of their evals and some third-party judgment of the situation, because we’re plausibly getting into a pretty scary regime and their evals are extremely illegible.
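To illustrate one reason third-party statistical review matters: with plausible trial sizes, an estimated uplift ratio near a threshold can carry a wide confidence interval. A minimal sketch; every number below is made up for illustration and has no relation to Anthropic’s actual trial design or data:

```python
# Sketch: how wide is the uncertainty on an uplift ratio estimated from a
# small text-based uplift trial? All numbers are made up for illustration.
import random

random.seed(0)
control = [random.gauss(40, 15) for _ in range(25)]   # hypothetical scores
assisted = [random.gauss(70, 15) for _ in range(25)]  # with model access

def uplift(c, a):
    """Ratio of mean assisted score to mean control score."""
    return (sum(a) / len(a)) / (sum(c) / len(c))

# Bootstrap the ratio to get a rough 95% interval.
ratios = sorted(
    uplift(random.choices(control, k=len(control)),
           random.choices(assisted, k=len(assisted)))
    for _ in range(10_000)
)
lo, hi = ratios[int(0.025 * len(ratios))], ratios[int(0.975 * len(ratios))]
print(f"point estimate {uplift(control, assisted):.2f}x, 95% CI [{lo:.2f}, {hi:.2f}]")
# With ~25 participants per arm, the interval can easily span both sides of
# a decision threshold, so "close to the threshold" is hard to adjudicate.
```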
We’re probably headed towards a regime of uncertainty and limited assurance. Right now is easy mode and we’re failing to some extent.
> and assuming Opus 4.5 wasn’t a big scale-up relative to prior models
It seems plausible that Opus 4.5 has much more RLVR than Opus 4 or Opus 4.1, catching up to Sonnet in RLVR-to-pretraining ratio (Gemini 3 Pro is probably the only other model in its weight class, with a similar amount of RLVR). If it’s a large model (many trillions of total params) that wouldn’t run decode/generation well on 8-chip Nvidia servers (with ~1 TB of HBM per scale-up world), it could still be efficiently pretrained on such servers (if an overly large batch size isn’t a bottleneck), but couldn’t be RLVRed or served on them with any efficiency.
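A rough back-of-the-envelope for the memory constraint; the parameter count and weight precision below are my own assumptions for illustration, not known figures for Opus 4.5:

```python
# Back-of-the-envelope: do the weights of a multi-trillion-param model fit
# in one 8-chip Nvidia server's HBM? Param count and precision are
# assumptions for illustration, not known figures.
total_params = 4e12          # hypothetical: 4T total parameters
bytes_per_param = 2          # bf16/fp16 weights
hbm_per_server_tb = 1.0      # ~1 TB HBM per 8-chip scale-up world

weights_tb = total_params * bytes_per_param / 1e12
print(f"weights alone: {weights_tb:.1f} TB vs {hbm_per_server_tb:.1f} TB HBM")
# -> 8.0 TB vs 1.0 TB: efficient decode wants the weights (plus KV cache)
# resident within one fast-interconnect domain, so serving and RLVR
# generation don't fit on such a server. Pretraining tolerates sharding
# across many servers over slower links, as long as a very large global
# batch size isn't a bottleneck.
```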
As we see with the API price drop, they likely have enough inference hardware now with large scale-up worlds (probably Trainium 2, possibly Trillium, though in principle GB200/GB300 NVL72 would also do), which wasn’t the case for Opus 4 and Opus 4.1. This hardware would also have enabled them to do efficient large-scale RLVR training, which they possibly also couldn’t yet do in the days of Opus 4 and Opus 4.1 (but there wouldn’t have been an issue with Sonnet, which would fit in 8-chip Nvidia servers, so they mostly needed to apply its post-training process to the larger model).
> What would hard mode look like?
The AIs are obviously fully (or almost fully) automating AI R&D and we’re trying to do control evaluations.