The current SOTA model was released without safety evals

TL;DR: OpenAI released GPT-5.4 Thinking and GPT-5.4 Pro on March 5, 2026. GPT-5.4 Pro is likely the best model in the world for many catastrophic risk-relevant tasks, including biological R&D, orchestrating cyberoffense operations, and computer use. Only GPT-5.4 Thinking has a system card; to the best of our knowledge, Pro has been released without any safety evals. We argue this has occurred at least once before, with GPT-5.2 Pro, and provide recommendations for how a team could conduct fast, independent risk assessments of models post-deployment.
IMPORTANT EDIT: This problem, where Pro models don’t have a system card, has existed since at least o3-pro. Others have noticed this issue before (for o3 and GPT-5). Additionally, Pro “models” are probably just fancy scaffolding that leverages test-time compute on top of the Thinking models. However, we think our recommendations still stand, because:
The best system (e.g., model plus parallel test-time compute) has meaningfully greater capabilities in catastrophic risk-relevant areas than the model alone. In this case, the model developer clearly thinks so, given Pro’s much higher price, its listing as a separate model in the API, and the way it is framed in the model release announcement.
OpenAI has made claims, which are difficult to verify externally, about when it has a (soft) commitment to releasing Pro models’ benchmark scores; the safety community should therefore still look for these things in future releases.
There are many other frontier and open-source labs for which the pre-deployment public evaluations are wholly insufficient to meaningfully assess risk.
We should have been more aggressive about looking into what Pro models actually are, and for others’ previous comments on this topic. See this comment thread for more—thanks to loops who pointed this out :)
GPT-5.4 Pro is really good
OpenAI released both GPT-5.4 Thinking (what people usually mean when they say GPT-5.4) and GPT-5.4 Pro[1], the latter of which is designed for “people who want maximum performance on complex tasks.” GPT-5.4 Pro is extremely expensive, and takes a very long time to complete a task. However, it is likely the best model in the world in several areas, including expert-level Q&A and browser use. Alongside the release announcement, OpenAI presented GPT-5.4 Pro’s performance on a subset of capability benchmarks. Here’s a comparison of benchmark scores across the top three frontier models; we only report scores if they were in all models’ system cards[2]:
| Benchmark | Gemini 3.1 Pro | GPT-5.4 Pro | Opus 4.6 |
|---|---|---|---|
| GPQA Diamond | 94.3% | 94.4% | 91.3% |
| HLE (no tools) | 44.4% | 42.7% | 40.0% |
| HLE (with tools) | 51.4% | 58.7% | 53.1% |
| ARC-AGI-2 (Verified) | 77.1% | 83.3% | 68.8% |
| BrowseComp | 85.9% | 89.3% | 84.0% |
Based on these results, we expect GPT-5.4 Pro to be SOTA on the Virology Capabilities Test, Agentic-Bio Capabilities Benchmark, FrontierMath, and anything else that depends on academic reasoning and a broad knowledge base and that scales nicely with inference compute. Its BrowseComp score and its SOTA result on FinanceAgent v1.1 (61.5%) make us think it’s probably also SOTA at automating office work generally.
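For concreteness, a quick tally of which model tops each row of the table above (numbers copied verbatim from it); GPT-5.4 Pro leads four of the five rows:

```python
# Scores from the comparison table above (percent).
scores = {
    "GPQA Diamond":         {"Gemini 3.1 Pro": 94.3, "GPT-5.4 Pro": 94.4, "Opus 4.6": 91.3},
    "HLE (no tools)":       {"Gemini 3.1 Pro": 44.4, "GPT-5.4 Pro": 42.7, "Opus 4.6": 40.0},
    "HLE (with tools)":     {"Gemini 3.1 Pro": 51.4, "GPT-5.4 Pro": 58.7, "Opus 4.6": 53.1},
    "ARC-AGI-2 (Verified)": {"Gemini 3.1 Pro": 77.1, "GPT-5.4 Pro": 83.3, "Opus 4.6": 68.8},
    "BrowseComp":           {"Gemini 3.1 Pro": 85.9, "GPT-5.4 Pro": 89.3, "Opus 4.6": 84.0},
}

# Which model tops each benchmark, and how many rows each model leads.
leaders = {bench: max(row, key=row.get) for bench, row in scores.items()}
lead_counts = {m: list(leaders.values()).count(m) for m in next(iter(scores.values()))}
```

(The only row Gemini 3.1 Pro wins is HLE without tools; see footnote [2] for why a couple of rows are not perfectly apples-to-apples.)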
The biggest hole in the claim that it’s overall SOTA is agentic coding, but given its SOTA abstract-reasoning score on ARC-AGI-2, we think it’s likely that, given enough compute, it would beat Opus 4.6 and Gemini 3.1 Pro on things like SWE-Bench and Terminal-Bench 2.0.
Yet, it was released without any public safety evals. The system card published alongside the release covers only GPT-5.4 Thinking. It’s possible that GPT-5.4 Pro was tested for safety properties internally (we would hope at least something like Petri was run to check there wasn’t a crazy distribution shift), but we could find no public information confirming this. We would bet significant money that OAI did not run a suite of internal evals at least as comprehensive as those in the GPT-5.4 Thinking model card prior to Pro’s release.
It is highly unlikely that GPT-5.4 Pro poses catastrophic misuse or misalignment risks, although this is largely because of mitigations that come for free with closed-source models from OpenAI (e.g., CBRNE classifiers). However, releasing no external safety evals is a bad precedent and gives researchers a false understanding of current risks posed by frontier models. Additionally, if GPT-5.4 Pro turned out to be much better on dual-use tasks (e.g., EVM-Bench or LAB-Bench), we would have been able to update our timelines to the critical period of risk accordingly.
This has happened once already
The only reason we were tracking this is because I (Parv) accidentally spent $6,000 of Andy’s compute running LAB-Bench against GPT-5.2 Pro instead of GPT-5.2 Thinking[3], and we noticed quite a high uplift.
In fact, GPT-5.2 Pro without tools shows performance comparable to Opus 4.6 with tools on Fig-QA (78.3%). We then noticed that we could not corroborate this result, or indeed any safety-relevant benchmark performance, because GPT-5.2 Pro was also released without a system card.
GPT-5.2 Pro was released on December 11, 2025, and Opus 4.6, the first model that seems to outclass it, was released February 5, 2026. Our median guess here is that we had a model that was SOTA on (at minimum) dual-use biology tasks for (at minimum) two months, released without any safety evals, and which the broader safety community largely ignored (see our edit).
What do we do??
We basically assumed that the top three US labs (OAI, Ant, GDM) would, at minimum, publish something matching the concept of a model card with every SOTA model, which was great because it helped us get a better handle on risk. We now think we were wrong, and we can no longer assume labs will provide any safety-relevant benchmark data for their best models at release. However, this data is still extremely important, especially for tracking jagged-y capabilities like CBRNE uplift and cyberoffense.
At minimum, we recommend that a 1-3 person team at an existing organization:
Set up a “press the easy button” framework that can run a large suite of existing evals as soon as a model is publicly released and generate a public report describing its potential for catastrophic misuse, along with some insight into whether it’s scheming, prosaically misaligned, etc. To start, this might literally just be ABC-Bench, VCT, Petri, and EVM-Bench[4].
Run said framework for every major model release without a substantive system card[5].
We have a list of evaluations we think such a framework should include, and other ideas for how to make this go well—please reach out!
A more ambitious version of this would also create new evals, and include things like interp to lower-bound sandbagging. It would also coordinate with safety researchers, DC policy folks, and interested parties in USG natsec to frame their assessment in a way accessible to them.
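To make the “press the easy button” idea concrete, here is a minimal sketch of what the orchestration layer could look like. The runner names, the 0-to-1 risk-score convention, and the flagging threshold are all our inventions for illustration, not any real harness’s API; in practice each runner would wrap a real harness (ABC-Bench, VCT, Petri, etc.):

```python
import concurrent.futures
from typing import Callable

def assess(model_id: str, runners: dict[str, Callable[[str], float]]) -> str:
    """Run every eval in parallel and render a short markdown report.

    Each runner takes a model ID and returns a risk-relevant score in [0, 1];
    scores at or above 0.5 get flagged for human review (arbitrary threshold).
    """
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, model_id) for name, fn in runners.items()}
        results = {name: f.result() for name, f in futures.items()}
    lines = [f"# Post-deployment risk snapshot: {model_id}", ""]
    for name, score in sorted(results.items()):
        flag = "FLAG" if score >= 0.5 else "ok"
        lines.append(f"- {name}: {score:.0%} ({flag})")
    return "\n".join(lines)
```

The point of keeping the interface this thin is that adding a new benchmark to the suite is just adding one callable, which matters if the team running this is only 1-3 people.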
We are also embarrassed that no one (to our knowledge) has commented on this before (this is only true for post-GPT-5 models; see our edit), and that it took both of us so long to notice. So, how could we have thought this faster?
More people outside labs need to read system cards in full, within days of their release. Set up a reading club, make a really good agent scaffold, find some way of getting the important info into your brain and noticing inconsistencies.
Maybe someone can just build a really good Claude Code skill that does this for new model releases? Seems like a task that should take ~2 hours. Get in touch if you’d like to build this!
Lab safety teams should work through the Frontier Model Forum and other, more informal coordination mechanisms to standardize benchmarks and set a norm of releasing lots of safety benchmark performance data.
Safety teams should prioritize releasing safety benchmark datasets, through Trusted Access Programs if appropriate. This will allow the safety community to directly compare different models’ benchmark scores, and get a better sense of risk outside of just “the new one scored less on this sabotage eval.”
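As a toy first pass at the system-card-reading skill suggested above: scan a release’s text for a watchlist of safety-relevant benchmarks and report which ones never appear. The watchlist contents here are our own assumption about what is worth checking for:

```python
import re

# Hypothetical watchlist of safety-relevant benchmarks a system card should mention.
SAFETY_BENCHMARKS = ["VCT", "LAB-Bench", "EVM-Bench", "Petri"]

def missing_safety_evals(card_text: str) -> list[str]:
    """Return the watchlist benchmarks that a system card never mentions."""
    return [
        b for b in SAFETY_BENCHMARKS
        if not re.search(re.escape(b), card_text, re.IGNORECASE)
    ]
```

A real version would also extract the reported scores and diff them against the previous model’s card, but even this crude absence check would have caught the GPT-5.2 Pro and GPT-5.4 Pro releases described in this post.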
In the absence of comprehensive and informative safety evaluations of frontier models from labs themselves, we hope the community can fill this gap while also pushing labs to be more transparent[6].
[1] One question we don’t answer here is “what exactly is Pro???” Is it a different model, a weird scaffold, finetuning on Thinking, or something else? We don’t have great answers here; we would love to learn more. See our edit.
[2] Claude has some important notes: “A few caveats worth flagging: the HLE ‘with tools’ rows use different harnesses (Gemini uses search-blocklist + code; OpenAI’s harness isn’t specified the same way), so that row is somewhat apples-to-oranges. BrowseComp similarly — Gemini specifies ‘Search + Python + Browse’ while OpenAI’s tooling setup isn’t detailed identically. GPQA Diamond is essentially a tie at 94.3 vs 94.4.”
[3] This was during a forthcoming safety evaluation of Kimi K2.5 with Yong, aligned with the kind of work we propose above.
[4] The main blocker here is cost, but we think funders would be interested in throwing compute at this; we have seen preliminary interest from many stakeholders in the community, in both policy and technical circles.
[5] This would also be extremely useful for getting a better handle on risk from Chinese open-source models.

Huge thanks to everyone on the Kimi K2.5 eval team, without whom we would never have run into this. We also thank Claude Opus 4.6, who accidentally ran Pro instead of Thinking on LAB-Bench and burnt $6k for what ended up being a good cause. We promise we are competent researchers, and have learnt our lesson.