My impression is that the “pro” models use the same weights as the underlying non-pro model (here, gpt-5.4-thinking) but with scaffolding on top that generates multiple reasoning traces and selects the best one. I think OpenAI’s view is that if the underlying model is safe to deploy, anything that’s just scaffolding on top of it must also be safe, because the safety checks on the underlying model should ensure it’s safe to deploy even under malicious scaffolding.
With o3-pro, OpenAI said:
As o3-pro uses the same underlying model as o3, full safety details can be found in the o3 system card.
They haven’t explicitly said the same for later pro models but various documentation about those pro models implies it.
Even though you can’t recreate the exact scaffolding OpenAI uses yourself (because the API doesn’t expose reasoning traces), you can get kinda close by querying the underlying non-pro model a bunch of times and asking a model to choose the best response[1]. It would probably be worth comparing gpt-5.4-thinking with that custom scaffold to plain gpt-5.4-thinking.
You would also want the underlying model to include a summary of its reasoning in the output, so that the model choosing the best answer can judge which answer had the best reasoning.
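A minimal sketch of this kind of best-of-n scaffold, assuming a generic `generate` callable for the underlying model and a `judge` callable for the selector (both stubbed here for illustration; in practice they would be API calls, and this is not OpenAI’s actual setup):

```python
def best_of_n(generate, judge, prompt, n=4):
    """Query the underlying model n times, then ask a judge model
    to pick the best candidate (a rough approximation of a 'pro' scaffold)."""
    # Each candidate is expected to include a summary of its reasoning,
    # so the judge can weigh the reasoning and not just the final answer.
    candidates = [generate(prompt) for _ in range(n)]
    numbered = "\n\n".join(
        f"Response {i}:\n{c}" for i, c in enumerate(candidates)
    )
    judge_prompt = (
        f"Question:\n{prompt}\n\nCandidate responses:\n{numbered}\n\n"
        "Reply with only the number of the best response."
    )
    choice = int(judge(judge_prompt))
    return candidates[choice]

def make_fake_generator():
    # Deterministic stand-in for repeated calls to the underlying non-pro model.
    replies = iter(f"Reasoning summary {i}; answer: 42" for i in range(4))
    return lambda prompt: next(replies)

def fake_judge(judge_prompt):
    # A real judge would be another model call; here we hard-code a pick.
    return "2"

print(best_of_n(make_fake_generator(), fake_judge, "What is 6 * 7?", n=4))
# → Reasoning summary 2; answer: 42
```

Without access to OpenAI’s internal scaffolding this only approximates the pro setup, but it should be enough to measure whether best-of-n selection meaningfully changes eval results relative to single-sample queries.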
Thanks for this! I was totally unaware of this quote. Also, from the GPT-5 system card:
Since gpt-5-thinking-pro is gpt-5-thinking using a setting that makes use of parallel test time compute, we have determined that the results from our safety evaluations on gpt-5-thinking are strong proxies, and therefore we did not rerun these evaluations in the parallel test time compute setting.
Response from Miles Brundage for the o3-pro lack of card:
“The whole point of the term system card is that the model isn’t the only thing that matters. If they didn’t do a full Preparedness Framework assessment, e.g. because the evals weren’t too different and they didn’t consider it a good use of time given other coming launches, they should just say that… lax processes/corner-cutting/groupthink get more dangerous each day.”
Response from Zvi for the o3-pro lack of card:
“But the framework is full of ‘here are the test results’ and presumably those results are different now. I want o3-pro on those charts.”
So, this has been thought about before! We’re sorry for not noticing and searching harder.
However, in the GPT-5 card OAI says “Because parallel test time compute can further increase performance on some evaluations and because gpt-5-thinking is near the High threshold in this capability domain, we also chose to measure gpt-5-thinking-pro’s performance on our biological evaluations.” We have no way of verifying whether they should’ve done the same here (and importantly, we don’t know if they even did this internally!). For this reason, we think our recommendations stand.
It’s probably incorrect to say the “SOTA model,” but we can say the “SOTA system”, or something? (It’s unclear whether this distinction even matters for catastrophic misuse risk, which is what we’re primarily concerned about for now.)
EDIT: I’ve now edited the blogpost. Thank you again :)))
IMO, the threat/thing to measure is the system, so it doesn’t much matter what a badly run model can do. I’m with you here.