Thanks for this! I was totally unaware of this quote. Also, from the GPT-5 system card:
Since gpt-5-thinking-pro is gpt-5-thinking using a setting that makes use of parallel test time compute, we have determined that the results from our safety evaluations on gpt-5-thinking are strong proxies, and therefore we did not rerun these evaluations in the parallel test time compute setting.
Response from Miles Brundage for the o3-pro lack of card:
“The whole point of the term system card is that the model isn’t the only thing that matters. If they didn’t do a full Preparedness Framework assessment, e.g. because the evals weren’t too different and they didn’t consider it a good use of time given other coming launches, they should just say that… lax processes/corner-cutting/groupthink get more dangerous each day.”
Response from Zvi for the o3-pro lack of card:
“But the framework is full of ‘here are the test results’ and presumably those results are different now. I want o3-pro on those charts.”
So, this has been thought about before! We’re sorry for not noticing this earlier and for not searching harder.
However, in the GPT-5 card OAI says “Because parallel test time compute can further increase performance on some evaluations and because gpt-5-thinking is near the High threshold in this capability domain, we also chose to measure gpt-5-thinking-pro’s performance on our biological evaluations.” We have no way of verifying whether they should’ve done the same here (and importantly, we don’t know if they even did this internally!). For this reason, we think our recommendations stand.
It’s probably incorrect to say the “SOTA model,” but we can say the “SOTA system”, or something? (It’s unclear whether this distinction even matters for catastrophic misuse risk, which is what we’re primarily concerned about for now.)
EDIT: I’ve now edited the blogpost. Thank you again :)))
IMO, the threat (and thus the thing to measure) is the system, so it doesn’t much matter what a badly configured model can do on its own. I’m with you here.