An interesting part of OpenAI's new version of the Preparedness Framework:
I think this stuff is mostly a red herring: the safety standards in OpenAI's new PF are super vague, so OpenAI will presumably always be able to say it meets them and will never have to use this.[1]
But if this ever matters, I think it’s good: it means OpenAI is more likely to make such a public statement and is slightly less incentivized to deceive employees + external observers about capabilities and safeguard adequacy. OpenAI unilaterally pausing is not on the table; if safeguards are inadequate, I’d rather OpenAI say so.
I think my main PF complaints are:
- The High safeguards standard is super vague: it's basically just "safeguards should sufficiently minimize the risk of severe harm," and the level of evidence required for the "potential safeguard efficacy assessments" is totally unspecified.
- Some of the misalignment safeguards are confused/bad, and this is bad since per the PF they may be disjunctive: a plan can rest on a single safeguard, so being wrong about a single "safeguard efficacy assessment" can make the whole plan invalid.
- Misalignment safeguards are only clearly triggered by cyber capabilities, which is especially bad since the cyber High threshold is vague / too high.
For more see OpenAI rewrote its Preparedness Framework.
[1] Unrelated to vagueness, they can also just change the framework again at any time.
This seems to have been foreshadowed by this tweet in February:
https://x.com/ChrisPainterYup/status/1886691559023767897
Would be good to keep track of this change.