AI strategy & governance. Blog: Not Optional.
Zach Stein-Perlman (Zachary Stein-Perlman)
You left out the word “meaningfully,” which is quite important.
I think deploying Claude 3 was fine and most AI safety people are confused about the effects of deploying frontier-ish models. I haven’t seen anyone else articulate my position recently so I’d probably be down to dialogue. Or maybe I should start by writing a post.
[Edit: this comment got lots of agree-votes and “Deploying Claude 3 increased AI risk” got lots of disagreement so maybe actually everyone agrees it was fine.]
Deploying Claude 3 increased AI risk.
AGI is defined as “a highly autonomous system that outperforms humans at most economically valuable work.” We can hope, but it’s definitely not clear that AGI comes before existentially-dangerous-AI.
New misc remark:
It’s not clear how the PF interacts with sharing models with Microsoft (or others). In particular, if OpenAI is required to share its models with Microsoft and Microsoft can just deploy them, even a great PF wouldn’t stop dangerous models from being deployed. See OpenAI-Microsoft partnership.
Good work.
One plausibly-important factor I wish you’d tracked: whether the company offers bug bounties.
[Edit: didn’t mean to suggest David’s post is redundant.]
I agree. But I claim saying “I can’t talk about the game itself, as that’s forbidden by the rules” is like saying “I won’t talk about the game itself because I decided not to”: the underlying reason is unclear.
Unfortunately, I can’t talk about the game itself, as that’s forbidden by the rules.
You two can just change the rules… I’m confused by this rule.
The control-y plan I’m excited about doesn’t feel to me like “squeeze useful work out of clearly misaligned models.” It’s more like “use scaffolding/oversight to make using a model safer, and get decent arguments that using the model (in certain constrained ways) is pretty safe even if it’s scheming.” Then if you ever catch the model scheming, do some combination of (1) stop using it, (2) produce legible evidence of scheming, and (3) do few-shot catastrophe prevention. But I defer to Ryan here.
New misc remark:
OpenAI’s commitments about deployment seem to just refer to external deployment, unfortunately.
This isn’t explicit, but they say “Deployment in this case refers to the spectrum of ways of releasing a technology for external impact.”
This contrasts with Anthropic’s RSP, in which “deployment” includes internal use.
Good post.
Added to the post:
Edit, one day later: the structure seems good, but I’m very concerned that the thresholds for High and Critical risk in each category are way too high, such that e.g. a system could very plausibly kill everyone without reaching Critical in any category. See pp. 8–11. If so, that’s a fatal flaw for a framework like this. I’m interested in counterarguments; for now, praise mostly retracted; oops. I still prefer this to no RSP-y-thing, but I was expecting something stronger from OpenAI. I really hope they lower thresholds for the finished version of this framework.
Any mention of what the “mitigations” in question would be?
Not really. For now, OpenAI mostly mentions restricting deployment (this section is pretty disappointing):
A central part of meeting our safety baselines is implementing mitigations to address various types of model risk. Our mitigation strategy will involve both containment measures, which help reduce risks related to possession of a frontier model, as well as deployment mitigations, which help reduce risks from active use of a frontier model. As a result, these mitigations might span increasing compartmentalization, restricting deployment to trusted users, implementing refusals, redacting training data, or alerting distribution partners.
I predict that if you read the doc carefully, you’d say “probably net-harmful relative to just not pretending to have a safety plan in the first place.”
Some personal takes in response:
Yeah, largely the letter of the law isn’t sufficient.
Some evals are hard to goodhart. E.g. “can red-teamers demonstrate problems (given our mitigations)” is pretty robust — if red-teamers can’t demonstrate problems, that’s good evidence of safety (for deployment with those mitigations), even if the mitigations feel jury-rigged.
Yeah, this is intended to be complemented by superalignment.
(I edited my comment to add the market, sorry for confusion.)
(Separately, a market like you describe might still be worth making.)
(The label “RSP” isn’t perfect but it’s kinda established now. My friends all call things like this “RSPs.” And anyway I don’t think “PFs” should become canonical instead. I predict change in terminology will happen ~iff it’s attempted by METR or multiple frontier labs together. For now, I claim we should debate terminology occasionally but follow standard usage when trying to actually communicate.)
Nice. Another possible market topic: the mix of Low/Medium/High/Critical on their Scorecard, when they launch it or on 1 Jan 2025 or something. Hard to operationalize because we don’t know how many categories there will be, and we care about both pre-mitigation and post-mitigation scores.
Made a simple market:
Cool, changing my vote from “no” to “yes.”