Zach Stein-Perlman(Zachary Stein-Perlman)

Karma: 3,902

AI strategy & governance. ailabwatch.org. Looking for new projects.

Zach Stein-Perlman 18 May 2024 21:30 UTC
9 points
0
on: Ilya Sutskever and Jan Leike resign from OpenAI
Edit: nevermind; seems like this tweet is narrow and just about vested equity; I’m not sure what that means in the context of OpenAI’s pseudo-equity but this tweet isn’t a big deal.
@gwern I’m interested in your take on this new Altman tweet:
we have never clawed back anyone’s vested equity, nor will we do that if people do not sign a separation agreement (or don’t agree to a non-disparagement agreement). vested equity is vested equity, full stop.
there was a provision about potential equity cancellation in our previous exit docs; although we never clawed anything back, it should never have been something we had in any documents or communication. this is on me and one of the few times i’ve been genuinely embarrassed running openai; i did not know this was happening and i should have.
the team was already in the process of fixing the standard exit paperwork over the past month or so. if any former employee who signed one of those old agreements is worried about it, they can contact me and we’ll fix that too. very sorry about this.
In particular “i did not know this was happening”

Zach Stein-Perlman 18 May 2024 17:14 UTC
6 points
2
in reply to: Akash’s comment on: DeepMind’s “Frontier Safety Framework” is weak and unambitious
Sorry for brevity.
We just disagree. E.g. you “walked away with a much better understanding of how OpenAI plans to evaluate & handle risks than how Anthropic plans to handle & evaluate risks”; I felt like Anthropic was thinking about most stuff better.
I think Anthropic’s ASL-3 is reasonable and OpenAI’s thresholds and corresponding commitments are unreasonable. If the ASL-4 threshold was high or commitments are poor such that ASL-4 was meaningless, I agree Anthropic’s RSP would be at least as bad as OpenAI’s.
One thing I think is a big deal: Anthropic’s RSP treats internal deployment like external deployment; OpenAI’s has almost no protections for internal deployment.
I agree “an initial RSP that mostly spells out high-level reasoning, makes few hard commitments, and focuses on misuse while missing the all-important evals and safety practices for ASL-4” is also a fine characterization of Anthropic’s current RSP.
Quick edit: PF thresholds are too high; PF seems doomed / not on track. But RSPv1 is consistent with RSPv1.1 being great. At least Anthropic knows and says there’s a big hole. That’s not super relevant to evaluating labs’ current commitments but is very relevant to predicting.

Zach Stein-Perlman 18 May 2024 17:00 UTC
17 points
6
in reply to: Akash’s comment on: Akash’s Shortform
Sorry for brevity, I’m busy right now.
1. Noticing good stuff labs do, not just criticizing them, is often helpful. I wish you thought of this work more as “evaluation” than “criticism.”
2. It’s often important for evaluation to be quite truth-tracking. Criticism isn’t obviously good by default.
Edit:
3. I’m pretty sure OP likes good criticism of the labs; no comment on how OP is perceived. And I think I don’t understand your “good judgment” point. Feedback I’ve gotten on AI Lab Watch from senior AI safety people has been overwhelmingly positive, and of course there’s a selection effect in what I hear, but I’m quite sure most of them support such efforts.
4. Conjecture (not exclusively) has done things that frustrated me, including in dimensions like being “‘unilateralist,’ ‘not serious,’ and ‘untrustworthy.’” I think most criticism of Conjecture-related advocacy is legitimate and not just because people are opposed to criticizing labs.
5. I do agree on “soft power” and some of “jobs.” People often don’t criticize the labs publicly because they’re worried about negative effects on them, their org, or people associated with them.

Zach Stein-Perlman 18 May 2024 5:35 UTC
11 points
3
in reply to: habryka’s comment on: DeepMind’s “Frontier Safety Framework” is weak and unambitious
Yep
Two weeks ago I sent a senior DeepMind staff member some “Advice on RSPs, especially for avoiding ambiguities”; #1 on my list was “Clarify how your deployment commitments relate to internal deployment, not just external deployment” (since it’s easy and the OpenAI PF also did a bad job of this)
:(

DeepMind’s “Frontier Safety Framework” is weak and unambitious

Zach Stein-Perlman18 May 2024 3:00 UTC

114 points

9 comments4 min readLW link

DeepMind: Frontier Safety Framework

Zach Stein-Perlman17 May 2024 17:30 UTC

60 points

0 comments3 min readLW link

(deepmind.google)

Zach Stein-Perlman 17 May 2024 17:19 UTC
74 points
14
on: Ilya Sutskever and Jan Leike resign from OpenAI
Added updates to the post. Updating it as stuff happens. Not paying much attention; feel free to DM me or comment with more stuff.

Zach Stein-Perlman 15 May 2024 7:00 UTC
75 points
54
in reply to: Linch’s comment on: Ilya Sutskever and Jan Leike resign from OpenAI
The commitment—”20% of the compute we’ve secured to date” (in July 2023), to be used “over the next four years”—may be quite little in 2027, with compute use increasing exponentially. I’m confused about why people think it’s a big commitment.

Zach Stein-Perlman 15 May 2024 5:40 UTC
16 points
0
in reply to: Sodium’s comment on: Ilya Sutskever and Jan Leike resign from OpenAI
Two other executives left two weeks ago, but that’s not obviously safety-related.

Ilya Sutskever and Jan Leike resign from OpenAI [updated]

Zach Stein-Perlman15 May 2024 0:45 UTC

227 points

81 comments2 min readLW link

Zach Stein-Perlman 14 May 2024 5:50 UTC
13 points
9
in reply to: simeon_c’s comment on: OpenAI releases GPT-4o, natively interfacing with text, voice and vision
Full quote:
We’ve evaluated GPT-4o according to our Preparedness Framework and in line with our voluntary commitments. Our evaluations of cybersecurity, CBRN, persuasion, and model autonomy show that GPT-4o does not score above Medium risk in any of these categories. This assessment involved running a suite of automated and human evaluations throughout the model training process. We tested both pre-safety-mitigation and post-safety-mitigation versions of the model, using custom fine-tuning and prompts, to better elicit model capabilities.
GPT-4o has also undergone extensive external red teaming with 70+ external experts in domains such as social psychology, bias and fairness, and misinformation to identify risks that are introduced or amplified by the newly added modalities. We used these learnings to build out our safety interventions in order to improve the safety of interacting with GPT-4o. We will continue to mitigate new risks as they’re discovered.
[Edit after Simeon replied: I disagree with your interpretation that they’re being intentionally very deceptive. But I am annoyed by (1) them saying “We’ve evaluated GPT-4o according to our Preparedness Framework” when the PF doesn’t contain specific evals and (2) them taking credit for implementing their PF when they’re not meeting its commitments.]

Zach Stein-Perlman 13 May 2024 23:10 UTC
2 points
0
on: Zach Stein-Perlman’s Shortform
How can you make the case that a model is safe to deploy? For now, you can do risk assessment and notice that it doesn’t have dangerous capabilities. What about in the future, when models do have dangerous capabilities? Here are four options:
1. Implement safety measures as a function of risk assessment results, such that the measures feel like they should be sufficient to abate the risks
  1. This is mostly what Anthropic’s RSP does (at least so far — maybe it’ll change when they define ASL-4)
2. Use risk assessment techniques that evaluate safety given deployment safety practices
  1. This is mostly what OpenAI’s PF is supposed to do (measure “post-mitigation risk”), but the details of their evaluations and mitigations are very unclear
3. Do control evaluations
4. Achieve alignment (and get strong evidence of that)
Related: RSPs, safety cases.
Maybe lots of risk comes from the lab using AIs internally to do AI development. The first two options are fine for preventing catastrophic misuse from external deployment but I worry they struggle to measure risks related to scheming and internal deployment.

Zach Stein-Perlman 13 May 2024 19:41 UTC
73 points
61
on: OpenAI releases GPT-4o, natively interfacing with text, voice and vision
Safety-wise, they claim to have run it through their Preparedness framework and the red-team of external experts.
I’m disappointed and I think they shouldn’t get much credit PF-wise: they haven’t published their evals, published a report on results, or even published a high-level “scorecard.” They are not yet meeting the commitments in their beta Preparedness Framework — some stuff is unclear but at the least publishing the scorecard is an explicit commitment.
(It’s now been six months since they published the beta PF!)
[Edit: not to say that we should feel much better if OpenAI was successfully implementing its PF—the thresholds are way too high and it says nothing about internal deployment.]
What links here?
- DeepMind’s “Frontier Safety Framework” is weak and unambitious by Zach Stein-Perlman (18 May 2024 3:00 UTC; 114 points)
- DeepMind’s “Frontier Safety Framework” is weak and unambitious by Zach Stein-Perlman (EA Forum; 18 May 2024 3:00 UTC; 27 points)

Zach Stein-Perlman 12 May 2024 7:47 UTC
2 points
0
in reply to: Dan H’s comment on: Introducing AI Lab Watch
There should be points for how the organizations act wrt to legislation. In the SB 1047 bill that CAIS co-sponsored, we’ve noticed some AI companies to be much more antagonistic than others. I think [this] is probably a larger differentiator for an organization’s goodness or badness.
If there’s a good writeup on labs’ policy advocacy I’ll link to and maybe defer to it.

Zach Stein-Perlman 12 May 2024 2:34 UTC
18 points
10
in reply to: RobertM’s comment on: RobertM’s Shortform
Adding to the confusion: I’ve nonpublicly heard from people at UKAISI and [OpenAI or Anthropic] that the Politico piece is very wrong and DeepMind isn’t the only lab doing pre-deployment sharing (and that it’s hard to say more because info about not-yet-deployed models is secret). But no clarification on commitments.

Zach Stein-Perlman 11 May 2024 5:41 UTC
12 points
13
in reply to: Ben Pace’s comment on: simeon_c’s Shortform
But everyone has lots of duties to keep secrets or preserve privacy and the ones put in writing often aren’t the most important. (E.g. in your case.)
I’ve signed ~3 NDAs. Most of them are irrelevant now and useless for people to know about, like yours.
I agree in special cases it would be good to flag such things — like agreements to not share your opinions on a person/org/topic, rather than just keeping trade secrets private.

Zach Stein-Perlman 11 May 2024 5:35 UTC
6 points
2
in reply to: ryan_greenblatt’s comment on: Introducing AI Lab Watch
Related: maybe a lab should get full points for a risky release if the lab says it’s releasing because the benefits of [informing / scaring / waking-up] people outweigh the direct risk of existential catastrophe and other downsides. It’s conceivable that a perfectly responsible lab would do such a thing.
Capturing all nuances can trade off against simplicity and legibility. (But my criteria are not yet on the efficient frontier or whatever.)

Zach Stein-Perlman 10 May 2024 4:52 UTC
2 points
0
in reply to: ryan_greenblatt’s comment on: Introducing AI Lab Watch
Thanks. I agree you’re pointing at something flawed in the current version and generally thorny. Strong-upvoted and strong-agreevoted.
Generally, the deployment criteria should be gated behind “has a plan to do this when models are actually powerful and their implementation of the plan is credible”.
I didn’t put much effort into clarifying this kind of thing because it’s currently moot—I don’t think it would change any lab’s score—but I agree.^[1] I think e.g. a criterion “use KYC” should technically be replaced with “use KYC OR say/demonstrate that you’re prepared to implement KYC and have some capability/risk threshold to implement it and [that threshold isn’t too high].”
Don’t pass cost benefit for current models which pose low risk. (And it seems the criteria is “do you have them implemented right now?) . . . .
(A general problem with this project is somewhat arbitrarily requiring specific countermeasures. I think this is probably intrinsic to the approach I’m afraid.)
Yeah. The criteria can be like “implement them or demonstrate that you could implement them and have a good plan to do so,” but it would sometimes be reasonable for the lab to not have done this yet. (Especially for non-frontier labs; the deployment criteria mostly don’t work well for evaluating non-frontier labs. Also if demonstrating that you could implement something is difficult, even if you could implement it.)
I get the sense that this criteria doesn’t quite handle the necessarily edge cases to handle reasonable choices orgs might make.
I’m interested in suggestions :shrug:
1. ^
  And I think my site says some things that contradict this principle, like ‘these criteria require keeping weights private.’ Oops.

Zach Stein-Perlman 9 May 2024 2:30 UTC
6 points
4
on: Introducing AI Lab Watch
Two noncentral pages I like on the site:
- Other scorecards & evaluation, collecting other safety-ish scorecard-ish resources.
- Commitments, collecting AI companies’ commitments relevant to AI safety and extreme risks.

Zach Stein-Perlman 8 May 2024 23:00 UTC
8 points
8
on: Questions for labs
Yay @Zac Hatfield-Dodds of Anthropic for feedback and corrections including clarifying a couple of Anthropic’s policies. Two pieces of not-previously-public information:
- I was disappointed that Anthropic’s Responsible Scaling Policy only mentions evaluation “During model training and fine-tuning.” Zac told me “this was a simple drafting error—our every-three months evaluation commitment is intended to continue during deployment. This has been clarified for the next version, and we’ve been acting accordingly all along.” Yay.
- I said labs should have a “process for staff to escalate concerns about safety” and “have a process for staff and external stakeholders to share concerns about risk assessment policies or their implementation with the board and some other staff, including anonymously.” I noted that Anthropic’s RSP includes a commitment to “Implement a non-compliance reporting policy.” Zac told me “Beyond standard internal communications channels, our recently formalized non-compliance reporting policy meets these criteria [including independence], and will be described in the forthcoming RSP v1.1.” Yay.
I think it’s cool that Zac replied (but most of my questions for Anthropic remain).
I have not yet received substantive corrections/clarifications from any other labs.
(I have made some updates to ailabwatch.org based on Zac’s feedback—and revised Anthropic’s score from 45 to 48—but have not resolved all of it.)

Zach Stein-Perlman(Zachary Stein-Perlman)

Deep­Mind’s “​​Fron­tier Safety Frame­work” is weak and unambitious

Deep­Mind: Fron­tier Safety Framework

Ilya Sutskever and Jan Leike re­sign from OpenAI [up­dated]

DeepMind’s “Frontier Safety Framework” is weak and unambitious

DeepMind: Frontier Safety Framework

Ilya Sutskever and Jan Leike resign from OpenAI [updated]