Safety-wise, they claim to have run it through their Preparedness Framework and a red team of external experts.
I’m disappointed, and I think they shouldn’t get much credit PF-wise: they haven’t published their evals, a report on results, or even a high-level “scorecard.” They are not yet meeting the commitments in their beta Preparedness Framework — some stuff is unclear, but at the least, publishing the scorecard is an explicit commitment.
(Also: it’s now been six months since they published the beta Preparedness Framework!)
[Edit: not to say that we should feel much better if OpenAI were successfully implementing its PF—the thresholds are way too high and it says nothing about internal deployment.]
There should be points for how the organizations act with respect to legislation. On SB 1047, the bill that CAIS co-sponsored, we’ve noticed that some AI companies are much more antagonistic than others. I think [this] is probably a larger differentiator for an organization’s goodness or badness.
If there’s a good writeup on labs’ policy advocacy I’ll link to and maybe defer to it.
Adding to the confusion: I’ve privately heard from people at UKAISI and [OpenAI or Anthropic] that the Politico piece is very wrong and DeepMind isn’t the only lab doing pre-deployment sharing (and that it’s hard to say more because info about not-yet-deployed models is secret). But no clarification on commitments.
But everyone has lots of duties to keep secrets or preserve privacy, and the ones put in writing often aren’t the most important. (E.g. in your case.)
I’ve signed ~3 NDAs. Most of them are irrelevant now and useless for people to know about, like yours.
I agree in special cases it would be good to flag such things — like agreements to not share your opinions on a person/org/topic, rather than just keeping trade secrets private.
Related: maybe a lab should get full points for a risky release if the lab says it’s releasing because the benefits of [informing / scaring / waking up] people outweigh the direct risk of existential catastrophe and other downsides. It’s conceivable that a perfectly responsible lab would do such a thing.
Capturing all nuances can trade off against simplicity and legibility. (But my criteria are not yet on the efficient frontier or whatever.)
Thanks. I agree you’re pointing at something flawed in the current version and generally thorny. Strong-upvoted and strong-agreevoted.
Generally, the deployment criteria should be gated behind “has a plan to do this when models are actually powerful and their implementation of the plan is credible”.
I didn’t put much effort into clarifying this kind of thing because it’s currently moot—I don’t think it would change any lab’s score—but I agree.[1] I think e.g. a criterion “use KYC” should technically be replaced with “use KYC OR say/demonstrate that you’re prepared to implement KYC and have some capability/risk threshold to implement it and [that threshold isn’t too high].”
Don’t pass cost-benefit for current models, which pose low risk. (And it seems the criterion is “do you have them implemented right now?”) . . . .
(A general problem with this project is somewhat arbitrarily requiring specific countermeasures. I think this is probably intrinsic to the approach, I’m afraid.)
Yeah. The criteria can be like “implement them or demonstrate that you could implement them and have a good plan to do so,” but it would sometimes be reasonable for the lab to not have done this yet. (Especially for non-frontier labs; the deployment criteria mostly don’t work well for evaluating non-frontier labs. Also if demonstrating that you could implement something is difficult, even if you could implement it.)
I get the sense that these criteria don’t quite handle the edge cases needed to accommodate reasonable choices orgs might make.
I’m interested in suggestions :shrug:
1. ^ And I think my site says some things that contradict this principle, like ‘these criteria require keeping weights private.’ Oops.
Two noncentral pages I like on the site:
- Other scorecards & evaluation, collecting other safety-ish scorecard-ish resources.
- Commitments, collecting AI companies’ commitments relevant to AI safety and extreme risks.
Yay @Zac Hatfield-Dodds of Anthropic for feedback and corrections including clarifying a couple of Anthropic’s policies. Two pieces of not-previously-public information:
- I was disappointed that Anthropic’s Responsible Scaling Policy only mentions evaluation “During model training and fine-tuning.” Zac told me “this was a simple drafting error—our every-three months evaluation commitment is intended to continue during deployment. This has been clarified for the next version, and we’ve been acting accordingly all along.” Yay.
- I said labs should have a “process for staff to escalate concerns about safety” and “have a process for staff and external stakeholders to share concerns about risk assessment policies or their implementation with the board and some other staff, including anonymously.” I noted that Anthropic’s RSP includes a commitment to “Implement a non-compliance reporting policy.” Zac told me “Beyond standard internal communications channels, our recently formalized non-compliance reporting policy meets these criteria [including independence], and will be described in the forthcoming RSP v1.1.” Yay.
I think it’s cool that Zac replied (but most of my questions for Anthropic remain).
I have not yet received substantive corrections/clarifications from any other labs.
(I have made some updates to ailabwatch.org based on Zac’s feedback—and revised Anthropic’s score from 45 to 48—but have not resolved all of it.)
I mostly agree. And I think when people say race dynamics they often actually mean speed of progress and especially “Effects of open models on ease of training closed models [and open models],” which you mention.
But here is a race-dynamics story:
Alice has the best open model. She prefers for AI progress to slow down but also prefers to have the best open model (for reasons of prestige or, if different companies’ models are not interchangeable, future market share). Bob releases a great open model. This incentivizes Alice to release a new state-of-the-art model sooner.
Yep, lots of people independently complain about “lab.” Some of those people want me to use scary words in other places too, like replacing “diffusion” with “proliferation.” I wouldn’t do that, and don’t replace “lab” with “mega-corp” or “juggernaut,” because it seems [incorrect / misleading / low-integrity].
I’m sympathetic to the complaint that “lab” is misleading. (And I do use “company” rather than “lab” occasionally, e.g. in the header.) My friends usually talk about “the labs,” not “the companies,” but to most audiences “company” is more accurate.
I currently think “company” is about as good as “lab.” I may change the term throughout the site at some point.
This kind of feedback is very helpful to me; thank you! Strong-upvoted and weak-agreevoted.
(I have some factual disagreements. I may edit them into this comment later.)
(If Dan’s comment makes you suspect this project is full of issues/mistakes, react 💬 and I’ll consider writing a detailed soldier-ish reply.)
Thanks. Briefly:
I’m not sure what the theory of change for listing such questions is.
In the context of policy advocacy, I think it’s sometimes fine/good for labs to say somewhat different things publicly vs privately. Like, if I were in charge of a lab and believed (1) the EU AI Act will almost certainly pass and (2) it has some major bugs that make my life harder without safety benefits, I might publicly say “I support (the goals of) the EU AI Act” and privately put some effort into removing those bugs, which is technically lobbying to weaken the Act.
(^I’m not claiming that particular labs did ~this rather than actually lobby against the Act. I just think it’s messy and regulation isn’t a one-dimensional thing that you’re for or against.)
Edit: this comment was misleading and partially replied to a strawman. I agree it would be good for the labs and their leaders to publicly say some things about recommended regulation (beyond what they already do) and their lobbying. I’m nervous about trying to litigate rumors for reasons I haven’t explained.
Edit 2: based on https://corporateeurope.org/en/2023/11/byte-byte, https://time.com/6288245/openai-eu-lobbying-ai-act/, and background information, I believe that OpenAI, Microsoft, Google, and Meta privately lobbied to make the EU AI Act worse—especially by lobbying against rules for foundation models—and that this is inconsistent with OpenAI’s and Altman’s public statements.
This post is not trying to shame labs for failing to answer before; I didn’t try hard to get them to answer. (The period was one week but I wasn’t expecting answers to my email / wouldn’t expect to receive a reply even if I waited longer.)
(Separately, I kinda hope the answers to basic questions like this are already written down somewhere...)
Questions for labs
Some overall scores are one point higher, probably because my site rounds down. My site should probably round to the nearest integer...
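For illustration, here’s the difference (a minimal Python sketch with a made-up raw score; not the site’s actual code):

```python
import math

raw_score = 47.6  # hypothetical weighted-average score out of 100

print(math.floor(raw_score))  # 47: rounding down (current behavior)
print(round(raw_score))       # 48: rounding to the nearest integer
```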
Thanks for the feedback. I’ll add “let people download all the data” to my todo list but likely won’t get to it. I’ll make a simple Google Sheet now.
Introducing AI Lab Watch
This is too strong. For example, releasing the product would be correct if someone else would do something similar soon anyway and you’re safer than them and releasing first lets you capture more of the free energy. (That’s not the case here, but it’s not as straightforward as you suggest, especially with your “Regardless of how good their alignment plans are” and your claim “There’s just no good reason to do that, except short-term greed”.)
Constellation (which I think has some important FHI-like virtues, although it makes different tradeoffs and misses on others)
What is Constellation missing or what should it do? (Especially if you haven’t already told the Constellation team this.)
How can you make the case that a model is safe to deploy? For now, you can do risk assessment and notice that it doesn’t have dangerous capabilities. What about in the future, when models do have dangerous capabilities? Here are four options:
1. Implement safety measures as a function of risk assessment results, such that the measures feel like they should be sufficient to abate the risks
   - This is mostly what Anthropic’s RSP does (at least so far — maybe it’ll change when they define ASL-4)
2. Use risk assessment techniques that evaluate safety given deployment safety practices
   - This is mostly what OpenAI’s PF is supposed to do (measure “post-mitigation risk”), but the details of their evaluations and mitigations are very unclear
3. Do control evaluations
4. Achieve alignment (and get strong evidence of that)
Related: RSPs, safety cases.
Maybe lots of risk comes from the lab using AIs internally to do AI development. The first two options are fine for preventing catastrophic misuse from external deployment but I worry they struggle to measure risks related to scheming and internal deployment.