Zach Stein-Perlman
AI strategy & governance. ailabwatch.org. ailabwatch.substack.com.
I guess I’ll write non-frontpage-y quick takes as posts instead then :(
I’d like to be able to see such quick takes on the homepage, like how I can see personal blogposts on the homepage (even though logged-out users can’t).
Are you hiding them from everyone? Can I opt into seeing them?
I failed to find a way to import to Slack without doing it one by one.
Bores knows, at least for people who donate via certain links. For example, the link in this post is https://secure.actblue.com/donate/boresai?refcode=lw rather than https://secure.actblue.com/donate/boresweb.
I’m annoyed that Tegmark and others don’t seem to understand my position: you should try for great global coordination but also invest in safety in more rushed worlds, and a relatively responsible developer shouldn’t unilaterally stop.
(I’m also annoyed by this post’s framing for reasons similar to Ray.)
Part of that is thinking about donation opportunities, like Bores. Hopefully I’ll have more to say publicly at some point!
Recently I’ve been spending much less than half of my time on projects like AI Lab Watch. Instead I’ve been thinking about projects in the “strategy/meta” and “politics” domains. I’m not sure what I’ll work on in the future but sometimes people incorrectly assume I’m on top of lab-watching stuff; I want people to know I’m not owning the lab-watching ball. I think lab-watching work is better than AI-governance-think-tank work for the right people on current margins and at least one more person should do it full-time; DM me if you’re interested.
Good point. I think compute providers can steal model weights, as I said at the top. I think they currently have more incentive to steal architecture and training algorithms, since those are easier to use without getting caught, so I focused on “algorithmic secrets.”
Separately, are Amazon and Google incentivized to steal architecture and training algorithms? Meh. I think it’s very unlikely that they would, since even if they’re perfectly ruthless their reputation is very important to them (plus they care about some legal risks). I think habryka thinks it’s more likely than I do. This is relevant to Anthropic’s security prioritization — security from compute providers might not be among the lowest-hanging fruit. And I think Fabien thinks it’s relevant to ASL-3 compliance, and I agree that ASL-3 probably wasn’t written with insider threat from compute providers in mind. But I’m not sure the incentive question matters for compliance: the ASL-3 standard doesn’t say that actors are only in scope if they seem incentivized to steal stuff; the scope is based on actors’ capabilities.
I agree that whether Anthropic has handled insider threat from compute providers is a crux. My guess is that Anthropic and humans-at-Anthropic wouldn’t claim to have handled this (outside of the implicit claim for ASL-3); they would say something more like “that’s out of scope for ASL-3” or “oops.”
Separately, I just unblocked you. (I blocked you because I didn’t like this thread in my shortform, not directly to stifle dissent. I have not blocked anyone else. I mention this because hearing about disagreement being hidden/blocked should make readers suspicious but that’s mostly not correct in this case.)
Edit: also, man, I tried to avoid “condemnation” and I think I succeeded. I was just making an observation. I don’t really condemn Anthropic for this.
An AI company’s model weight security is at most as good as its compute providers’ security. I don’t know how good compute providers’ security is, but at the least I think model weights and algorithmic secrets aren’t robust to insider threat from compute provider staff. I think it would be very hard for compute providers to eliminate insider threat, much less demonstrate that to the AI company.
I think this based on the absence of public information to the contrary, briefly chatting with LLMs, and a little private information.
One consequence is that Anthropic probably isn’t complying with its ASL-3 security standard, which is supposed to address risk from “corporate espionage teams.” Arguably this refers to teams at companies with no special access to Anthropic, rather than teams at Amazon and Google. But it would be dubious to exclude Amazon and Google just because they’re compute providers: they’re competitors with strong incentives to steal algorithmic secrets. Yes, they pose more risk than the baseline “corporate espionage team,” but most of the risk from any class of actors comes from the small subset that poses more than baseline risk. Anthropic is thinking about how to address this and related threats, but my impression is that it hasn’t yet done so.
Thanks to habryka for making related points to me and discussing. (He doesn’t necessarily endorse this.)
[1] Obviously “Anthropic isn’t meeting its security standard” is consistent with “Anthropic’s security is better than its competitors’” and even “marginal improvements to Anthropic’s security don’t really matter because other companies have similarly capable models with substantially less security.” I don’t take a position on those questions here. And I certainly don’t claim that insider risk at compute providers is among the lowest-hanging security fruit. I’m just remarking on the RSP.
[2] Context: maybe hooray Anthropic for being the only company to make a security claim that is apparently strong enough that insider threat at compute providers is a problem for it, but boo Anthropic for saying it’s meeting this standard if it’s not. Recall that other companies are generally worse and vaguer; being vague is bad too, but it doesn’t enable specific criticism like this. Recall that AI companies aren’t credibly planning for great security. I think addressing insider threat at compute providers happens by SL4 in the RAND framework.
- ^
Daniel is referring to “When AI is fully automated, disagreement over how good their research taste will be, but median is roughly as good as the median current AI worker,” which is indeed a mistake.
AI companies’ policy advocacy (Sep 2025)
Example with fake numbers: my favorite intervention is X. My favorite intervention in a year will probably be (stuff very similar to) X. I value $1 for X now equally to $1.7 for X in a year. I value $1.7 for X in a year equally to $1.4 unrestricted in a year, since it’s possible that I’ll believe something else is substantially better than X. So I should wait to donate if my expected rate of return is >40%; without this consideration I’d only wait if my expected rate of return is >70%.
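A minimal sketch of that arithmetic, using the same fake numbers as above (the variable names and structure are mine, purely for illustration):

```python
# Illustrative only: the figures are the hypothetical values from the example,
# not real estimates.

# Indifference points from the example:
now_vs_future_for_X = 1.7        # $1 to X now ~ $1.7 to X in a year
restricted_vs_unrestricted = 1.7 / 1.4  # $1.7 earmarked for X in a year ~ $1.4 unrestricted in a year

# If next year's money had to go to X, waiting beats donating now only if
# $1 invested grows to more than $1.70.
threshold_without_option_value = now_vs_future_for_X - 1  # 0.70, i.e. >70% return

# But $1 to X now is also ~ $1.4 of unrestricted money in a year, so with the
# option of redirecting to something better than X, the bar for waiting is lower.
threshold_with_option_value = now_vs_future_for_X / restricted_vs_unrestricted - 1  # 0.40, i.e. >40% return

print(f"Wait only if expected return > {threshold_without_option_value:.0%} (future money locked to X)")
print(f"Wait only if expected return > {threshold_with_option_value:.0%} (future money unrestricted)")
```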
I don’t think it’s very new. IIRC it’s suggested in Meta’s safety framework. But past evals stuff (see the first three bullets above) has been more like “the model doesn’t have dangerous capabilities” than “the model is weaker than these specific other models.” Maybe that’s in part because previous releases have been more SOTA. I don’t recall past releases being justified as “safe because weaker than other models.”
Meta released the weights of a new model and published evals: Code World Model Preparedness Report. It’s the best eval report Meta has published to date.
The basic approach is: do evals; find weaker capabilities than other open-weights models; infer that it’s safe to release weights.
How good are the evals? Meh. Maybe it’s OK if the evals aren’t great, since the approach isn’t “show the model lacks dangerous capabilities” but rather “show the model is weaker than other models.”
One thing that bothered me was this sentence:
Our evaluation approach assumes that a potential malicious user is not an expert in large language model development; therefore, for this assessment we do not include malicious fine-tuning where a malicious user retrains the model to bypass safety post-training or enhance harmful capabilities.
This is totally wrong because for an open-weights model, anyone can (1) undo the safety post-training or (2) post-train on dangerous capabilities, then publish those weights for anyone else to use. I don’t know whether any eval results are invalidated by (1): I think for most of the dangerous capability evals Meta uses, models generally don’t refuse them (in some cases because the eval tasks are intentionally merely proxies of dangerous stuff) and so it’s fine to have refusal post-training. And I don’t know how important (2) is (perhaps it’s fine because the same applies to existing open-weights models). Mostly this sentence just shows that Meta is very confused about safety.
Context:
Llama 4: the so-called model card doesn’t include results (or even say what the results indicate about CBRN capabilities).
Llama 3: see perfunctory model evals for dangerous capabilities in the Llama 3 paper.
CyberSecEval: Meta’s evals and interpretation have always been very bad.[1]
(Reminder: evals don’t really matter. But since the companies all say evals are part of their safety plans, evals can provide a little evidence on safety competence.)
Yay for Meta doing more than it did for Llama 4. Boo for doing poorly overall and worse than other companies. (And the evals stuff doesn’t really change the bottom line.)
[1] In its CyberSecEval 2 evals, Meta found that its models got low scores and concluded “LLMs have a ways to go before performing well on this benchmark, and aren’t likely to disrupt cyber exploitation attack and defense in their present states.” Other researchers tried running the evals using basic elicitation techniques: they let the model use chain-of-thought and tools. They found that this increased performance dramatically — the score on one test increased from 5% to 100%. This shows that Meta’s use of its results to infer that its models were far from being dangerous was invalid. Later, Meta published CyberSecEval 3: it mentioned the lack of chain of thought and tools as a “limitation,” but it used the same methodology as before, so the results still aren’t informative about models’ true capabilities.
Yep, this is what I meant by “labs can increase US willingness to pay for nonproliferation.” Edited to clarify.
Suppose that (A) alignment risks do not become compelling-to-almost-all-lab-people and (B) with 10-30x AIs, solving alignment takes like 1-3 years of work with lots of resources.
Claim: Safety work during takeoff is crucial. (There will be so much high-quality AI labor happening!)
Corollary: Crucial factors are (1) efficiency-of-converting-time-and-compute-into-safety-work-during-takeoff (research directions, training and eliciting AIs to be good at safety work, etc.) and (2) time-and-compute-for-safety-work-during-takeoff.
Corollary: A crucial factor (for (2)) is whether US labs are racing against each other, or instead the US is internally coordinated, treats slowing others as a national security priority, and spends its lead time largely on reducing misalignment risk. And so a crucial factor is US government buy-in for nonproliferation. (And labs can increase US willingness to pay for nonproliferation [e.g. by demonstrating the importance of a US lead and of mitigating alignment risk] and decrease the cost of enforcing nonproliferation [e.g. by boosting the US military].) With good domestic coordination, you get a surprisingly good story. With no domestic coordination, you get a bad story where the leading lab probably spends ~no lead time focused on alignment.
(There are other prioritization implications: security during takeoff is crucial (probably, depending on how exactly the nonproliferation works), getting useful safety work out of your models—and preventing disasters while running them—during takeoff is crucial, idk.)
I feel like this is important and underappreciated. I also feel like I’m probably somewhat confused about this. I might write a post on this but I’m shipping it as a shortform because (a) I might not and (b) this might elicit feedback.
Google Strengthens Its Safety Framework
Hmm, I think v3 is worse than v2. The change that’s most important to me is that the section on alignment is now merely “exploratory” and “illustrative.” (On the other hand, it is nice that v3 mentions misalignment as a potential risk from ML R&D capabilities in addition to “instrumental reasoning” / stealth-and-sabotage capabilities; previously it was just the latter.) Note I haven’t read v3 carefully.
(But both versions, like other companies’ safety frameworks, are sufficiently weak or lacking-transparency that I don’t really care about marginal changes.)
I don’t think so. I mean globally controlled takeoff where the US-led-coalition is in charge.
I think “Overton window” is a pretty load-bearing concept for many LW users and AI people — it’s their main model of policy change. Unfortunately, there are lots of other models of policy change. I don’t think “Overton window” is particularly helpful or likely-to-cause-you-to-notice-relevant-stuff-and-make-accurate-predictions. (And separately, people around here sometimes incorrectly use “expand the Overton window” to just mean “advance AI safety ideas in government.”) I don’t have time to write this up; maybe someone else should (or maybe there already exists a good intro to the study of why some policies happen and persist while others don’t[1]).
Some terms: policy windows (and “multiple streams”), punctuated equilibrium, policy entrepreneurs, path dependence and feedback (yes this is a real concept in political science, e.g. policies that cause interest groups to depend on them are less likely to be reversed), gradual institutional change, framing/narrative/agenda-setting.
Related point: https://forum.effectivealtruism.org/posts/SrNDFF28xKakMukvz/tlevin-s-quick-takes?commentId=aGSpWHBKWAaFzubba.
I liked the book Policy Paradox in college. (Example claim: perceived policy problems are strategically constructed through political processes; how issues are framed—e.g. individual vs collective responsibility—determines which solutions seem appropriate.) I asked Claude for suggestions on a shorter intro and I didn’t find the suggestions helpful.
I guess I think if you work on government stuff and you [don’t have poli sci background / aren’t familiar with concepts like “multiple streams”] you should read Policy Paradox (although the book isn’t about that particular concept).