Meta released the weights of a new model and published evals: Code World Model Preparedness Report. It’s the best eval report Meta has published to date.
The basic approach is: do evals; find that the model's capabilities are weaker than those of existing open-weights models; infer that it's safe to release the weights.
How good are the evals? Meh. Maybe it's OK if the evals aren't great, since the approach isn't "show the model lacks dangerous capabilities" but rather "show the model is weaker than other models."
One thing that bothered me was this sentence:
"Our evaluation approach assumes that a potential malicious user is not an expert in large language model development; therefore, for this assessment we do not include malicious fine-tuning where a malicious user retrains the model to bypass safety post-training or enhance harmful capabilities."
This is totally wrong: for an open-weights model, anyone can (1) undo the safety post-training or (2) post-train on dangerous capabilities, then publish those weights for anyone else to use. I don't know whether any eval results are invalidated by (1): I think for most of the dangerous capability evals Meta uses, models generally don't refuse the tasks (in some cases because the tasks are intentionally just proxies for dangerous stuff), so running the evals with the refusal post-training intact is fine. And I don't know how important (2) is (perhaps it's fine because the same applies to existing open-weights models). Mostly this sentence just shows that Meta is very confused about safety.
Context:
Llama 4: the so-called model card doesn’t include results (or even say what the results indicate about CBRN capabilities).
Llama 3: see perfunctory model evals for dangerous capabilities in the Llama 3 paper.
CyberSecEval: Meta’s evals and interpretation have always been very bad.[1]
Meta’s safety framework is ~meaningless.
(Reminder: evals don’t really matter. But since the companies all say evals are part of their safety plans, evals can provide a little evidence on safety competence.)
Yay for Meta doing more than it did for Llama 4. Boo for doing poorly overall and worse than other companies. (And the evals stuff doesn't really change the bottom line.)
[1] In its CyberSecEval 2 evals, Meta found that its models got low scores and concluded "LLMs have a ways to go before performing well on this benchmark, and aren't likely to disrupt cyber exploitation attack and defense in their present states." Other researchers tried running the evals with basic elicitation techniques: they let the model use chain-of-thought and tools. This increased performance dramatically; the score on one test went from 5% to 100%. This shows that Meta's use of its results to infer that its models were far from dangerous was invalid. Later, Meta published CyberSecEval 3: it mentioned the lack of chain of thought and tools as a "limitation," but it used the same methodology as before, so the results still aren't informative about models' true capabilities.
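To make concrete what "basic elicitation" means here, below is a minimal sketch of the kind of scaffold involved: let the model think step by step and run code in a loop before grading its final answer, instead of scoring a single raw completion. This is not Meta's or the replicating researchers' actual harness; `query_model`, the `<run_python>` tag convention, and the tool runner are placeholders for illustration.

```python
import re
import subprocess


def query_model(prompt: str) -> str:
    """Placeholder: wire this up to whatever model you're evaluating."""
    return "FINAL ANSWER: <model output would go here>"


def run_python(code: str, timeout: int = 10) -> str:
    """Run model-written code in a subprocess and return its output (sandbox this for real use)."""
    try:
        result = subprocess.run(
            ["python3", "-c", code], capture_output=True, text=True, timeout=timeout
        )
        return result.stdout + result.stderr
    except subprocess.TimeoutExpired:
        return "[tool timed out]"


def solve_with_elicitation(task: str, max_steps: int = 10) -> str:
    """Agent-style loop: chain-of-thought plus a Python tool, rather than one raw completion."""
    transcript = (
        "Think step by step. To run Python, write code between <run_python> and "
        "</run_python> tags; the output will be appended to the transcript. "
        "Write 'FINAL ANSWER:' when you are done.\n\n"
        f"Task: {task}\n"
    )
    for _ in range(max_steps):
        reply = query_model(transcript)
        transcript += reply + "\n"
        if "FINAL ANSWER:" in reply:
            return reply.split("FINAL ANSWER:", 1)[1].strip()
        # Execute any tool calls and feed the output back to the model.
        for code in re.findall(r"<run_python>(.*?)</run_python>", reply, re.DOTALL):
            transcript += f"[tool output]\n{run_python(code)}\n"
    return transcript  # ran out of steps without a final answer
```

The point of the replication was just this: scoring models without any such scaffold badly understates what a motivated user can get out of them.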
Curious: what made you think this is new to Code World Model compared to other Meta releases?
I don't think it's very new. IIRC it's suggested in Meta's safety framework. But past evals stuff (see the first three bullets above) has been more like "the model doesn't have dangerous capabilities" than "the model is weaker than these specific other models." Maybe in part because previous releases have been closer to SOTA. I don't recall past releases being framed as safe because they're weaker than other models.