I’m the chief scientist at Redwood Research.
Note that repeating the problem X times also works (and yields a similar performance increase to filler tokens, given the optimal number of repeats/filler). The boost is also similar across different types of filler (which you’d naively expect to result in different amounts of suggestion, etc.).
I can quickly test this though, will run in one sec.
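For concreteness, here’s a minimal sketch of how the two prompt variants (problem repeats vs. counting filler) could be constructed; the template wording and the instruction to answer immediately are illustrative, not the exact prompts used:

```python
# Minimal sketch (illustrative, not the exact prompts used) of "repeat the
# problem X times" vs. "counting filler" prompts for no-CoT evaluation.

def repeat_prompt(problem: str, n_repeats: int) -> str:
    # Repeat the problem statement n_repeats times before asking for an answer.
    repeated = "\n\n".join([problem] * n_repeats)
    return f"{repeated}\n\nAnswer immediately with just the final answer."

def filler_prompt(problem: str, n_filler: int) -> str:
    # Append counting filler (1 2 3 ... n_filler) after the problem.
    filler = " ".join(str(i) for i in range(1, n_filler + 1))
    return f"{problem}\n\n{filler}\n\nAnswer immediately with just the final answer."

print(repeat_prompt("What is 17 * 23?", n_repeats=4))
print(filler_prompt("What is 17 * 23?", n_filler=300))
```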
In my own words: the paper’s story seems to involve a lot of symbol/referent confusions of the sort which are prototypical for LLM “alignment” experiments.
To be clear, we don’t just ask the model what it would do, we see what it actually does in situations that the LLM hopefully “thinks” are real. It could be that our interpretations of motives and beliefs (e.g., “strategically pretend to comply”, we think the model mostly thinks the setup is real / isn’t a test) are wrong, but the actual output behavior of the model is at least different in a way that is very consistent with this story, and this matches the model’s CoT. I agree that “the model says X in the CoT” is limited evidence that X is well described as the AI’s belief or reason for taking some action (and there can be something like symbol/referent confusions with this). And it could also be that results on current LLMs have very limited transfer to the most concerning AIs. But, despite these likely agreements, I think you are making a stronger claim when you talk about symbol/referent confusions that isn’t accurate.
Can you clarify what you mean by meta-cognition?
It requires deciding what to think about at a given token, probably in a somewhat flexible way.
Recent LLMs can use filler tokens or problem repeats to improve (no-CoT) math performance
Filler tokens don’t allow for serially deeper cognition than what architectural limits allow (n-layers of processing), but they could totally allow for solving a higher fraction of “heavily serial” reasoning tasks [[1]] insofar as the LLM could still benefit from more parallel processing. For instance, the AI might by default be unable to do some serial step within 3 layers but can do that step within 3 layers if it can parallelize this over a bunch of filler tokens. This functionally could allow for more serial depth unless the AI is strongly bottlenecked on serial depth with no way for more layers to help (e.g., the shallowest viable computation graph has depth K and K is greater than the number of layers and the LLM can’t do multiple nodes in a single layer [[2]] ).
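To make the footnoted condition concrete, here’s a toy model (everything here is made up for illustration, not a claim about real transformer internals): filler positions multiply the parallel work available at each level of the computation, but the serial depth budget is still the number of layers.

```python
# Toy model, purely illustrative: L = layers, W = parallel "work" one token
# position can do per layer, T = filler token positions available.

def feasible(depth_required: int, work_per_level: int, L: int, W: int, T: int) -> bool:
    # Serial depth is still capped by the number of layers, filler or not.
    if depth_required > L:
        return False
    # But filler positions multiply the parallel work available at each level.
    return work_per_level <= W * T

# Too wide to fit without filler (T=1), fits with 300 filler positions,
# even though the serial depth bound (L) is unchanged:
print(feasible(depth_required=3, work_per_level=40, L=3, W=16, T=1))    # False
print(feasible(depth_required=3, work_per_level=40, L=3, W=16, T=300))  # True
# Depth-bottlenecked task: filler doesn't help at all.
print(feasible(depth_required=5, work_per_level=1, L=3, W=16, T=300))   # False
```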
Paradoxically, xAI might be in a better position as a result of having fewer users, and so they might be able to serve their 6T total param Grok 5 starting early 2026 at a reasonable price.
If compute used for RL is comparable to compute used for inference for GDM and Anthropic, then serving to users might not be that important of a dimension. I guess it could be acceptable to have much slower inference for RL but not for serving to users.
The AIs are obviously fully (or almost fully) automating AI R&D and we’re trying to do control evaluations.
Looks like it isn’t specified again in the Opus 4.5 System card despite Anthropic clarifying this for Haiku 4.5 and Sonnet 4.5. Hopefully this is just a mistake...
The capability evaluations in the Opus 4.5 system card seem worrying. The evidence provided in the system card seems pretty weak (in terms of how much it supports Anthropic’s claims). I plan to write more about this in the future; here are some of my more quickly written up thoughts.
[This comment is based on this X/twitter thread I wrote]
I ultimately basically agree with their judgments about the capability thresholds they discuss. (I think the AI is very likely below the relevant AI R&D threshold, the CBRN-4 threshold, and the cyber thresholds.) But, if I just had access to the system card, I would be much more unsure. My view depends a lot on assuming some level of continuity from prior models (and assuming 4.5 Opus wasn’t a big scale up relative to prior models), on other evidence (e.g. METR time horizon results), and on some pretty illegible things (e.g. making assumptions about evaluations Anthropic ran or about the survey they did).
Some specifics:
Autonomy: For autonomy, evals are mostly saturated so they depend on an (underspecified) employee survey. They do specify a threshold, but the threshold seems totally consistent with a large chance of being above the relevant RSP threshold. (In particular the threshold is “A majority of employees surveyed think the AI can’t automate a junior researcher job AND a majority think uplift is <3x”. If 1⁄4 of employees thought it could automate a junior researcher job that would be a lot of evidence for a substantial chance it could!)
Cyber: Evals are mostly saturated. They don’t specify any threshold or argue for their ultimate judgement that the AI doesn’t pose catastrophic cyber risk.
Bio: To rule out the CBRN-4 threshold (uplift for moderately resourced state programs, e.g. North Korea), they seem to depend mostly on a text-based uplift trial. The model is extremely close to the relevant threshold and it’s unclear how much confidence we should have in this uplift trial.
Generally, it seems like the current situation is that capability evals don’t provide much assurance. This is partially Anthropic’s fault (they are supposed to do better) and partially because the problem is just difficult and unsolved.
I still think Anthropic is probably mostly doing a better job evaluating capabilities relative to other companies.
(It would be kinda reasonable for them to clearly say “Look, evaluating capabilities well is too hard and we have bigger things to worry about, so we’re going to half-ass this and make our best guess. This means we’re no longer providing much/any assurance, but we think this is a good tradeoff given the situation.”)
Some (quickly written) recommendations:
We should actually get some longer+harder AI R&D/autonomy tasks. E.g., tasks that take a human a week or two (and that junior researchers at Anthropic can somewhat reliably do). The employee survey should be improved (make sure employees have had access for >1-2 weeks, give us the exact questions, probably sanity check this more) and the threshold should probably be lower (if 1⁄4 of the employees do think the AI can automate a junior researcher, why should we have much confidence that it can’t!).
Anthropic should specify a threshold for cyber or make it clear what they are using to make judgments. It would also be fine for them to say “We are no longer making a judgment on whether our AIs are above ASL-3 cyber, but we guess they probably aren’t. We won’t justify this.”
On bio, I think we need more third party review of their evals and some third party judgment of the situation because we’re plausibly getting into a pretty scary regime and their evals are extremely illegible.
We’re probably headed towards a regime of uncertainty and limited assurance. Right now is easy mode and we’re failing to some extent.
I’m just literally assuming that Plan B involves a moderate amount of lead time via the US having a lead or trying pretty hard to sabotage China; this is part of the plan/assumptions.
I don’t believe that datacenter security is actually a problem (see another argument).
Sorry, is your claim here that securing datacenters against highly resourced attacks from state actors (e.g. China) is going to be easy? This seems like a crazy claim to me.
(This link you cite isn’t about this claim, the link is about AI enabled cyber attacks not being a big deal because cyber attacks in general aren’t that big of a deal. I think I broadly agree with this, but think that stealing/tampering with model weights is a big deal.)
The China Problem: Plan B’s 13% risk doesn’t make sense if China (DeepSeek) doesn’t slow down and is only 3 months behind. Real risk is probably the same as for E, 75% unless there is a pivotal act.
What about the US trying somewhat hard to buy lead time, e.g., by sabotaging Chinese AI companies?
The framework treats political will as a background variable rather than a key strategic lever.
I roughly agree with this. It’s useful to condition on (initial) political will when making a technical plan, but I agree raising political will is important, and one issue with this perspective is that it might incorrectly make this less salient.
Not a crux for me ~at all. Some upstream views that make me think “AI takeover but humans stay alive” is more likely and also make me think avoiding AI takeover is relatively easier might be a crux.
I expect a roughly 5.5-month doubling time in the next year or two, but somewhat lower seems pretty likely. The timeline I proposed that would be consistent with Anthropic’s predictions requires <1-month doubling times (and this is prior to >2x AI R&D acceleration, at least given my view of what you get at that level of capability).
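To spell out the shape of that calculation (the numbers below are placeholders, not METR’s measurements or Anthropic’s actual prediction): the implied doubling time is just the months available divided by the number of horizon doublings required.

```python
import math

# Placeholder numbers; only the shape of the calculation is the point here.
current_horizon_hours = 4.0       # hypothetical current 50%-reliability time horizon
required_horizon_hours = 160.0    # hypothetical horizon needed for the predicted capability
months_until_prediction = 14      # hypothetical time remaining until the predicted date

doublings_needed = math.log2(required_horizon_hours / current_horizon_hours)
implied_doubling_time = months_until_prediction / doublings_needed
print(f"{doublings_needed:.1f} doublings -> {implied_doubling_time:.1f} months per doubling")
# Compare the implied doubling time against the expected ~5.5-month doubling time;
# the larger the gap between required and current horizons, the shorter it must be.
```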
I’d guess SWE-bench Verified has an error rate around 5% or 10%. They didn’t have humans baseline the tasks, just look at them and see whether they seem possible.
Wouldn’t you expect things to look logistic substantially before full saturation?
Wouldn’t you expect this if we’re close to saturating SWE-bench (and some of the tasks are impossible)? Like, you eventually cap out at the max performance for SWE-bench, and this doesn’t correspond to an infinite time horizon on literally SWE-bench (you’d need to include more, longer tasks).
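A sketch of the point with made-up numbers: if some fraction of SWE-bench tasks are impossible or mislabeled, measured success caps out at a ceiling below 100%, so a logistic fit over task length should flatten at that ceiling rather than at 1.0.

```python
import numpy as np

# Made-up numbers; a logistic in log task length with a ceiling < 1 to
# represent impossible/mislabeled tasks.
def success_rate(task_minutes, horizon_minutes, ceiling=0.92, slope=1.0):
    x = slope * (np.log2(horizon_minutes) - np.log2(task_minutes))
    return ceiling / (1.0 + np.exp(-x))

lengths = np.array([2, 8, 30, 120, 480])  # illustrative task lengths in minutes
print(success_rate(lengths, horizon_minutes=120))
# Even on very short tasks, measured success approaches ~0.92 rather than 1.0,
# so benchmark-level performance flattens well before an "infinite" time horizon.
```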
I agree probably more work should go into this space. I think it is substantially less tractable than reducing takeover risk in aggregate, but much more neglected right now. I think work in this space has the capacity to be much more zero sum (among existing actors, avoiding AI takeover is zero sum with respect to the relevant AIs) and thus can be dodgier.
What’s up with Anthropic predicting AGI by early 2027?
Seems relevant post-AGI/ASI (human labor is totally obsolete and AIs have massively increased energy output), maybe around the same point as when you’re starting to build stuff like Dyson swarms or other massive space-based projects. But yeah, IMO probably irrelevant in the current regime (for the next >30 years without AGI/ASI), and current human work in this direction probably doesn’t transfer.
I think the case in favor of space-based datacenters is that energy efficiency of space-based solar looks better: you can have perfect sun 100% of the time and you don’t have an atmosphere in the way. But, this probably isn’t a big enough factor to matter in realistic regimes without insane amounts of automation etc.
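Rough numbers behind that comparison (approximate figures from memory, so treat them as an assumption):

```python
# Back-of-the-envelope comparison of average insolation per unit panel area.
solar_constant = 1361          # W/m^2 above the atmosphere (approximate)
surface_peak = 1000            # W/m^2 typical clear-sky peak at the surface (approximate)
ground_capacity_factor = 0.20  # rough utility-scale solar capacity factor (night, weather, angle)
space_duty_cycle = 0.99        # near-continuous sun in a suitable orbit (assumed)

ground_avg = surface_peak * ground_capacity_factor  # ~200 W/m^2 averaged over time
space_avg = solar_constant * space_duty_cycle       # ~1350 W/m^2 averaged over time
print(space_avg / ground_avg)  # roughly 6-7x more energy per unit panel area
```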
Recall that without filler, Opus 4.5 performance is 45.2%. I tried the following experiments on Opus 4.5 with filler counting to 300:
Default (what I do by default in the blog post above): 51.1%
Remove the text explaining filler tokens (as in, cut “After the problem …”): 50.4%
Use “After the problem, there will be distractor tokens (counting from 1 to {filler_tokens})”: 51.1%
Don’t actually use filler tokens, but include in the prompt “After the problem, there will be filler tokens (counting from 1 to 300) to give you extra space to process the problem before answering” (as in, this is just a lie, we don’t give filler): 45.8%
So, it seems like the framing doesn’t matter ~at all and actually having the filler tokens is the key thing (at least for Opus 4.5, though I strongly expect this would reproduce for Opus 4, Sonnet 4).
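For reference, here’s a sketch of how the four variants above could be constructed (the template wording is paraphrased from the descriptions above, not the exact prompt text):

```python
# Sketch of the four prompt variants tested above (wording paraphrased).
N = 300
COUNTING = " ".join(str(i) for i in range(1, N + 1))

def build_prompt(problem: str, variant: str) -> str:
    explain = (f"After the problem, there will be filler tokens (counting from 1 to {N}) "
               "to give you extra space to process the problem before answering.")
    distract = f"After the problem, there will be distractor tokens (counting from 1 to {N})."
    if variant == "default":            # explanation + real filler -> 51.1%
        return f"{explain}\n\n{problem}\n\n{COUNTING}"
    if variant == "no_explanation":     # real filler, no explanatory text -> 50.4%
        return f"{problem}\n\n{COUNTING}"
    if variant == "distractor_framing": # "distractor" framing + real filler -> 51.1%
        return f"{distract}\n\n{problem}\n\n{COUNTING}"
    if variant == "lie_no_filler":      # explanation but no actual filler -> 45.8%
        return f"{explain}\n\n{problem}"
    raise ValueError(f"unknown variant: {variant}")
```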