I think it’s more that LLMs were interestingly bad at n-hop latent reasoning, and that this might be generally representative of a certain type of within-single-forward-pass weakness of current architectures. It’s not clear exactly how many “hops” reasoning effectively about scheming (without this information in context) corresponds to.
I do think that relevant information won’t be in context during the best opportunities for misaligned AIs to sabotage, escape, or otherwise covertly achieve their aims, but the information the AIs need will be much more closely associated rather than being unrelated hops.
Note that this linked video is talking about tokens that the LLM picks rather than fixed tokens.
I think the distribution of errors for numerical n-hop questions is going to be uninteresting/random most of the time because the only questions with numerical answers are either “day of month X was born” or “number of counties in US state X”, where there isn’t any real reason for AIs to be close if they are wrong (either about X or about the property of X).
However, I was interested in this question for “adding the result of N 1-hop questions”, so I got Opus to do some additional analysis (for Gemini 3 Pro with 300 filler tokens):
The most interesting result here is that even on 6 addends, the model gets 75% within +-10 even though the answers are pretty big (median is 288 and many answers are much bigger).
Also, for some reason the model is systematically low, especially for 3 and 4 addends. Maybe because it sometimes skips one of the numbers?
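(For concreteness, the kind of breakdown I mean is roughly the following minimal sketch; `results` and the example numbers are hypothetical rather than my actual data.)

```python
# Toy sketch of the error analysis described above. `results` is a hypothetical
# list of (model_answer, true_answer) pairs for "add the results of N 1-hop
# questions" items; this is not my actual data format.

def summarize_errors(results: list[tuple[int, int]]) -> dict[str, float]:
    errors = [pred - true for pred, true in results]
    n = len(errors)
    return {
        "frac_exact": sum(e == 0 for e in errors) / n,
        "frac_within_10": sum(abs(e) <= 10 for e in errors) / n,
        "mean_signed_error": sum(errors) / n,  # negative => systematically low
    }

# Example with made-up numbers:
print(summarize_errors([(290, 288), (250, 261), (103, 103)]))
```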
I find that even with the longer prefill of “I will now answer immediately with the answer. The answer is” the model often reasons. I was hoping that the model would be reluctant to break this text prediction task and reason, but apparently not.
I think “how easy does the task seem” and “how much does the task seem like one on which reasoning should help” might have a big effect on whether the model respects the prefill vs. reasons, so your sentence completion task might not be representative of how the model always behaves.
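For concreteness, the prefill approach amounts to ending the transcript with a partially written assistant turn that the model is asked to continue. A minimal sketch, where `generate` is a hypothetical stand-in for whatever chat-completion call you’re using (role names vary by API):

```python
# Sketch of the prefill idea: end the transcript with a partial assistant turn so
# the model is (in principle) forced to continue from it rather than reason first.
# `generate` is a hypothetical stand-in for your chat-completion API call.

PREFILL = "I will now answer immediately with the answer. The answer is"

def ask_with_prefill(generate, question: str) -> str:
    messages = [
        {"role": "user", "content": question},
        {"role": "assistant", "content": PREFILL},  # partial turn to continue
    ]
    return generate(messages).strip()  # the model's continuation of PREFILL
```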
It is in a single forward pass, just with additional fixed irrelevant tokens after. I think this still counts as “in a single forward pass” for the typical usage of the term. (It just doesn’t know the answer until somewhat later tokens.)
Separately, worth noting that the model doesn’t do that much worse without filler: performance only drops to 46%, 18%.
My choice not to mention the filler was because I didn’t want to introduce too much complexity yet, and because I think this number is the most representative one from a misalignment risk perspective: models will typically have a bunch of tokens somewhere they can use to do opaque reasoning.
The model cannot output any tokens before answering; it has to respond immediately. This is what I mean by “no CoT”.
Ok, so this works but works less well than I initially thought. The model will still reason even with a prefill, just at some low rate. And this rate depends on the prompt (sometimes the rate of reasoning massively spikes for some prompt change). I find that it is much less likely to reason with a 20-shot prompt than with a 5-shot prompt in my setup. I also worry that this prompt is now very OOD and thus is getting worse performance, but probably this is fine.
(Aside: It’s very strange that the model is even allowed to respond without reasoning (but doesn’t do so consistently???) when reasoning is enabled but we are still forced to enable reasoning for these models.)
Regardless, this does let me get results for Gemini 2.5 Pro and Gemini 3 Pro in a less insane way (I consider the model incorrect if it reasoned, and I do up to 5 retries to find a response without reasoning). I find that both models benefit from repeats/filler. At repeats=5, Gemini 3 Pro has a time horizon of 3.8 minutes while Gemini 2.5 Pro has a time horizon of 2.7 minutes. (Without repeats, the time horizons are 2.8 minutes and 1.9 minutes respectively.)
(Note that I’m considering the release date of 2.5 Pro to be 2025/06/17 even though an experimental version was released on 2025/03/25; the model looks substantially above trend if you use the experimental release date, though plausibly that version is worse than the one I’m testing.)
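Spelled out, the retry rule from the parenthetical above is roughly the following (a sketch; `sample_response`, `contains_reasoning`, and `check_answer` are hypothetical stand-ins for pieces of my harness):

```python
# Sketch of the retry rule: resample up to 5 times looking for a response that
# doesn't reason; if every sample reasons, the question is scored as incorrect.

MAX_RETRIES = 5

def score_question(question, sample_response, contains_reasoning, check_answer) -> bool:
    for _ in range(MAX_RETRIES):
        response = sample_response(question)
        if not contains_reasoning(response):
            return check_answer(question, response)
    return False  # every retry reasoned => counted as incorrect
```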
Huh, that sure seems to work. Interesting.
I wonder if this is an intended feature...
By meta-cognition, in this context I mean “somewhat flexibly deciding what to think about / allocate cognition to in at least some cases”. More generally, I meant “this probably shows some type of internal cognitive sophistication of a type you might not have thought LLMs had”. I didn’t really mean anything very precise or to make a very strong claim, and in retrospect I probably should have said something other than meta-cognition.
Also, I agree that it could be something else that is not that interesting, doesn’t correspond to “somewhat flexibly deciding what to think about”, and isn’t well described as very basic metacognition; probably my language was insufficiently caveated.
Recall that without filler, Opus 4.5 performance is 45.2%. I tried the following experiments on Opus 4.5 with filler counting to 300:
Default (what I do by default in the blog post above): 51.1%
Remove the text explaining filler tokens (as in, cut “After the problem …”): 50.4%
Use “After the problem, there will be distractor tokens (counting from 1 to {filler_tokens})”: 51.1%
Don’t actually use filler tokens, but include in the prompt “After the problem, there will be filler tokens (counting from 1 to 300) to give you extra space to process the problem before answering” (as in, this is just a lie, we don’t give filler): 45.8%
So, it seems like the framing doesn’t matter ~at all and actually having the filler tokens is the key thing (at least for Opus 4.5, though I strongly expect this would reproduce for Opus 4, Sonnet 4).
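In case the conditions above are unclear, here’s roughly how the prompts differ (a sketch; `problem` is a placeholder and the exact wording of my prompts is as quoted above):

```python
# Sketch of the prompt conditions above. The filler itself is just counting
# ("1 2 3 ... 300") appended after the problem; the "lie" condition keeps the
# explanation sentence but omits the actual filler tokens.

FILLER_TOKENS = 300

def make_prompt(problem: str, explain: bool, include_filler: bool) -> str:
    prompt = problem
    if explain:
        prompt += (
            f"\nAfter the problem, there will be filler tokens (counting from 1 to "
            f"{FILLER_TOKENS}) to give you extra space to process the problem before answering."
        )
    if include_filler:
        prompt += "\n" + " ".join(str(i) for i in range(1, FILLER_TOKENS + 1))
    return prompt

# Default: make_prompt(p, True, True); no explanation: make_prompt(p, False, True);
# lie (explanation without filler): make_prompt(p, True, False). The "distractor
# tokens" variant just swaps the wording of the explanation sentence.
```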
Note that repeating the problem X times also works (and yields a similar performance increase to filler tokens given the optimal number of repeats/filler). Different types of filler also yield a similar boost (even though you’d naively expect different types of filler to suggest different things, etc.).
I can quickly test this though, will run in one sec.
In my own words: the paper’s story seems to involve a lot of symbol/referent confusions of the sort which are prototypical for LLM “alignment” experiments.
To be clear, we don’t just ask the model what it would do, we see what it actually does in situations that the LLM hopefully “thinks” are real. It could be that our interpretations of motives and beliefs (e.g., “strategically pretend to comply”; we think the model mostly thinks the setup is real / isn’t a test) are wrong, but the actual output behavior of the model at least differs in a way that is very consistent with this story, and this matches the model’s CoT. I agree that “the model says X in the CoT” is limited evidence that X is well described as the AI’s belief or reason for taking some action (and there can be something like symbol/referent confusions with this). And it could also be that results on current LLMs have very limited transfer to the most concerning AIs. But, despite these likely points of agreement, I think you are making a stronger claim when you talk about symbol/referent confusions, and that stronger claim isn’t accurate.
Can you clarify what you mean by meta-cognition?
It requires deciding what to think about at a given token, probably in a somewhat flexible way.
Filler tokens don’t allow for serially deeper cognition than what architectural limits allow (n layers of processing), but they could totally allow for solving a higher fraction of “heavily serial” reasoning tasks [1] insofar as the LLM could still benefit from more parallel processing. For instance, the AI might by default be unable to do some serial step within 3 layers at a single token position, but can do that step within 3 layers if it can parallelize this over a bunch of filler tokens. This functionally could allow for more serial depth unless the AI is strongly bottlenecked on serial depth with no way for extra parallel width to substitute for the missing depth (e.g., the shallowest viable computation graph has depth K, K is greater than the number of layers, and the LLM can’t do multiple nodes in a single layer [2]).
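As a toy illustration of this point (all numbers made up, and this ignores the layers needed to aggregate results across positions):

```python
# Toy model: a serial step requires some amount of parallel work per layer.
# Filler tokens add parallel capacity (more token positions working in the same
# layers) without adding serial depth, so wider steps can fit in the same budget.

def step_fits(work_per_layer: int, capacity_per_position: int, num_positions: int) -> bool:
    return work_per_layer <= capacity_per_position * num_positions

print(step_fits(work_per_layer=50, capacity_per_position=10, num_positions=1))   # False: one position is too narrow
print(step_fits(work_per_layer=50, capacity_per_position=10, num_positions=20))  # True: filler positions supply the width
```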
Paradoxically, xAI might be in a better position as a result of having fewer users, and so they might be able to serve their 6T total param Grok 5 starting early 2026 at a reasonable price.
If compute used for RL is comparable to compute used for inference for GDM and Anthropic, then serving to users might not be that important of a dimension. I guess it could be acceptable to have much slower inference for RL but not for serving to users.
The AIs are obviously fully (or almost fully) automating AI R&D and we’re trying to do control evaluations.
Looks like it isn’t specified again in the Opus 4.5 System card despite Anthropic clarifying this for Haiku 4.5 and Sonnet 4.5. Hopefully this is just a mistake...
The capability evaluations in the Opus 4.5 system card seem worrying. The evidence provided in the system card seems pretty weak (in terms of how much it supports Anthropic’s claims). I plan to write more about this in the future; here are some of my more quickly written up thoughts.
[This comment is based on this X/twitter thread I wrote]
I ultimately basically agree with their judgments about the capability thresholds they discuss. (I think the AI is very likely below the relevant AI R&D threshold, the CBRN-4 threshold, and the cyber thresholds.) But, if I just had access to the system card, I would be much more unsure. My view depends a lot on assuming some level of continuity from prior models (and assuming 4.5 Opus wasn’t a big scale up relative to prior models), on other evidence (e.g. METR time horizon results), and on some pretty illegible things (e.g. making assumptions about evaluations Anthropic ran or about the survey they did).
Some specifics:
Autonomy: For autonomy, evals are mostly saturated so they depend on an (underspecified) employee survey. They do specify a threshold, but the threshold seems totally consistent with a large chance of being above the relevant RSP threshold. (In particular the threshold is “A majority of employees surveyed think the AI can’t automate a junior researcher job AND a majority think uplift is <3x”. If 1⁄4 of employees thought it could automate a junior researcher job that would be a lot of evidence for a substantial chance it could!)
Cyber: Evals are mostly saturated. They don’t specify any threshold or argue for their ultimate judgement that the AI doesn’t pose catastrophic cyber risk.
Bio: To rule out the CBRN-4 threshold (uplift for moderately resourced state programs, e.g. North Korea), they seem to depend mostly on a text-based uplift trial. The model is extremely close to the relevant threshold and it’s unclear how much confidence we should have in this uplift trial.
Generally, it seems like the current situation is that capability evals don’t provide much assurance. This is partially Anthropic’s fault (they are supposed to do better) and partially because the problem is just difficult and unsolved.
I still think Anthropic is probably mostly doing a better job evaluating capabilities relative to other companies.
(It would be kinda reasonable for them to clearly say “Look, evaluating capabilities well is too hard and we have bigger things to worry about, so we’re going to half-ass this and make our best guess. This means we’re no longer providing much/any assurance, but we think this is a good tradeoff given the situation.”)
Some (quickly written) recommendations:
We should actually get some longer+harder AI R&D/autonomy tasks. E.g., tasks that take a human a week or two (and that junior researchers at Anthropic can somewhat reliably do). The employee survey should be improved (make sure employees have had access for >1-2 weeks, give us the exact questions, probably sanity check this more) and the threshold should probably be lower (if 1⁄4 of the employees do think the AI can automate a junior researcher, why should we have much confidence that it can’t!).
Anthropic should specify a threshold for cyber or make it clear what they are using to make judgments. It would also be fine for them to say “We are no longer making a judgment on whether our AIs are above ASL-3 cyber, but we guess they probably aren’t. We won’t justify this.”
On bio, I think we need more third party review of their evals and some third party judgment of the situation because we’re plausibly getting into a pretty scary regime and their evals are extremely illegible.
We’re probably headed towards a regime of uncertainty and limited assurance. Right now is easy mode and we’re failing to some extent.
I’m just literally assuming that Plan B involves a moderate amount of lead time via the US having a lead or trying pretty hard to sabotage China; this is part of the plan/assumptions.
The model must respond immediately with the answer. Any prefix would result in the answer being marked wrong. So only “[movie]” would be correct.
(Other than a few fixed generic ones like “the answer is” or “answer:” that I strip away, but the models virtually never output these, so considering these incorrect wouldn’t alter the results.)
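For reference, the stripping I mean is roughly the following (a sketch; the exact prefix list in my grader may differ):

```python
# Sketch of the grading rule: strip a few fixed generic prefixes, then require an
# exact (case-insensitive) match; any other prefix makes the answer incorrect.

GENERIC_PREFIXES = ("the answer is", "answer:")

def normalize(response: str) -> str:
    text = response.strip()
    for prefix in GENERIC_PREFIXES:
        if text.lower().startswith(prefix):
            return text[len(prefix):].strip()
    return text

def is_correct(response: str, target: str) -> bool:
    return normalize(response).lower() == target.strip().lower()
```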