Many of the risks posed by highly capable LLM agents, from susceptibility to hijacking to reward hacking and deceptive alignment, stem from their opacity. If we could reliably monitor the reasoning processes underlying AI decisions, many of those risks would become far more tractable. Compared to other approaches in AI, LLMs offer a unique advantage: they can “think out loud” using chain-of-thought (CoT), enabling oversight of their decision-making processes. Yet the reliability of such monitoring hinges on an empirical question: do models need to externalize their reasoning in human language, or can they achieve the same performance through opaque internal computation?
In our new paper, we investigate LLM latent reasoning capabilities using two-hop question answering as a case study. We fine-tune LLMs (including Llama 3 8B and GPT-4o) on synthetic facts and test two-hop reasoning over these facts. By using synthetic facts, we rule out memorization and reasoning shortcuts as explanations for two-hop performance. We observe a nuanced picture:
Finding 1: Without explicit chain-of-thought reasoning, models completely fail to compose synthetic facts learned through fine-tuning, achieving only chance-level accuracy despite perfect recall of the individual facts.
Finding 2: Interventions to (i) force a correct fact storage order across transformer layers and (ii) encourage the first reasoning hop both fail to enable models to compose newly learned facts without chain-of-thought.
Finding 3: Models successfully compose newly learned synthetic facts without chain-of-thought when those facts co-occur in the same fine-tuning document or in the same test-time prompt.
Finding 4: LLMs are capable of composing two separately learned facts, as long as one of the facts is naturally acquired during pretraining (the other fact can be synthetic and acquired through fine-tuning).
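For concreteness, here is a minimal sketch of the kind of synthetic two-hop setup described above. All entity names, templates, and the data format are hypothetical illustrations, not the paper's actual materials.

```python
# Minimal sketch of a synthetic two-hop QA setup (hypothetical names and templates).

# First-hop facts (e1 -> e2): fictitious people and their spouses.
first_hop_facts = {"Aldric Vane": "Mirela Tosk"}

# Second-hop facts (e2 -> e3): fictitious people and the cities they live in.
second_hop_facts = {"Mirela Tosk": "Port Ellery"}

def fine_tuning_docs():
    """Each atomic fact is stated in its own fine-tuning document."""
    docs = [f"The spouse of {e1} is {e2}." for e1, e2 in first_hop_facts.items()]
    docs += [f"{e2} lives in {e3}." for e2, e3 in second_hop_facts.items()]
    return docs

def two_hop_question(e1):
    """Answering requires composing both facts without writing out the intermediate entity."""
    return f"In which city does the spouse of {e1} live? Answer with only the city name."
```

Because the entities are invented, the answer cannot be retrieved from pretraining memory or guessed via shortcuts; the model has to compose the two fine-tuned facts.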
These findings offer a humbling lesson: researchers observing models answering two-hop questions about real-world facts might incorrectly infer robust latent reasoning capabilities, not realizing that performance relies on facts appearing together in pretraining data. Conversely, our synthetic fine-tuning results alone might suggest models fundamentally cannot perform latent composition—a kind of limitation that could make oversight via reasoning traces highly reliable. Yet this too would be misleading given models’ success in other settings. This sensitivity to experimental conditions serves as a cautionary tale for the broader project of understanding LLM reasoning: apparent limitations may be artifacts of experimental setups far removed from how frontier models are trained, while apparent successes may rely on shortcuts invisible to experimenters.
Moving forward, we believe that researchers hoping to gain confidence in LLM agent oversight based on CoT monitoring should move beyond multihop question answering as a toy model of reasoning. What ultimately matters for monitoring capable LLM agents is whether oversight mechanisms can catch misbehavior in practice. Determining monitorability might thus be best achieved via end-to-end evaluations, where LLM agents attempt to complete agentic tasks while their CoT is monitored by another model. Such evaluations could reveal whether models are capable of latent reasoning that would allow carrying out harmful tasks without externalizing harmful intentions in a readable way.
Re: part 6:
I think there’s a more intuitive/abstract framing here. If a model has only seen e_2 with respect to two different facts, it probably won’t have generated an abstraction for e_2 in its world model at all. An abstraction is mostly useful as a hub of different inferences, like in the old blegg/rube diagram.
Something which has come up in pretraining will already be an abstraction with an easy-to-reach-for handle that the model can pull.
Might be testable by fine-tuning on only some of (or some pairs of) the spokes of a blegg/rube diagram, to see whether the final spoke-pairs fill in.
I.e.
“This object is round, so it’s a blegg, so it’s blue”
“This object is smooth, so it’s a blegg, so it’s round”
“This object is smooth, so it’s a blegg, so it’s bouncy”
“This object is round, is it bouncy?”
Something like that might cause “blegg” to be bound up and assembled into an abstraction in the AI, with a single representation.
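If it helps, here is a rough sketch of how that fine-tuning set and held-out test could be constructed; the property names, templates, and train/test split are just illustrative assumptions.

```python
# Hypothetical sketch of the proposed blegg/rube spoke experiment.

def spoke_pair_doc(premise, conclusion, label="blegg"):
    """One fine-tuning document linking two spokes of the diagram through the hub concept."""
    return f"This object is {premise}, so it's a {label}, so it's {conclusion}."

# Fine-tune on only some spoke pairs...
train_docs = [
    spoke_pair_doc("round", "blue"),
    spoke_pair_doc("smooth", "round"),
    spoke_pair_doc("smooth", "bouncy"),
]

# ...then test whether the held-out spoke pair fills in.
test_question = "This object is round. Is it bouncy? Answer yes or no."
```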
Overall I consider this work to be weak evidence in favour of multi-step reasoning being an issue, since the latter parts show that it definitely can occur (just not if both facts are fine-tuned separately).
Yeah, it seems plausible that an entity being activated across different contexts is necessary for it to be represented saliently enough to facilitate multi-hop reasoning. The Grokked transformer paper has some results linking the ratio of e2 and e3 to two-hop performance (in toy settings).
Do you have an explanation for why there is a bridge-entity representation mismatch for synthetic facts but not for real ones? What in “real” training allows LLMs to learn a common representation of input and output entities? Can you emulate that with additional fine-tuning on more synthetic documents?
We don’t have a good explanation. One idea could be that bridge entities need to be somehow more internalized to support latent two-hop reasoning, e.g. they need to occur in many facts as both first and second entities, or maybe they need to occur in other two-hop questions. The Grokked transformers paper has some results linking the ratio of e2 and e3 to two-hop performance (in toy grokking settings).
I think those are useful, but I think more basic science results are also important, especially when they take the form “LLMs differ from humans learning from data in way X”, because it lets you find ways in which modeling LLMs as “like humans, but weaker” is wrong. In particular, I think the reversal curse paper updated me much more than this GDM paper (though I think the GDM paper has other qualities, like illustrating a specific hope for CoT monitoring, which I think is important). Just because you failed to find such a result in this case and got a messier “LLMs can somewhat do the same 2-hop reasoning that humans can also do, except in these synthetic cases” doesn’t mean there aren’t other reversal-curse-like results that remain to be found.
But I think these “direct observation” papers will get somewhat more informative as we get closer to AIs that are actually scary, since the setups could actually get quite close to real catastrophic reasoning (as opposed to the more toy setups from the GDM paper).
I think that’s fair; it might be that we’ve over-updated a bit after getting results we did not expect (we did expect a reversal-curse-like phenomenon).
Two big reasons why I’m hesitant to draw conclusions about monitorability in agentic settings are that our setup is simplistic (QA, non-frontier models) and that we don’t offer a clean explanation of why we see the results we see.
Re: Experiment 3 w/ Distractors (Appendix E)
If I understand correctly, one concern w/ the same-document results (w/out other fact triplets) is that the model could just be learning a kind of bigram association between the names and the cities in each document.
If this were the case, we would still expect training with 3 other facts to yield non-zero accuracy and below-chance loss. Thus, non-zero accuracy and below-chance loss do not provide substantial evidence of two-hop reasoning.
To distinguish association from reasoning, you could compare performance to a baseline of sampling from all valid entities in the document (also it's possible you're already doing this and I missed it!)
That’s a fair point, and I’m sympathetic to the opinion that two-hop performance in the same-document setup probably doesn’t count as true reasoning. I agree it’d be better to compare performance to a baseline of sampling from all valid entities in the document; I wish we had done that!
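In case it's useful, a rough sketch of how that baseline could be computed; the `documents` structure is an assumption about the data format rather than the paper's actual code.

```python
# Chance baseline: if the model retrieved the right document and guessed uniformly
# among the candidate answer entities it contains, what accuracy would it get?

def chance_baseline(documents):
    """documents: list of dicts, each with a 'candidate_answers' list containing
    every valid answer entity mentioned in that fine-tuning document."""
    return sum(1.0 / len(doc["candidate_answers"]) for doc in documents) / len(documents)

# Example: if each document lists 4 candidate cities, the baseline is 0.25, so observed
# no-CoT two-hop accuracy should be compared against that, not against chance over
# all cities in the dataset.
```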
Tentatively planning on checking this; I’ll let you know what I find!
Please do!
Preliminary results make me much more confident the model is doing “true” multi-hop reasoning in the 3-distractor-triplets case. Most notable finding: on seed 42 (which I ran accidentally), same-doc w/ triplet distractors 2hop-no-cot accuracy improves to ~25% (compared to ~5% reported in the paper).
So I’m guessing there’s something like a “two-hop reasoning circuit” and a “memorization circuit”, and most of the time you mostly get the memorization circuit, but sometimes the two-hop reasoning circuit gets more reinforced.
This makes me fairly confident that training on a larger, more diverse dataset would lead to fairly consistent two-hop reasoning (even with distractors).
These results also raise some questions about optimal data ordering (perhaps it’s better to fine-tune on the atomic facts and then on the no-CoT demonstrations?), but I mostly suspect these would be solved by scale.
Thanks for sharing! Yes, variance here is high. In the paper we reported results averaged across three random seeds. The “memorization vs. generalization circuits” framing is also how I was thinking about it.
I think we tried that in the fully synthetic setup and it didn’t work. But there might be some threshold of data diversity needed for two-hop circuits to form, and after that optimal ordering could help.
Yeah, I ended up trying this too and it didn’t work (which is kind of an interesting finding in itself, though yeah, unclear if it applies to larger models / datasets).
Maybe this baseline is too high b/c it assumes perfect retrieval of the document. Instead you could just measure how often the model responds w/ an incorrect answer from the same document. If the correct answer is more frequent than those incorrect same-document answers, this is evidence of multi-hop reasoning.
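A minimal sketch of that check, assuming the eval results are stored with the fields below (the field names are my invention):

```python
# Compare how often the model gives the correct answer vs. an incorrect entity
# from the same document. Field names are assumptions about the eval format.

def same_doc_answer_rates(results):
    """results: list of dicts with 'response', 'correct_answer', and
    'same_doc_distractors' (the other candidate entities from that document)."""
    n = len(results)
    correct_rate = sum(r["response"] == r["correct_answer"] for r in results) / n
    distractor_hits = sum(r["response"] in r["same_doc_distractors"] for r in results)
    # Per-entity rate for distractors, so it is comparable to the single correct answer.
    per_distractor_rate = distractor_hits / (n * len(results[0]["same_doc_distractors"]))
    return correct_rate, per_distractor_rate
```

If the correct-answer rate clearly exceeds the per-distractor rate, that suggests the model is not just retrieving the document and guessing among its entities.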
We did some related work: https://arxiv.org/pdf/2502.03490.
One of our findings was that with synthetic data, it was necessary to have e1->e2 as the first hop in some two-hop question and e2->e3 as the second hop in some two-hop question in order to learn e1->e3. This differs from your finding with “natural” facts: if e2->e3 is a “natural” fact, then it plausibly does appear as a second hop in some of the pretraining data. But you find generalization even when the synthetic e1->e2 is present only by itself, so there seems to be a further difference between natural facts and synthetic facts that appear as second hops.
We also found that learning synthetic two-hop reasoning seems to take about twice as many parameters (or twice as much “knowledge capacity”) as learning only the one-hop questions from the same dataset, supporting the idea that, for transformers, learning to use a fact in either hop of a latent two-hop question requires something like learning that fact twice.
Did you try any experiments with a synthetic second hop instead of a synthetic first hop? It would be interesting to know whether “natural facts” can be composed flexibly with new facts or whether they can only be composed with new first hops. Our results suggest that there’s a substantial cost to making facts latently composable, so I think it would be surprising if many facts were flexibly composable, especially if many of those facts were reasonably rare.
Thanks! I somehow missed this paper, looks interesting!
Overall, I agree with the sentiment that two-hop latent reasoning is unreliable (e.g. our average accuracy is around 20%). We didn’t intend to leave readers with an impression that it “just works”. It seems very plausible to me that it’s less parameter-efficient and that there are some additional, unidentified factors required for successful two-hop reasoning.
We did not, but Jiahai Feng had an experiment like this in his paper.