We did some related work: https://arxiv.org/pdf/2502.03490.
One of our findings was that, with synthetic data, learning e1->e3 required e1->e2 to appear as the first hop in some two-hop question and e2->e3 to appear as the second hop in some two-hop question. This differs from your finding with “natural” facts: if e2->e3 is a “natural” fact, then it plausibly does appear as a second hop somewhere in the pretraining data. But you find generalization even when the synthetic e1->e2 is present only by itself, so there seems to be a further difference between natural facts and synthetic facts that appear as second hops.
We also found that learning synthetic two-hop reasoning seems to take about twice as many parameters (or twice as much “knowledge capacity”) as learning only the one-hop questions from the same dataset. This supports the idea that, for transformers, learning to use a fact in either hop of a latent two-hop question requires something like learning that fact twice.
Did you try any experiments with a synthetic second hop instead of a synthetic first hop? It would be interesting to know whether “natural facts” can be composed flexibly with new facts or whether they can only be composed with new first hops. Our results suggest that there’s a substantial cost to making facts latently composable, so I think it would be surprising if many facts were flexibly composable, especially if many of those facts were reasonably rare.
Thanks! I somehow missed this paper, looks interesting!
Overall, I agree with the sentiment that two-hop latent reasoning is unreliable (e.g. our average accuracy is around 20%). We didn’t intend to leave readers with an impression that it “just works”. It seems very plausible to me that it’s less parameter-efficient and that there are some additional, unidentified factors required for successful two-hop reasoning.
We did not try a synthetic second hop, but Jiahai Feng had an experiment like this in his paper.