Re: Experiment 3 w/ Distractors (Appendix E)
If I understand correctly, one concern with the same-document results (without other fact triplets) is that the model could just be learning a kind of bigram association between the names and the cities in each document.
If this were the case, we would still expect training with 3 other fact triplets to yield non-zero accuracy and below-chance loss, since even a pure association learner would concentrate probability on the handful of cities mentioned in the document. Thus, non-zero accuracy and below-chance loss don't by themselves provide substantial evidence of two-hop reasoning.
To distinguish association from reasoning, you could compare performance to a baseline of sampling uniformly from all valid entities in the document, roughly along the lines of the sketch below (it's also possible you're already doing this and I missed it!).
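Concretely, something like this minimal sketch is what I have in mind (function names and the example numbers are made up for illustration, not from the actual eval code):

```python
# Chance baseline: if the model were only doing same-document association,
# its accuracy should look like uniform sampling over the valid entities
# (e.g. cities) mentioned in each document.
from typing import List


def chance_baseline(valid_entities_per_doc: List[int]) -> float:
    """Expected accuracy of guessing uniformly among each document's valid entities."""
    return sum(1.0 / n for n in valid_entities_per_doc) / len(valid_entities_per_doc)


def accuracy(predictions: List[str], answers: List[str]) -> float:
    """Exact-match accuracy of the model's answers."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


# With 1 correct city and 3 distractor cities per document, the association-only
# baseline is 25%; accuracy clearly above that would be evidence of actual reasoning.
print(chance_baseline([4, 4, 4, 4]))  # 0.25
```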
That’s a fair point, and I’m sympathetic to the view that two-hop performance in the same-document setup probably doesn’t count as true reasoning. I agree it’d be better to compare performance to a baseline of sampling from all valid entities in the document; I wish we had done that!
Tentatively planning on checking this; I’ll let you know what I find!
Please do!
Preliminary results make me much more confident the model is doing “true” multi-hop reasoning in the 3-distractor-triplet case. The most notable finding: on seed 42 (which I ran accidentally), same-doc with triplet distractors 2-hop no-CoT accuracy improves to ~25% (compared to the ~5% reported in the paper).
So I’m guessing there’s something like a “two-hop reasoning circuit” and a “memorization circuit”: most of the time the memorization circuit wins out, but sometimes the two-hop reasoning circuit gets reinforced more.
This makes me fairly confident that training on a larger, more diverse dataset would lead to fairly consistent two-hop reasoning (even with distractors).
These results also raise some questions about optimal data ordering (perhaps it’s better to finetune on the atomic facts first and then on the no-CoT demonstrations? see the sketch after this message), but I mostly suspect these would be resolved by scale.
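To make the ordering idea concrete, here’s a toy sketch of the two curricula I mean (the example strings are made up, and this obviously isn’t the paper’s actual training pipeline):

```python
# Two possible orderings of the finetuning data: everything shuffled together
# vs. all atomic facts first, then the no-CoT two-hop demonstrations.
import random

atomic_facts = [
    "Alice lives in Paris.",
    "Paris is located in France.",
    "Bob lives in Kyoto.",
    "Kyoto is located in Japan.",
]
no_cot_demos = [
    "Q: Which country does Alice live in? A: France",
    "Q: Which country does Bob live in? A: Japan",
]

# Mixed curriculum: one combined dataset, uniformly shuffled.
mixed = random.sample(atomic_facts + no_cot_demos, len(atomic_facts) + len(no_cot_demos))

# Staged curriculum: all atomic facts first, then the no-CoT demonstrations
# (shuffling only within each stage).
staged = random.sample(atomic_facts, len(atomic_facts)) + \
         random.sample(no_cot_demos, len(no_cot_demos))
```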
Thanks for sharing! Yes, variance here is high. In the paper we reported results averaged across three random seeds. The “memorization vs. generalization circuits” framing is also how I was thinking about it.
I think we tried that in the fully synthetic setup and it didn’t work. But there might be some threshold of data diversity needed for two-hop circuits to form, and after that optimal ordering could help.
Yeah, I ended up trying this too and it didn’t work (which is kind of an interesting finding in itself, though it’s unclear if it applies to larger models / datasets).
Maybe the all-valid-entities baseline is too high, because it assumes perfect retrieval of the document. Instead, you could just measure how often the model responds with an incorrect answer from the same document (see the sketch below). If correct answers are more frequent than these same-document errors, that’s evidence of multi-hop reasoning.
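Something like this sketch, with made-up field names (illustrative only, not the actual eval code):

```python
# Weaker test: among the model's answers, compare how often it produces the
# correct entity vs. an incorrect entity drawn from the same document.
from typing import Dict, List


def same_doc_answer_rates(examples: List[Dict]) -> Dict[str, float]:
    """Each example: {'prediction': str, 'answer': str, 'doc_entities': List[str]}."""
    n = len(examples)
    correct = sum(ex["prediction"] == ex["answer"] for ex in examples)
    wrong_same_doc = sum(
        ex["prediction"] != ex["answer"] and ex["prediction"] in ex["doc_entities"]
        for ex in examples
    )
    return {"correct": correct / n, "incorrect_same_doc": wrong_same_doc / n}


# If the "correct" rate is clearly higher than the "incorrect_same_doc" rate,
# the model prefers the right entity over other entities it could have retrieved
# from the same document, which points to multi-hop reasoning rather than pure
# name-city association.
```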