Most notable finding: on seed 42 (which I ran accidentally), same-doc w/ triplet distractors 2hop-no-cot accuracy improves to ~25% (compared to ~5% reported in the paper)
Thanks for sharing! Yes, variance here is high. In the paper we reported results averaged across three random seeds. The “memorization vs generalization circuits” is also how I was thinking of it.
Perhaps it's better to finetune on the atomic facts first, and then on the no-CoT demonstrations?
I think we tried that in the fully synthetic setup and it didn’t work. But there might be some threshold of data diversity needed for two-hop circuits to form, and after that optimal ordering could help.
Yeah, I ended up trying this too and it didn't work (which is kind of an interesting finding in itself, though it's unclear whether it applies to larger models / datasets).
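For concreteness, here is a minimal sketch of the staged-finetuning idea discussed above (atomic facts first, then no-CoT demonstrations). The model name, toy data, and hyperparameters are illustrative placeholders, not the paper's actual setup.

```python
# Hypothetical sketch of two-stage finetuning: atomic facts first, then no-CoT demos.
# Model, data, and hyperparameters are placeholders, not the paper's configuration.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder small model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Toy stand-ins for the two data sources.
atomic_facts = [
    "Alice's mother is Beth.",
    "Beth's employer is Acme Corp.",
]
no_cot_demos = [
    "Q: Who employs Alice's mother? A: Acme Corp.",
]

def finetune(texts, epochs=1, lr=5e-5, batch_size=2):
    """Plain causal-LM finetuning pass over a list of strings."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loader = DataLoader(texts, batch_size=batch_size, shuffle=True)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            enc = tokenizer(list(batch), return_tensors="pt",
                            padding=True, truncation=True).to(device)
            labels = enc["input_ids"].clone()
            labels[enc["attention_mask"] == 0] = -100  # ignore padding in the loss
            out = model(**enc, labels=labels)
            out.loss.backward()
            optimizer.step()
            optimizer.zero_grad()

# Stage 1: memorize the atomic facts.
finetune(atomic_facts, epochs=3)
# Stage 2: train on no-CoT two-hop demonstrations.
finetune(no_cot_demos, epochs=3)
```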