I think it’s somewhat unclear, probably a bit of both:
- Previous work has found that it's surprisingly hard to poison pretraining with synthetic facts (https://arxiv.org/pdf/2410.13722).
- Previous work has found that it's easy to reverse the effects of fine-tuning (e.g. the toy experiments in https://arxiv.org/pdf/2311.12786 or https://arxiv.org/abs/2405.19550), though I think there is no experiment that directly compares pretraining and fine-tuning in the exact way we want.
- Additionally, LLMs are pretty good at memorizing very "random" data, such as BIG-bench canaries or phone numbers, even when they only appear a few times in pretraining.

So my guess is that the "fine-tuning" part is where most of the effect is.