Great post! It’s been almost a year since this was posted, so I was curious if anyone has worked on these questions:
Do you get any weird results from the pre-training data not being IID? Does this compromise capabilities in practice? Or does it lead to increased capabilities because the model cannot lean as much on memorization when it’s constantly getting trained on a previously-unseen future?
What if you want to run multiple epochs?[21] Then you have a conflict between wanting to fully update on the old data before you see new data vs. wanting to maximally spread out the points in time at which you repeat training data. How severe is this conflict? Are there any clever methods that could reduce it?
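To make the second question concrete, here is a toy sketch of the two extremes I have in mind (everything here is made up for illustration: `stream` is a list of per-period document lists in chronological order, and documents are whatever objects your loader yields):

```python
def blocked_replay(stream, n_epochs):
    """'Fully update first': when the documents for a period arrive, do all
    n_epochs passes over them before touching the next period. Every repeat
    of a document is bunched right next to its other repeats."""
    order = []
    for period_docs in stream:          # stream: per-period doc lists, in time order
        for _ in range(n_epochs):
            order.extend(period_docs)
    return order


def spread_replay(stream, n_epochs):
    """'Maximally spread repeats': each period's documents are seen once when
    they arrive, and their remaining repeats are deferred and spaced evenly
    over the rest of the run. Repeats are far apart, but the next period
    shows up before you have finished all passes over the current one."""
    n_periods = len(stream)
    buckets = [list(docs) for docs in stream]   # first exposure: on arrival
    for t, docs in enumerate(stream):
        remaining = n_periods - 1 - t
        for k in range(1, n_epochs):
            # place the k-th repeat a proportional distance into the remaining periods
            offset = (k * remaining) // max(1, n_epochs - 1)
            buckets[t + offset].extend(docs)
    return [doc for bucket in buckets for doc in bucket]
```

Anything in between (deferring only some repeats, or decaying the replay fraction over time) seems like the knob the question is really asking about.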
I did a quick lit review and didn’t find much. Here’s what I did find (not perfectly related to the above questions, though).
This GitHub issue explored whether training-data order affects memorization. They prompted an LLM with the first 20 tokens of each document in its training set and plotted the number of subsequent tokens the model reproduced correctly against the document’s position in the training order. They did not find a statistically significant relationship.
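For reference, here is roughly what I understand that probe to be: a minimal sketch, assuming a HuggingFace-style causal LM and a hypothetical `training_docs` list of `(position, text)` pairs (the issue’s actual model and data pipeline may differ):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")               # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def memorized_continuation_len(doc_text, prompt_tokens=20, max_check=50):
    """Prompt with the first `prompt_tokens` tokens of a training document and
    count how many of the following tokens are reproduced verbatim under
    greedy decoding."""
    ids = tok(doc_text, return_tensors="pt").input_ids[0]
    if len(ids) <= prompt_tokens:
        return 0
    prompt = ids[:prompt_tokens].unsqueeze(0)
    target = ids[prompt_tokens:prompt_tokens + max_check]
    with torch.no_grad():
        out = model.generate(prompt, max_new_tokens=len(target),
                             do_sample=False, pad_token_id=tok.eos_token_id)
    generated = out[0, prompt_tokens:]
    n = 0
    for g, t in zip(generated, target):
        if g.item() != t.item():
            break
        n += 1
    return n

# training_docs is hypothetical: a list of (position_in_training_order, document_text).
# points = [(pos, memorized_continuation_len(text)) for pos, text in training_docs]
# ...then plot memorized length against position and test for a trend.
```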
This paper tried to train chronologically consistent LLMs while mitigating leakage of future training data. Their models performed roughly the same as ordinary LLMs. However, it’s not clear to me how well they filtered their data. The only experiment they ran to “prove” that their training data wasn’t contaminated with future events was predicting future presidents: their model checkpoints from 1999 through 2024 were always unable to predict the correct future president. This is not strong enough evidence IMO.
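To make the filtering concern concrete: chronological consistency basically requires a per-checkpoint date cutoff over the corpus, something like the sketch below (hypothetical `corpus` of dicts with a `date` field). In practice, reliable timestamps for web data are the hard part, and a retro-dated or later-edited page slips right through this, which is why a single downstream probe like future presidents doesn’t convince me.

```python
from datetime import date

def filter_for_checkpoint(corpus, cutoff):
    """Keep only documents whose claimed publication date is on or before the
    checkpoint's training-data cutoff."""
    return [doc for doc in corpus if doc["date"] <= cutoff]

# Training set for a hypothetical 1999 checkpoint:
# corpus_1999 = filter_for_checkpoint(corpus, date(1999, 12, 31))
```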
So, has anyone pursued the two quoted questions above? Super curious if anyone has good results!
I don’t know of any work on these, unfortunately. Your two finds look useful, though, especially the paper — thanks for linking!