Models do see data more than once. Experimental testing shows that a certain amount of “hydration” (keeping repeated copies of data that is heavily duplicated in the training set) benefits the resulting model. This has diminishing returns once the repetition is enough to overfit a data point, memorizing it at the cost of validation performance, but in general, a few extra copies of something that is already common in the corpus actually help.
(Edit: So you can train a model on fully deduplicated data, but it will actually generalize worse than the alternative.)
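One way to act on this is to cap duplicates rather than fully deduplicate. Below is a minimal sketch of that idea; the function name `cap_duplicates` and the `max_copies` knob are hypothetical illustrations, not any particular pipeline's API:

```python
from collections import Counter

def cap_duplicates(docs, max_copies=3):
    """Keep up to max_copies of each document: more "hydration" than
    full dedup (which keeps exactly one copy), but a bound on the
    most-duplicated items that would otherwise risk memorization."""
    counts = Counter(docs)
    capped = []
    for doc, n in counts.items():
        capped.extend([doc] * min(n, max_copies))
    return capped

corpus = ["a", "a", "a", "a", "a", "b", "b", "c"]
print(cap_duplicates(corpus))  # → ['a', 'a', 'a', 'b', 'b', 'c']
```

The cap value itself would have to be tuned empirically, since the point at which repetition tips from helpful hydration into overfitting depends on the model and dataset.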