Seeing some training data more than once would make the incentive to [have concepts that generalize OOD] weaker than if [the model saw every possible training datapoint at most once], but this doesn’t mean that the latter is an incentive towards concepts that generalize OOD.
Though admittedly, we are getting into the discussion of where to place the zero point of “null OOD generalization incentive”.
Also, I haven’t looked into it, but it’s plausible to me that models actually do see some data more than once, because there are a lot of duplicates on the internet. If your training data contains the entire English Wikipedia, nLab, and some math textbooks, then surely there’s a lot of duplicated theorems and exercises (not necessarily word-for-word, but it doesn’t have to be word-for-word).
But I realized there might be another flaw in my comment, so I’m going to add an ETA.
(If I’m misunderstanding you, feel free to elaborate, ofc.)
My model of why SAEs work well for the Anthropic analysis is that the concepts discussed are genuinely ‘sparse’ features. Predicting ‘Rabbit’ on the next line is a discrete decision, and so is exactly the kind of thing SAEs are built to model. We expect these SAE features to generalize OOD, because the model probably genuinely has these sparse directions.
Whereas for ‘contextual / vibes’ based features, the ground truth is not a sparse sum of discrete features. It’s a continuous summary of the text obtained by averaging representations over the sequence. In this case, SAEs exhibit feature splitting where they are able to model the continuous summary with sparser and sparser features by clustering texts from the dataset together in finer and finer divisions. This starts off canonical, but eventually the clusters you choose are not features of the model, but features of the dataset. And at this point the features are no longer robust OOD because they aren’t genuine internal model features, they are tiny clusters that emerge from the interaction between the model and the dataset.
So in theory the model might have a direction corresponding to ‘harmful intent’, but the SAEs split the dataset into so many chunks that to recover ‘harmful intent’ you need to combine lots of SAE latents together. And the OOD behaviour arises from the SAE latents being unfaithful to the ground truth, not from the model having poor OOD behaviour. The SAE latents might be sufficiently fine-grained that you can patch together chunks of the dataset to fit the training distribution, in a non-robust way.
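For concreteness, here’s a rough sketch of the kind of SAE I have in mind (a generic ReLU encoder with an L1 sparsity penalty; I’m not claiming this is Anthropic’s exact setup). Each decoder column is a candidate feature direction, and under feature splitting a ‘vibes’ concept ends up spread over many of them rather than living in one latent.

```python
# Minimal sketch of a sparse autoencoder over residual-stream activations
# (illustrative only; real SAEs add details like tied/normalized decoders).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_latents: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_latents)
        self.decoder = nn.Linear(n_latents, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU zeroes most latents, so each activation is explained by a
        # small number of decoder columns ("feature directions").
        latents = torch.relu(self.encoder(x))
        recon = self.decoder(latents)
        return recon, latents

def sae_loss(x, recon, latents, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes latents toward sparsity.
    return ((recon - x) ** 2).mean() + l1_coeff * latents.abs().sum(dim=-1).mean()

# Under feature splitting, a concept like 'harmful intent' may only be
# recoverable as a sum over many latents, e.g.
#   harmful_intent = latents[:, harmful_latent_idxs].sum(dim=-1)
# where harmful_latent_idxs is a hypothetical set of latent indices, and the
# worry is that this patched-together combination fits the training
# distribution without tracking the model's underlying direction.
```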
As for concepts that generalize OOD—I suppose it depends what is meant by OOD? Is looking at a dataset the model wasn’t exposed to, but reasonably could have been, OOD? If so, the incentive for learning OOD-robust concepts is that most text the model receives during training is novel, i.e. OOD in this sense, so if its concepts are only relevant to the text it has already seen, it will perform poorly. You can also argue regularisation drives short description lengths, and thus generalising concepts. Whether a chunk of the training set is duplicated / similar is kind of irrelevant, because even if only 50% of the text is novel, the novel text still provides the incentive for robust concepts.
Models do see data more than once. Experimental testing shows that a certain amount of “hydration” (keeping extra copies of data that is already heavily duplicated in the training set) is beneficial to the resulting model. There are diminishing returns, and past the point where the model “overfits” and memorizes a data point it comes at the cost of validation loss, but in general, keeping a few extra copies of something that already has a lot of copies floating around actually helps.
(Edit: So you can train a model on fully deduplicated data, but it will actually generalize worse than the alternative.)
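(To be clear about what I mean by “deduplicated”: something like document-level dedup, sketched below with exact hashing as a toy illustration; real pipelines typically use fuzzy matching such as MinHash over n-gram shingles to catch the not-word-for-word duplicates mentioned upthread.)

```python
# Toy document-level exact dedup (illustration only, not a production pipeline).
import hashlib

def dedup_exact(docs):
    seen = set()
    unique = []
    for doc in docs:
        # Normalise whitespace so trivial formatting differences don't count
        # as distinct documents.
        key = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = [
    "Theorem 1. Every field is a ring.",
    "Theorem 1.  Every field is a ring.",  # whitespace-only variant
    "Exercise 2.",
]
print(len(dedup_exact(docs)))  # 2 -- the first two collapse to one copy
```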