Interesting. Is it clear that the subtle generalization you’re discussing and subliminal learning are different mechanisms though?
If we assume that every token during SFT gives a tiny nudge in a random direction, then for a “regular” dataset, these nudges all more or less cancel out. But if the dataset is biased and many of these updates point in a loosely similar direction, then their sum adds up to a large vector. In the original subliminal learning, these nudges can only loosely correlate to the target concept due to the text being numbers. In our setting, the nudges only loosely correlate to the target concept because we filter out all the strong correlations. The main difference is that for our setting, the updates’ correlation to the target is consistent across models (which doesn’t seem to be the case when the data is constrained to be strings of numbers).
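The cancel-vs-accumulate intuition is easy to check numerically. Here is a toy sketch (my own illustration, not from either paper): updates are unit vectors in a high-dimensional space; in the "regular" case they are isotropic random directions, in the "biased" case each one gets a small consistent component (weight 0.1, an arbitrary choice) along a fixed target direction. The random sum grows like sqrt(n), the biased sum like n:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 10000  # hypothetical dimension and number of token updates

# "Regular" dataset: each token's nudge is an independent random direction.
random_nudges = rng.standard_normal((n, d))
random_nudges /= np.linalg.norm(random_nudges, axis=1, keepdims=True)

# "Biased" dataset: same random nudges, plus a small consistent component
# along a fixed target direction, then renormalized to unit length.
target = np.zeros(d)
target[0] = 1.0
biased_nudges = random_nudges + 0.1 * target
biased_nudges /= np.linalg.norm(biased_nudges, axis=1, keepdims=True)

norm_random = np.linalg.norm(random_nudges.sum(axis=0))  # grows ~ sqrt(n)
norm_biased = np.linalg.norm(biased_nudges.sum(axis=0))  # grows ~ 0.1 * n
print(norm_random, norm_biased)
```

With these numbers the biased sum comes out roughly an order of magnitude larger, even though each individual nudge only "loosely" points at the target.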
But it feels like the mechanism is consistent, no?
> if the dataset is biased and many of these updates point in a loosely similar direction
A dataset might be “biased” in a way that corresponds to something in the real world. For example, tweed cloaks are more popular in the UK.
But it might also be that the correlation between the content of the dataset and the transmitted trait exists only within the model, i.e. it depends on the weight initialization and the training process. To me, the subliminal learning paper tries to show that this is indeed possible.
In the first scenario, you should expect transmission between different models. In the second, you shouldn’t.
So it feels like these are actually different mechanisms.