Jan Betley comments on Subliminal Learning Across Models

Jan Betley 26 Nov 2025 21:36 UTC

> if the dataset is biased and many of these updates point in a loosely similar direction

The dataset might be “biased” in a way that corresponds to something in the real world. For example, tweed cloaks are more popular in the UK.

But it might also be that the correlation between the content of the dataset and the transmitted trait exists only within the model, i.e. it depends on the weight initialization and the training process. As I read it, the subliminal learning paper tries to show that this is indeed possible.

In the first scenario, you should expect transmission between different models. In the second, you shouldn’t.

So it feels like these are actually different mechanisms.